The Lounge

Fuzzy String Matching?

In this forum you can chat about anything you like.

Fuzzy String Matching?

Postby AndyO » Tue Nov 30, 2010 7:37 am

Not PIC-related but I'm hoping some of you clever people can help me out with this programming challenge. I need to find a way to compare strings and find "possible matches". The scenario is this:

I have a list of company names which have been entered with varying degrees of accuracy. I need to be able to take a name (which itself may have not been entered correctly) and see if that company is already in the list. Because of the data-entry errors, looking for an exact match won't work. I need to do a fuzzy search and return any potential matches so the user can select from a short-list of possible candidates.

The company names are generally, but not always, multiple words long - i.e. "Beijing New Century Property Management Company". The total list of names runs to no more than 5,000 entries.

The research I've done so far has turned up two possible approaches:

1) Use Soundex searches. (http://en.wikipedia.org/wiki/Soundex)
The idea would be to store the soundex numbers for each company name (word by word). When a new name comes in, convert it to soundex and then search for matches. Any company with more than x% of its words matching the name being searched is flagged as a possible match.

2) Use a Bitap algorithm. (http://en.wikipedia.org/wiki/Bitap_algorithm
This looks like it might be what I need but I'm struggling to find a decent explanation of how it works! Unfortunately, the example code given in the Wikipedia article is in C so I can't make head nor tail of it. Does anyone know where I could find a simple explanation? Is anyone able to convert the "Fuzzy Searching" C code in the Wikipedia article into pseudo-code so I can see what's going on?

Any help or advice greatly appreciated. I'm not married to either of these two approaches so if someone knows a better way then please let me know.

Cheers,
Andy
User avatar
AndyO
VIP
VIP
 
Posts: 238
Joined: May 2008
Location: Beijing, China
Has thanked: 18 times
Have thanks: 20 times

Re: Fuzzy String Matching?

Postby AndyO » Wed Dec 01, 2010 6:32 am

Following on from this I've done a bit more reading into soundex. I had thought it would help reduce missed matches caused by typos but I was wrong there. It seems that simple typos can cause a word's soundex code to be very different.

So that approach is out... Back to the drawing board.
Andy
User avatar
AndyO
VIP
VIP
 
Posts: 238
Joined: May 2008
Location: Beijing, China
Has thanked: 18 times
Have thanks: 20 times

Re: Fuzzy String Matching?

Postby jmessina » Wed Dec 01, 2010 9:33 am

It seems that simple typos can cause a word's soundex code to be very different.


I think that's true, Andy, but you're probably more interested in the soundex Difference() function to help determine how close a match there might be. It's all pretty subjective.

I've never done this, and supposedly there are better algorithms than soundex, but it doesn't sound too bad (no pun intended), and you've got to consider the horsepower available (are you talking about running this on a pic?)
Random avatar
jmessina
VIP
VIP
 
Posts: 877
Joined: June 2010
Has thanked: 7 times
Have thanks: 174 times

Re: Fuzzy String Matching?

Postby AndyO » Wed Dec 01, 2010 9:49 am

are you talking about running this on a pic?
Goodness, no! It would be running on a PC. I'm at the very early stages of thinking about this - I've not had sight of the actual data yet. Once I see what sort of state it's in then I should have a clearer idea of what the errors look like and how I might be able to handle them - if the typos are few and far between then perhaps soundex would be ok.
Andy
User avatar
AndyO
VIP
VIP
 
Posts: 238
Joined: May 2008
Location: Beijing, China
Has thanked: 18 times
Have thanks: 20 times

Re: Fuzzy String Matching?

Postby jmessina » Wed Dec 01, 2010 10:24 am

Goodness, no! It would be running on a PC...

Not PIC-related...


Guess I should have read a little closer... perhaps it's time to clean my glasses.

Keep us posted on how it goes. It'll be interesting to see what you end up with.
Random avatar
jmessina
VIP
VIP
 
Posts: 877
Joined: June 2010
Has thanked: 7 times
Have thanks: 174 times


Return to The Lounge