Not PIC-related but I'm hoping some of you clever people can help me out with this programming challenge. I need to find a way to compare strings and find "possible matches". The scenario is this:
I have a list of company names which have been entered with varying degrees of accuracy. I need to be able to take a name (which itself may have not been entered correctly) and see if that company is already in the list. Because of the data-entry errors, looking for an exact match won't work. I need to do a fuzzy search and return any potential matches so the user can select from a short-list of possible candidates.
The company names are generally, but not always, multiple words long - i.e. "Beijing New Century Property Management Company". The total list of names runs to no more than 5,000 entries.
The research I've done so far has turned up two possible approaches:
1) Use Soundex searches. (http://en.wikipedia.org/wiki/Soundex)
The idea would be to store the soundex numbers for each company name (word by word). When a new name comes in, convert it to soundex and then search for matches. Any company with more than x% of its words matching the name being searched is flagged as a possible match.
2) Use a Bitap algorithm. (http://en.wikipedia.org/wiki/Bitap_algorithm
This looks like it might be what I need but I'm struggling to find a decent explanation of how it works! Unfortunately, the example code given in the Wikipedia article is in C so I can't make head nor tail of it. Does anyone know where I could find a simple explanation? Is anyone able to convert the "Fuzzy Searching" C code in the Wikipedia article into pseudo-code so I can see what's going on?
Any help or advice greatly appreciated. I'm not married to either of these two approaches so if someone knows a better way then please let me know.