![]() |
fuzzy match
Somebody posted something like this last week and it was never solved. Is
this too much for Excel? A1 thru An contain the names of companies. B1 thru Bn contain another list of companies. Each list was created by different people at different times from different sources. Most of the names in the first list should be in the second list. Is it possible to match the names to get something showing the % match? Maybe counting the number of consecutive letters that match even if they're in a different position or something. Or maybe ignore things like "Inc." So the end result would look like this: Company_list_1 Company_list_2 Matches with % AT&T AT&T B2 100% IBM Wal-Mart B4 100% WalMart IBM, Inc B3 90% Sears Ford Ford Motors Southwest Airlines B4 40% |
fuzzy match
This is tough because of the different ways that different (and even the
same) people do type the same thing in.........it will take some creative effort........you can use the Edit Replace feature to eliminate things like the "Inc.", by just replacing them with nothing (leave that window blanik)........you can show the number of characters in each cell by =LEN(A1) and copying it down....you can also use the Data TextToColumns SpaceDelimited to separate out each term into it's own column, and the LEFT and RIGHT and VLOOKUP formulas might help separate things out further, and even AutoFilter using "contains" as a Custom Selection might help..........there's no magic bullet for this kind of thing, and one must have the data in front of them to do much good, and even then, it's hit and miss.......... Also, maybe there is a Street Address column or phone number column or Vendor number column that might have the same numbers in it for the same company no matter how the company name is spelled hth Vaya con Dios, Chuck, CABGx3 "Marcy" wrote: Somebody posted something like this last week and it was never solved. Is this too much for Excel? A1 thru An contain the names of companies. B1 thru Bn contain another list of companies. Each list was created by different people at different times from different sources. Most of the names in the first list should be in the second list. Is it possible to match the names to get something showing the % match? Maybe counting the number of consecutive letters that match even if they're in a different position or something. Or maybe ignore things like "Inc." So the end result would look like this: Company_list_1 Company_list_2 Matches with % AT&T AT&T B2 100% IBM Wal-Mart B4 100% WalMart IBM, Inc B3 90% Sears Ford Ford Motors Southwest Airlines B4 40% |
fuzzy match
Ok, just building on Chuck's comments, you could do Edit | Replace
repeatedly to get rid of spaces and punctuation symbols like hyphens (better to copy column A to a new column B and carry out these actions on B and C, so that you have the original column A to fall back on). You can convert all characters to upper case with =UPPER(B1) etc. You can sort both columns B and C independently to get all names beginning with "A" etc. together. Scanning through B and C, you can insert a cell in column C to move the rest of the cells in that column down whenever you come across an entry in B which is not in C (and vice-versa, though you have to insert a cell in both A and B to keep them lined up). There is no easy formulaic way to achieve this. Hope this helps. Pete |
fuzzy match
You can try the find function and use the asterisk wild card character as the
first and last entry to do the closest thing to a "like" search. *Find Text* -- Kevin Backmann "Marcy" wrote: Somebody posted something like this last week and it was never solved. Is this too much for Excel? A1 thru An contain the names of companies. B1 thru Bn contain another list of companies. Each list was created by different people at different times from different sources. Most of the names in the first list should be in the second list. Is it possible to match the names to get something showing the % match? Maybe counting the number of consecutive letters that match even if they're in a different position or something. Or maybe ignore things like "Inc." So the end result would look like this: Company_list_1 Company_list_2 Matches with % AT&T AT&T B2 100% IBM Wal-Mart B4 100% WalMart IBM, Inc B3 90% Sears Ford Ford Motors Southwest Airlines B4 40% |
fuzzy match
These might get you close to what you want.
=TEXT(SUMPRODUCT(ROW($A$2:$A$6),--(((SUBSTITUTE(A2,"-","")=SUBSTITUTE($B$2:$B$6,"-",""))+IF(ISERROR(FIND(A2,$B$2:$B$6)),0,1)+IF(ISER ROR(FIND($B$2:$B$6,A2)),0,1))0)),"B#;;") copy it into the cell and confirm it with ctrl+shift+enter. Brackets will automatically be entered to show it is an array function. This function basically does three comparisons. 1. Compares the A2 (minus "-") with each item in B (minus "-"). Example: Walmart and Wal-Mart. 2. Compares A2 to see if any part of each item in B contains A2. Example: IBM and IBM, Inc. 3. Compares each item of B to see if it is part of A2. Example: Ford Motors and Ford =IF(C2="","",MIN(LEN(A2)/LEN(INDIRECT(C2)),LEN(INDIRECT(C2))/LEN(A2))) this simply compares the length of the two names (slightly different than what you have). They result in something like this... Company_list_1 Company_list_2 Matches_With % AT&T AT&T B2 100% IBM Wal-Mart B4 33% WalMart IBM, Inc. B3 88% Sears Ford Ford Motors Southwest Airlines B5 36% If you have more than two items that are close to each other, then the output might be incorrect. Your problem is with the versatility you want not with excel. You need to figure out exactly what constitutes a match, exactly what can be ignored. It would be a lot easier if you could do a find and replace, and remove anything you want ignored. For instance enter - in the find (or ", inc.") and nothing in the replace, and then hit replace all. This might not be exactly what you want, but I hope it will get you close to it. "Marcy" wrote: Somebody posted something like this last week and it was never solved. Is this too much for Excel? A1 thru An contain the names of companies. B1 thru Bn contain another list of companies. Each list was created by different people at different times from different sources. Most of the names in the first list should be in the second list. Is it possible to match the names to get something showing the % match? Maybe counting the number of consecutive letters that match even if they're in a different position or something. Or maybe ignore things like "Inc." So the end result would look like this: Company_list_1 Company_list_2 Matches with % AT&T AT&T B2 100% IBM Wal-Mart B4 100% WalMart IBM, Inc B3 90% Sears Ford Ford Motors Southwest Airlines B4 40% |
fuzzy match
Your problem is of course impossible to solve perfectly. However, you might
consider using a Soundex algorithm when the issue is misspelling. It takes a character string and returns a value of the sound of that word in English. It would not be a complete solution, but might be part of a larger solution. Here's a Soundex implementation in VBA: http://j-walk.com/ss/excel/tips/tip77.htm Bill ----------------------------- Marcy wrote: Somebody posted something like this last week and it was never solved. Is this too much for Excel? A1 thru An contain the names of companies. B1 thru Bn contain another list of companies. Each list was created by different people at different times from different sources. Most of the names in the first list should be in the second list. Is it possible to match the names to get something showing the % match? Maybe counting the number of consecutive letters that match even if they're in a different position or something. Or maybe ignore things like "Inc." So the end result would look like this: Company_list_1 Company_list_2 Matches with % AT&T AT&T B2 100% IBM Wal-Mart B4 100% WalMart IBM, Inc B3 90% Sears Ford Ford Motors Southwest Airlines B4 40% |
All times are GMT +1. The time now is 07:18 PM. |
Powered by vBulletin® Copyright ©2000 - 2025, Jelsoft Enterprises Ltd.
ExcelBanter.com