Home |
Search |
Today's Posts |
|
#1
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
Good question. I probably should buy a book or take a class on natural
language processing. But until I can do that, here is what I have so far. I added a penalty for strings that have many of the same words but are significantly different in length. Private Function RateLabel(str1 As String, str2 As String) As Integer Dim i As Integer, j As Integer, intMatch As Integer, intRating As Integer, str1Array, str2Array, strTemp If str1 < "" And str2 < "" Then If StrComp(str1, str2, vbTextCompare) = 0 Then RateLabel = 100 Else str1 = Application.WorksheetFunction.Trim(str1) str2 = Application.WorksheetFunction.Trim(str2) str1Array = Split(str1, " ") str2Array = Split(str2, " ") 'make sure str1Array is always the smaller of the two If UBound(str1Array) UBound(str2Array) Then strTemp = str1Array str1Array = str2Array str2Array = strTemp End If 'count words and determine % that match For i = 0 To UBound(str1Array) For j = 0 To UBound(str2Array) If StrComp(str1Array(i), str2Array(j), vbTextCompare) = 0 Then intMatch = intMatch + 1 Exit For End If Next j Next i intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8 * (UBound(str1Array) - UBound(str2Array))) If intRating 0 Then RateLabel = intRating Else RateLabel = 0 End If End If MsgBox "str1 = " & str1 & ", str2 = " & str2 & ", rating = " & RateLabel End Function |
#2
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
Hi,
google for Levenshtein distance. There are other algorithms, which may return a correlation coefficient that matches better the intuition of an ordinary language user, one of these algorithms by me, but far too complicated and far too slow. :-( -- Greetings from Bavaria, Germany Helmut Weber, MVP WordVBA Win XP, Office 2003 "red.sys" & Chr$(64) & "t-online.de" |
#3
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]() |
#4
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
For your original example, 50% of the words match (3 of 6). It doesn't
catch "play" and "character" because they both appear once with a question mark attached and once without. So I need to improve how it handles punctuations. Anyway, 50 would be the value (50% of the words match), but since the strings are different lengths, it discounts the score via the line: intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8 * (UBound(str1Array) - UBound(str2Array))) Although I made a mistake in this line, which is why you get 58 instead of 42. It should read: intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8 * (UBound(str2Array) - UBound(str1Array))) Thanks for pointing that out. To give you some background, I have several tables of data I'm trying to merge, but the labels are sometimes different even though it's really the same line. The data itself is similar: close, but not always the same. Since there are thousands of tables, I need a way for the computer to decide whether the lines in two tables really refers to the same thing. So my approach has been to rate how close the line in one table is to the line in another (both label text and table data), so that they can be merged without having to do it manually. |
#6
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
Thanks Helmut. That is a great start. I need to find something a
little more advanced than Levenshtein, so that "What are you?" and "You are what?" get a higher score. But a great start, thanks. If you have any more recommendations on search/language processing, I would love to look into them. |
#7
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
Googling for
string similarity returns thousands of hits. -- Greetings from Bavaria, Germany Helmut Weber, MVP WordVBA Win XP, Office 2003 "red.sys" & Chr$(64) & "t-online.de" |
Reply |
Thread Tools | Search this Thread |
Display Modes | |
|
|
![]() |
||||
Thread | Forum | |||
How to compare similar entries, NOT duplicated ones? | New Users to Excel | |||
Find Similar Words In An Excel Document | New Users to Excel | |||
How do I compare similar data in two separate workbooks? | Excel Discussion (Misc queries) | |||
Compare 2 similar excel worksheets | Excel Worksheet Functions |