Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 22
Default text compare, percentage of similar words

Good question. I probably should buy a book or take a class on natural
language processing. But until I can do that, here is what I have so
far. I added a penalty for strings that have many of the same words
but are significantly different in length.

Private Function RateLabel(str1 As String, str2 As String) As Integer
Dim i As Integer, j As Integer, intMatch As Integer, intRating As
Integer, str1Array, str2Array, strTemp
If str1 < "" And str2 < "" Then
If StrComp(str1, str2, vbTextCompare) = 0 Then
RateLabel = 100
Else
str1 = Application.WorksheetFunction.Trim(str1)
str2 = Application.WorksheetFunction.Trim(str2)
str1Array = Split(str1, " ")
str2Array = Split(str2, " ")
'make sure str1Array is always the smaller of the two
If UBound(str1Array) UBound(str2Array) Then
strTemp = str1Array
str1Array = str2Array
str2Array = strTemp
End If
'count words and determine % that match
For i = 0 To UBound(str1Array)
For j = 0 To UBound(str2Array)
If StrComp(str1Array(i), str2Array(j),
vbTextCompare) = 0 Then
intMatch = intMatch + 1
Exit For
End If
Next j
Next i
intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8
* (UBound(str1Array) - UBound(str2Array)))
If intRating 0 Then RateLabel = intRating Else RateLabel
= 0
End If
End If
MsgBox "str1 = " & str1 & ", str2 = " & str2 & ", rating = " &
RateLabel
End Function

  #2   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 163
Default text compare, percentage of similar words

Hi,

google for Levenshtein distance.

There are other algorithms,
which may return a correlation coefficient
that matches better the intuition of an ordinary language user,
one of these algorithms by me,
but far too complicated and far too slow. :-(

--
Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"


  #4   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 22
Default text compare, percentage of similar words

For your original example, 50% of the words match (3 of 6). It doesn't
catch "play" and "character" because they both appear once with a
question mark attached and once without. So I need to improve how it
handles punctuations.

Anyway, 50 would be the value (50% of the words match), but since the
strings are different lengths, it discounts the score via the line:

intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8 *
(UBound(str1Array) - UBound(str2Array)))

Although I made a mistake in this line, which is why you get 58 instead
of 42. It should read:

intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8 *
(UBound(str2Array) - UBound(str1Array)))

Thanks for pointing that out.

To give you some background, I have several tables of data I'm trying
to merge, but the labels are sometimes different even though it's
really the same line. The data itself is similar: close, but not
always the same. Since there are thousands of tables, I need a way for
the computer to decide whether the lines in two tables really refers to
the same thing. So my approach has been to rate how close the line in
one table is to the line in another (both label text and table data),
so that they can be merged without having to do it manually.

  #5   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 5,651
Default text compare, percentage of similar words

On 28 Jan 2006 15:20:37 -0800, wrote:

For your original example, 50% of the words match (3 of 6). It doesn't
catch "play" and "character" because they both appear once with a
question mark attached and once without. So I need to improve how it
handles punctuations.

Anyway, 50 would be the value (50% of the words match), but since the
strings are different lengths, it discounts the score via the line:

intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8 *
(UBound(str1Array) - UBound(str2Array)))

Although I made a mistake in this line, which is why you get 58 instead
of 42. It should read:

intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8 *
(UBound(str2Array) - UBound(str1Array)))

Thanks for pointing that out.

To give you some background, I have several tables of data I'm trying
to merge, but the labels are sometimes different even though it's
really the same line. The data itself is similar: close, but not
always the same. Since there are thousands of tables, I need a way for
the computer to decide whether the lines in two tables really refers to
the same thing. So my approach has been to rate how close the line in
one table is to the line in another (both label text and table data),
so that they can be merged without having to do it manually.


OK, well here is a routine that tells you how many of the words in the shorter
of two strings will match a word in the longer of the two strings.

It ignores punctuation and extra spaces; and it is also written to be
case-insensitive (although that is easily changed, if you wish).

It uses "regular expressions" to do the parsing and counting and comparing.

Because I am more familiar with it, I have used the regular expression
implementation available in Longre's free morefunc.xll add-in, available at
http://xcell05.free.fr/.

You could set a reference to Microsoft VBScript Regular Expressions 5.5 in your
VBE, but I'm not as familiar with its usage.

I output RateLabel as a decimal so I have set its type to Double.

In my original set of strings, where, ignoring case and punctuation, there were
5/6 matches in the second string, the output is 0.833333333

I leave it to you to make the adjustments you wish for text strings of
different lengths, if you choose to use this method.

======================================
Option Explicit

Function RateLabel(str1 As String, str2 As String) As Double
'fraction of words in shorter string that are also found in longer string

Dim l1 As Long, l2 As Long
Dim t As Long, v As Long
Dim a() As Long
Dim x, y

'count the number of words in each string
l1 = Run([regex.count], str1, "\w+")
l2 = Run([regex.count], str2, "\w+")


ReDim a(1 To IIf(l1 < l2, l1, l2))

For t = 1 To UBound(a)
a(t) = t
Next t

'comparison done with words (no punctuation) and case-insensitive

If l2 < l1 Then
x = Run([regex.mid], str2, "\w+", a)
y = Run([regex.find], str1, x, , False)
Else
x = Run([regex.mid], str1, "\w+", a)
y = Run([regex.find], str2, x, , False)
End If

If IsArray(y) Then
For t = 1 To UBound(y)
If y(t) 0 Then v = v + 1
Next t
RateLabel = v / UBound(y)
Else
'if shorter string only a single word, then no array is formed
RateLabel = IIf(y 0, 1, 0)
End If
MsgBox "str1 = " & str1 & ", str2 = " & str2 & ", rating = " & RateLabel
End Function
===============================


--ron


  #6   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 22
Default text compare, percentage of similar words

Thanks Helmut. That is a great start. I need to find something a
little more advanced than Levenshtein, so that "What are you?" and "You
are what?" get a higher score. But a great start, thanks. If you have
any more recommendations on search/language processing, I would love to
look into them.

  #7   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 163
Default text compare, percentage of similar words

Googling for

string similarity

returns thousands of hits.

--
Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"
Reply
Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules

Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
How to compare similar entries, NOT duplicated ones? Serapk New Users to Excel 3 January 7th 09 03:51 PM
Find Similar Words In An Excel Document Troop New Users to Excel 6 June 8th 08 11:04 PM
How do I compare similar data in two separate workbooks? Bill Ridgeway Excel Discussion (Misc queries) 0 February 18th 07 09:33 PM
Compare 2 similar excel worksheets Nadia Excel Worksheet Functions 3 July 28th 06 02:20 PM


All times are GMT +1. The time now is 03:46 PM.

Powered by vBulletin® Copyright ©2000 - 2025, Jelsoft Enterprises Ltd.
Copyright ©2004-2025 ExcelBanter.
The comments are property of their posters.
 

About Us

"It's about Microsoft Excel"