Posted to microsoft.public.excel.programming
Zone[_3_]
Default Find Unique Records; Must Use Fuzzy Logic

Tom, so you test-drove it, huh? Four matching letters out of five would be
80%, so that seems reasonable. James
"Tom Ogilvy" wrote in message
...
MsgBox StrSimilar("Freud", "Frdue", True)
gives a 0.8 as well.

--
Regards,
Tom Ogilvy


"Zone" wrote:

Harlan Grove sent me this function years ago when I was trying to compare
two strings to see how similar they were. I've also included a simple sub to
run it. It returns a number from 0 to 1, depending on how similar it thinks
the strings are. It works well. The three arguments are the two strings to be
compared and a Boolean that tells it whether to match the whole string or
any part of it. I haven't used it in a while now, so I'd advise just
setting the 3rd argument to True for now. Hope this helps! James

Sub TrySimilar()
    MsgBox StrSimilar("Frued", "Freud", True)
End Sub

Function StrSimilar(s1 As String, s2 As String, FindWhole As Boolean) As Double
'function returns a numerical grade for 2 similar strings, 1.00 being perfect
'by H Grove
    Dim I As Long, j As Long, k As Long, n(2) As Long
    Dim c1 As String, c2 As String
    Const alphanum As String = "1234567890abcdefghijklmnopqrstuvwxyz "
    'normalize both strings: lowercase, map anything outside alphanum
    'to a space, then collapse whitespace with worksheet TRIM
    s1 = LCase(s1)
    For I = 1 To Len(s1)
        If Not InStr(alphanum, Mid(s1, I, 1)) > 0 Then Mid(s1, I, 1) = " "
    Next I
    s1 = Application.WorksheetFunction.Trim(s1)
    s2 = LCase(s2)
    For j = 1 To Len(s2)
        If Not InStr(alphanum, Mid(s2, j, 1)) > 0 Then Mid(s2, j, 1) = " "
    Next j
    s2 = Application.WorksheetFunction.Trim(s2)
    'pass 1: count characters of s1 found at or after a cursor in s2;
    'the cursor j advances one position per match
    j = 1
    n(1) = 0
    For I = 1 To Len(s1)
        c1 = LCase(Mid(s1, I, 1))
        k = 0
        Do
            c2 = LCase(Mid(s2, j + k, 1))
            k = k + 1
        Loop Until j + k > Len(s2) Or c1 = c2
        If c1 = c2 Then
            n(1) = n(1) + 1
            If j < Len(s2) Then j = j + 1 Else Exit For
        End If
    Next I
    'pass 2: the same count in the other direction
    I = 1
    n(2) = 0
    For j = 1 To Len(s2)
        c2 = LCase(Mid(s2, j, 1))
        k = 0
        Do
            c1 = LCase(Mid(s1, I + k, 1))
            k = k + 1
        Loop Until I + k > Len(s1) Or c1 = c2
        If c1 = c2 Then
            n(2) = n(2) + 1
            If I < Len(s1) Then I = I + 1 Else Exit For
        End If
    Next j
    If FindWhole Then
        StrSimilar = CDbl(Application.WorksheetFunction.Min(n(1), n(2))) _
            / CDbl(Application.WorksheetFunction.Max(Len(s1), Len(s2)))
    Else
        StrSimilar = CDbl(n(2)) / Len(s2)
    End If
End Function
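[For readers following along outside VBA, here is a minimal Python sketch of the same two-pass scoring. It is my translation, not Harlan's original; the names `str_similar` and `matched` are mine. It reproduces the thread's examples: both "Frued"/"Freud" and "Freud"/"Frdue" score 0.8.]

```python
import re

def str_similar(s1: str, s2: str, find_whole: bool = True) -> float:
    """Grade two similar strings from 0.0 to 1.0, 1.0 being a perfect match."""
    # Normalize: lowercase, map non-alphanumerics to spaces, collapse whitespace
    def clean(s: str) -> str:
        return " ".join(re.sub(r"[^0-9a-z ]", " ", s.lower()).split())
    s1, s2 = clean(s1), clean(s2)

    def matched(a: str, b: str) -> int:
        # Count characters of a found at or after a cursor in b;
        # the cursor advances only one position per match, as in the VBA
        j, n = 0, 0
        for c1 in a:
            k = j
            while k < len(b) and b[k] != c1:
                k += 1
            if k < len(b):              # found a match at or after the cursor
                n += 1
                if j < len(b) - 1:
                    j += 1
                else:
                    break               # cursor exhausted: stop counting
        return n

    n1, n2 = matched(s1, s2), matched(s2, s1)
    if find_whole:
        return min(n1, n2) / max(len(s1), len(s2))
    return n2 / len(s2)

print(str_similar("Frued", "Freud", True))  # 0.8, matching the thread
```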


"Tom Ogilvy" wrote in message
...
While you're waiting for a good answer, as an exercise -
mark the uniques and, for the duplicates, list which unique each one
duplicates:

Frehu
Freuh
Erueh
Fruit
Freh
Rfeuh
Ereuh
Feurh
Fruhe
Furth
Feruh
Ferhu
Fruh
Frueh
Fureh

Now write down the rules you used to do it - define "our purposes". Then
maybe someone will have some ideas.

--
Regards,
Tom Ogilvy


"ryguy7272" wrote:

My supervisor and I are trying to come up with a way of finding and
eliminating all duplicates from a large Excel worksheet which came from BCM
in Outlook. I am contemplating a few different ways of doing this. I found
a free utility on www.downloads.com that allows a user to do a 'complex
filter' to eliminate duplicates, but it's not really giving me the results I
am looking for. I sorted by company (Column F), then by last name (Column
D), and then by first name (Column B). I am now wondering if there is a way
to use some kind of fuzzy logic to search for values that are almost unique
and hide the remainder of the rows, or almost duplicate and hide the
remainder of the rows.

The issue is that Excel identifies lots of 'unique' records because it
treats two people with two different office addresses as two different
records, but for our purposes this is one contact. Similarly, a contact's
name could be spelled Freuh in the personal contacts part of BCM and Frueh
in the public contacts part of BCM. Again, these are two 'unique' records,
but again, for our purposes this is one contact. All company names are
listed in Column F, all last names are in Column D, and all first names are
in Column B. I would like to copy/paste all data on all rows with unique
records (F, D, & B) to a new sheet, or hide all rows with dupes. Any ideas?
Would Access be able to handle this?

Thanks, as always!
Ryan---




--
RyGuy
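[One way to act on the fuzzy-matching idea in this thread: build a combined first-name/last-name/company key per row and keep a row only if no already-accepted key is too similar. A minimal Python sketch using the standard library's difflib; the 0.8 threshold, the sample rows, and the column layout are my assumptions, not from the thread.]

```python
from difflib import SequenceMatcher

def is_dup(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two keys as duplicates when their similarity ratio is high enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Sample rows as (last name, first name, company) - Columns D, B, F
rows = [
    ("Freuh", "John", "Acme"),
    ("Frueh", "John", "Acme"),    # near-duplicate spelling of the first row
    ("Smith", "Jane", "Globex"),
]

unique = []
for last, first, company in rows:
    key = f"{first} {last} {company}"
    # Accept the row only if it doesn't fuzzily match an accepted key
    if not any(is_dup(key, f"{f} {l} {c}") for l, f, c in unique):
        unique.append((last, first, company))

print(unique)  # the Frueh row is dropped as a fuzzy duplicate of Freuh
```

Copying the surviving rows to a new sheet (or hiding the rest) is then a straightforward loop over `unique`.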