View Single Post
  #7   Report Post  
Posted to microsoft.public.excel.programming
SauQ SauQ is offline
external usenet poster
 
Posts: 13
Default Pattern match and grouped items

Hi Ron,
I have tried your sub on my other longer list and it works like charm.
The results spanned slightly over a 100 lines, compared to a few
thousand lines that I am stuck with earlier on,
I' ll say your solution is a major improvement in efficiency though I
still have to pick up the ones I want manually.
My zillion thanks to you. Heartfelt appreciation for your effort and
time.
SauQ

Ron Rosenfeld wrote:
On Sun, 5 Jul 2009 19:47:52 -0700 (PDT), SauQ wrote:

Hi Ron

The purpose of this process is to come up with a total cost for all
similar items (e.g the total cost of printers, regardless of brand -
Epson/Panasonic).

The problem is to come up with a list of keywords of items, as there
are a few thousand rows of different items with different brands (and
many are overlapping matches as you have pointed out). I do not know
entirely what are the items contain in the list.

I was hoping that if excel could highlight / identify these items
(say, keyword=printer) which are similar then I could use the
SUMPRODUCT function to total up all cost based on the keywords.

But then again, excel obviously would not know how to distinguish
between an item (handphone, printer, cond, etc) from brands (Epson,
panasonic, etc).

I reckon the best way is to get excel to list all duplicate words
(regardless if its item / brand), then I will manually pick out the
items from the list.

I wonder if there is a better way around this.
Many thanks again for your effort and time in looking at this.
SauQ



Ron Rosenfeld wrote:
On Thu, 2 Jul 2009 21:23:54 -0700 (PDT), SauQ wrote:

Dear Ron, you are absolutely correct, silly of me not to notice there
are other duplicate words (Nokia, Panasonic, Epson) as well.

In this case, i need excel to highlight rows with certain % of words
matched. Maybe an option to choose / select the % to be highlighted
(e.g: above 50% match found is to be highlighted).

Many thanks again,
SauQ

I think you need a better description of what might constitute a valid match;
and also rules for what should happen when the same line matches other lines,
but with different words (e.g. Epson and printer have overlapping matchings).

Perhaps it would be best to back up and look at the purpose of this process.
For example, if you are trying to find how many entries there are that refer
to, let us say, "phones", it would probably be a lot simpler to develop a list
of keywords that you would search for.

For example, if you decided that your list of key words was

phone
printer
cond

you could enter those words in a series of cells, and then just check the data
against those cells to see what's where. You'd still have the problem of
multiple matchings, but you could probably flag the entry somehow.
--ron



I think you will have to develop your keyword list manually. There are
routines that can generate a list of individual words, sorted either
alphabetically or by word count. Then you can manually decide which you want
to use as keywords (perhaps with wild cards) in a SUMPRODUCT or SUMIF formula.

Given your initial example:

YORK A/COND
SANYO 1.0HP air-cond
CANON laser printer (model : L-P'TER*LBP-3050)
EPSON - Batching system printer
EPSON - Dot matrix printer
Nokia 1200 - handphone
HP printer
Panasonic K Printer
Panasonic X printer
Panasonic P-P1121 printer
Nokia N6070 handphone
Nokia 1200 handphone
Microphone
Panasonic printer
2nd hand hand phone


And defining a "word" as having only letters, digits or hyphens; and also being
at least two characters in length, here is a list of individual words sorted by
frequency (descending):

printer 8
Panasonic 4
Nokia 3
handphone 3
COND 2
EPSON 2
1200 2
hand 2
YORK 1
SANYO 1
0HP 1
air-cond 1
CANON 1
laser 1
model 1
L-P 1
Batching 1
system 1
Dot 1
matrix 1
HP 1
P-P1121 1
N6070 1
Microphone 1
2nd 1
phone 1

And here is a list sorted alphabetically (ascending):

0HP 1
1200 2
2nd 1
air-cond 1
Batching 1
CANON 1
COND 2
Dot 1
EPSON 2
hand 2
handphone 3
HP 1
laser 1
L-P 1
matrix 1
Microphone 1
model 1
N6070 1
Nokia 3
Panasonic 4
phone 1
P-P1121 1
printer 8
SANYO 1
system 1
YORK 1

There are instructions in the macro posted below for how to change from numeric
to alpha sorting; and also how to change from ascending to descending sorts.

Also, in the macro, as of now, the data source (rSrc) is set to Selection; and
the output destination is set to start at C1 (rDest). These can be changed or
set up in different ways.

Maybe this will help.

To enter this Macro (Sub), <alt-F11 opens the Visual Basic Editor.
Ensure your project is highlighted in the Project Explorer window.
Then, from the top menu, select Insert/Module and
paste the code below into the window that opens.

To use this Macro (Sub):
Select your data range (unless you have modified the Sub to do it
automatically).
Then <alt-F8 opens the macro dialog box. Select the macro by name, and
<RUN.

===================================
Option Explicit
Option Compare Text
Sub UniqueWordList()
Dim rSrc As Range, rDest As Range, c As Range
Dim cWordList As Collection
Dim res() As Variant
Dim w() As String
Dim i As Long

Set cWordList = New Collection
Set rSrc = Selection
Set rDest = Range("C1")
rDest.EntireColumn.NumberFormat = "@"
For Each c In rSrc
w = Split(c.Value)
For i = 0 To UBound(w)
w(i) = StripWord(w(i))
If Not w(i) = "" Then
On Error Resume Next
cWordList.Add Item:=w(i), Key:=w(i)
On Error GoTo 0
End If
Next i
Next c

'transfer words to results array
ReDim res(1 To cWordList.Count, 0 To 1)
For i = 1 To cWordList.Count
res(i, 0) = cWordList(i)
Next i

'get counts
For i = LBound(res) To UBound(res)
For Each c In rSrc
res(i, 1) = res(i, 1) + CountWord(c.Value, res(i, 0))
Next c
Next i

'sort alpha: d=0; sort numeric d=1
'there are various ways of sorting
BubbleSort res, 0

rDest.CurrentRegion.Clear
For i = LBound(res) To UBound(res)
rDest.Offset(i, 0).NumberFormat = "@"
rDest.Offset(i, 0).Value = res(i, 0)
'For just lowercase output, use:
'rDest.Offset(i, 0).Value = LCase(res(i, 0))
rDest.Offset(i, 1).Value = res(i, 1)
Next i
End Sub

Private Function StripWord(s As String) As String
Dim re As Object, mc As Object, m As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Global = True
re.Pattern = "[-\w]{2,}"
If re.test(s) = True Then
Set mc = re.Execute(s)
StripWord = mc(0).Value
End If
Set re = Nothing
End Function

Private Function CountWord(ByVal s As String, sPat) As Long
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.Global = True
re.IgnoreCase = True
re.Pattern = "\b" & sPat & "\b"

Set mc = re.Execute(s)
CountWord = mc.Count
End Function

Private Sub BubbleSort(TempArray As Variant, d As Long) 'd is 0 based dimension
Dim temp(0, 1) As Variant
Dim i As Integer
Dim NoExchanges As Integer

' Loop until no more "exchanges" are made.
Do
NoExchanges = True

' Loop through each element in the array.
For i = LBound(TempArray) To UBound(TempArray) - 1

' If the element is less than the element
' following it, exchange the two elements.
' change "<" to "" to sort ascending
If TempArray(i, d) TempArray(i + 1, d) Then
NoExchanges = False
temp(0, 0) = TempArray(i, 0)
temp(0, 1) = TempArray(i, 1)
TempArray(i, 0) = TempArray(i + 1, 0)
TempArray(i, 1) = TempArray(i + 1, 1)
TempArray(i + 1, 0) = temp(0, 0)
TempArray(i + 1, 1) = temp(0, 1)

End If
Next i
Loop While Not (NoExchanges)
End Sub
================================================== ===
--ron