Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 161
Default Efficient storage

I would like to find all the unique words in a text file and store them
in an Excel 2010 spreadsheet. I am looking for a reasonably
efficient method or algorithm to achieve this.

I was thinking, how about save all the words that begin with "A"
in the A column, all the "B" words get placed in the B column,
etc...

Also, each word should only be stored ONCE in the spreadsheet,
regardless of lowercase, uppercase, mixed, etc. So, if my text
file contains "apple", "APPLE", or "aPPle", only one instance of
"apple" should appear under the A column.

I'd appreciate anything to help me get started. Thank you.

Robert



  #2   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 3,872
Default Efficient storage

Hi Robert,

Am Thu, 7 May 2015 02:45:29 -0700 schrieb Robert Crandal:

Also, each word should only be stored ONCE in the spreadsheet,
regardless of lowercase, uppercase, mixed, etc. So, if my text
file contains "apple", "APPLE", or "aPPle", only one instance of
"apple" should appear under the A column.


the following code stores all unique words in lowercase in column A.
You can sort and move the words if you want:

Public objFSO As New Filesystemobject

Sub Test()
Dim strPath As String
Dim strFN As String
Dim objReadFile As Object, myDic As Object
Dim strText As String
Dim varChr As Variant, varTmp As Variant, varOut As Variant
Dim i As Long

'Modify path and file name
strPath = "C:\Users\Claus\Desktop\"
strFN = "Test.txt"
varChr = Array(",", ".", ":", "-", "_", "!", "?", "(", ")", Chr(10))
Set myDic = CreateObject("Scripting.Dictionary")

Set objReadFile = objFSO.opentextfile _
(strPath & strFN, forreading, tristatefalse)

Do While objReadFile.atendofstream < True
strText = strText & objReadFile.readline & " "
Loop
objReadFile.Close

For i = LBound(varChr) To UBound(varChr)
strText = Replace(strText, varChr(i), "")
Next

strText = Left(strText, Len(strText) - 1)
varTmp = Split(strText, " ")
For i = LBound(varTmp) To UBound(varTmp)
varTmp(i) = LCase(varTmp(i))
Next

For i = LBound(varTmp) To UBound(varTmp)
myDic(varTmp(i)) = varTmp(i)
Next
varOut = myDic.Items
Range("A1").Resize(UBound(varOut) + 1) = Application.Transpose(varOut)
End Sub


Regards
Claus B.
--
Vista Ultimate / Windows7
Office 2007 Ultimate / 2010 Professional
  #3   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 3,872
Default Efficient storage

Hi Robert,

Am Thu, 7 May 2015 12:30:24 +0200 schrieb Claus Busch:

the following code stores all unique words in lowercase in column A.
You can sort and move the words if you want:


to distribute the words in alphabetic order try:

Public objFSO As New Filesystemobject

Sub Test()
Dim strPath As String
Dim strFN As String
Dim objReadFile As Object, myDic As Object
Dim strText As String
Dim varChr As Variant, varTmp As Variant, varOut As Variant
Dim i As Long, LRow As Long, n As Long

Application.ScreenUpdating = False

'Modify path and file name
strPath = "C:\Users\Claus\Desktop\"
strFN = "Test.txt"
varChr = Array(",", ".", ":", "-", "_", "!", "?", "(", ")", Chr(10))
Set myDic = CreateObject("Scripting.Dictionary")

Set objReadFile = objFSO.opentextfile _
(strPath & strFN, forreading, tristatefalse)

Do While objReadFile.atendofstream < True
strText = strText & objReadFile.readline & " "
Loop
objReadFile.Close

For i = LBound(varChr) To UBound(varChr)
strText = Replace(strText, varChr(i), " ")
Next

strText = Application.Trim(strText)
varTmp = Split(strText, " ")
For i = LBound(varTmp) To UBound(varTmp)
varTmp(i) = LCase(varTmp(i))
Next

For i = LBound(varTmp) To UBound(varTmp)
myDic(varTmp(i)) = varTmp(i)
Next
varOut = myDic.Items
Range("A1") = "Header"
Range("A2").Resize(UBound(varOut) + 1) = Application.Transpose(varOut)
LRow = Cells(Rows.Count, 1).End(xlUp).Row

n = 2
For i = 97 To 122
Range("A1:A" & LRow).AutoFilter field:=1, Criteria1:="=" & Chr(i) &
"*"
If Application.Subtotal(3, Range("A:A")) 1 Then
Range("A2:A" & LRow).Copy Cells(1, n)
n = n + 1
End If
Next
Columns("A").Delete
Application.ScreenUpdating = True
End Sub


Regards
Claus B.
--
Vista Ultimate / Windows7
Office 2007 Ultimate / 2010 Professional
  #4   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 3,872
Default Efficient storage

Hallo Robert,

Am Thu, 7 May 2015 12:56:09 +0200 schrieb Claus Busch:

For i = LBound(varTmp) To UBound(varTmp)
varTmp(i) = LCase(varTmp(i))
Next


you don't need to set each word in the array to LCase. You can set the
whole string to LCase at once:

strText = Application.Trim(strText)
strText = LCase(strText)
varTmp = Split(strText, " ")


Regards
Claus B.
--
Vista Ultimate / Windows7
Office 2007 Ultimate / 2010 Professional
  #5   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 161
Default Efficient storage

"Claus Busch" wrote:

Set myDic = CreateObject("Scripting.Dictionary")


BTW, I never knew about this "Scripting Dictionary".
Is this an object that can store an entire library of words?

If so, do you think I would be better off storing all of my
unique words in the dictionary object, rather than saving them
in columns on the spreadsheet?

If I am able to store all my unique words in a memory object,
I could then iterate through the object and output all the words
to a text file. That might even be better for me.







  #6   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 3,872
Default Efficient storage

Hi Robert,

Am Thu, 7 May 2015 11:20:35 -0700 schrieb Robert Crandal:

"Claus Busch" wrote:

Set myDic = CreateObject("Scripting.Dictionary")


BTW, I never knew about this "Scripting Dictionary".
Is this an object that can store an entire library of words?


the Scripting.Dictionary is my favorite to create unique items.

With

For i = LBound(varTmp) To UBound(varTmp)
myDic(varTmp(i)) = varTmp(i)
Next
varOut = myDic.Items

you have all your unique words in varOut.


Regards
Claus B.
--
Vista Ultimate / Windows7
Office 2007 Ultimate / 2010 Professional
  #7   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 161
Default Efficient storage

"Claus Busch" wrote:

Public objFSO As New Filesystemobject


This line gives me an error. Here is the error message:

Compile error: User-defined type not defined

Am I missing something here?

Thanks Claus



  #8   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 3,872
Default Efficient storage

Hi Robert,

Am Thu, 7 May 2015 11:08:48 -0700 schrieb Robert Crandal:

Public objFSO As New Filesystemobject


This line gives me an error. Here is the error message:


you have to activate the reference to "Microsoft Scripting Runtime"


Regards
Claus B.
--
Vista Ultimate / Windows7
Office 2007 Ultimate / 2010 Professional
  #9   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 1,182
Default Efficient storage

"Claus Busch" wrote:

Public objFSO As New Filesystemobject


This line gives me an error. Here is the error message:

Compile error: User-defined type not defined

Am I missing something here?

Thanks Claus


You need to set a project 'Reference' to the Scripting runtime for
using this 'early binding' method! This is okay for development but my
*preference* is to use 'late binding' for final release...

In the General Declarations section of a standard module:
Public oFSO As Object

In the main startup procedure (ie: "Auto_Open") or an "InitGlobals"
procedure called by your main startup procedure...

Set oFSO = CreateObject("Scripting.FileSystemObject")

...to make oFSO available to your project during runtime.

In the shutdown procedure (ie: "Auto_Close") release the memory...

Set oFSO = Nothing

--
Garry

Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup!
comp.lang.basic.visual.misc
microsoft.public.vb.general.discussion


  #10   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 3,872
Default Efficient storage

Hi Robert,

Am Thu, 7 May 2015 11:08:48 -0700 schrieb Robert Crandal:

Public objFSO As New Filesystemobject


here is a revised version:

Sub Test()
Dim strPath As String
Dim strFN As String
Dim objReadFile As Object, myDic As Object, objFSO As Object
Dim strText As String
Dim varChr As Variant, varTmp As Variant, varOut As Variant
Dim i As Long, LRow As Long, n As Long

Application.ScreenUpdating = False

'Modify path and file name
strPath = "C:\Users\Claus\Desktop\"
strFN = "Test.txt"
'Array with the expected punctuation marks
varChr = Array(",", ".", ":", "-", "_", "!", "?", "(", ")", "'",
Chr(10))

Set myDic = CreateObject("Scripting.Dictionary")
Set objFSO = CreateObject("Scripting.FileSystemObject")

'Opens the text file and read the text into a string
Set objReadFile = objFSO.opentextfile(strPath & strFN)

Do While objReadFile.atendofstream < True
strText = strText & objReadFile.readline & " "
Loop
objReadFile.Close

'Deletes all punctuation marks
For i = LBound(varChr) To UBound(varChr)
strText = Replace(strText, varChr(i), " ")
Next

'Handles special cases like I'll, it's and so on
'Customize it depending on your text file
strText = Replace(Replace(Replace(Replace(Replace(strText, " s ", " is
"), " ll ", _
" will "), " d ", " had "), " re ", " are "), " ve ", " have ")

'Deletes superfluous spaces
strText = Application.Trim(strText)
'Change all words to lower case
strText = LCase(strText)
'Writes the string into an array
varTmp = Split(strText, " ")

'Creates unique words
For i = LBound(varTmp) To UBound(varTmp)
myDic(varTmp(i)) = varTmp(i)
Next
varOut = myDic.Items

'Writes the unique words to column A with a header for filtering
Range("A1") = "Header"
Range("A2").Resize(UBound(varOut) + 1) = Application.Transpose(varOut)
LRow = Cells(Rows.Count, 1).End(xlUp).Row

'Distrubutes the words in alphabetic order to columns
n = 2
For i = 97 To 122
Range("A1:A" & LRow).AutoFilter field:=1, Criteria1:="=" & Chr(i) &
"*"
If Application.Subtotal(3, Range("A:A")) 1 Then
Range("A2:A" & LRow).Copy Cells(1, n)
n = n + 1
End If
Next
Columns("A").Delete
Columns("A:Z").AutoFit
Application.ScreenUpdating = True
End Sub


Regards
Claus B.
--
Vista Ultimate / Windows7
Office 2007 Ultimate / 2010 Professional


  #11   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 161
Default Efficient storage

"Claus Busch" wrote:

here is a revised version:

Sub Test()
' [code snipped]
End Sub


Thanks Claus. This is a very good start. It worked
nicely for small input files. I will probably need to
make some modifications so it takes numeric words
into considersation, such as "10", "100", etc...

But, it works great so far.





  #12   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 3,872
Default Efficient storage

Hi Robert,

This is a very good start. It worked
nicely for small input files. I will probably need to
make some modifications so it takes numeric words
into considersation, such as "10", "100", etc...


try it with Regular Expressions:

Sub Test2()
Dim strPath As String
Dim strFN As String
Dim objReadFile As Object, myDic As Object, objFSO As Object, re As
Object
Dim strText As String
Dim varText() As Variant
Dim i As Long, LRow As Long, n As Long
Dim ptrn, Match, Matches

Application.ScreenUpdating = False

'Modify path and file name
strPath = "C:\Users\Claus\Desktop\"
strFN = "Test.txt"

Set myDic = CreateObject("Scripting.Dictionary")
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set re = CreateObject("vbscript.regexp")

'Opens the text file and read the text into a string
Set objReadFile = objFSO.opentextfile(strPath & strFN)

Do While objReadFile.atendofstream < True
strText = strText & objReadFile.readline & " "
Loop
objReadFile.Close

strText = Application.Trim(LCase(strText))

ptrn = "\w+"
re.Pattern = ptrn
re.IgnoreCase = False
re.Global = True
Set Matches = re.Execute(strText)
ReDim Preserve varText(Matches.Count - 1)
For Each Match In Matches
varText(n) = CStr(Match.Value)
n = n + 1
Next
Range("A:A").NumberFormat = "@"
Range("A1") = "Header"
Range("A2").Resize(UBound(varText) + 1) = _
Application.Transpose(varText)
LRow = Cells(Rows.Count, 1).End(xlUp).Row
'Distrubutes the words in alphabetic order to columns
n = 2
For i = 97 To 122
Range("A1:A" & LRow).AutoFilter Field:=1, Criteria1:="=" & Chr(i) &
"*"
If Application.Subtotal(3, Range("A:A")) 1 Then
Range("A2:A" & LRow).Copy Cells(1, n)
n = n + 1
End If
Next
For i = 48 To 57
Range("A1:A" & LRow).AutoFilter Field:=1, Criteria1:="=" & Chr(i) &
"*"
If Application.Subtotal(3, Range("A:A")) 1 Then
Range("A2:A" & LRow).Copy Cells(1, n)
n = n + 1
End If
Next

Columns("A").Delete
Columns("A:AJ").AutoFit
Application.ScreenUpdating = True
End Sub


Regards
Claus B.
--
Vista Ultimate / Windows7
Office 2007 Ultimate / 2010 Professional
Reply
Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules

Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
XML Data Storage Rookie_User Excel Discussion (Misc queries) 0 June 30th 08 07:11 PM
Persistent Storage for Add-Ins ? Brian Herbert Withun Excel Programming 2 January 25th 08 02:24 PM
cell value storage Leo Rod Excel Programming 2 August 30th 07 03:47 PM
Data Storage gregory.barrett Excel Discussion (Misc queries) 3 March 30th 06 10:23 PM
Variable Storage gti_jobert[_10_] Excel Programming 3 February 8th 06 11:28 AM


All times are GMT +1. The time now is 03:28 AM.

Powered by vBulletin® Copyright ©2000 - 2025, Jelsoft Enterprises Ltd.
Copyright ©2004-2025 ExcelBanter.
The comments are property of their posters.
 

About Us

"It's about Microsoft Excel"