ExcelBanter

ExcelBanter (https://www.excelbanter.com/)
-   Excel Discussion (Misc queries) (https://www.excelbanter.com/excel-discussion-misc-queries/)
-   -   Extracting specific data from inconsistant multiple word documents (https://www.excelbanter.com/excel-discussion-misc-queries/214404-extracting-specific-data-inconsistant-multiple-word-documents.html)

Matt Bennette[_2_]

Extracting specific data from inconsistant multiple word documents
 
Hi Guys,

I have about a 1000 word documents and need to extract all the email
addresses from them. Ideally the result would place the filename in column A
and the email address in column B. None of the documents are consistant in
format. I have got as far as listing the file names.

Many thanks.

Ron Rosenfeld

Extracting specific data from inconsistant multiple word documents
 
On Mon, 22 Dec 2008 01:47:01 -0800, Matt Bennette
wrote:

Hi Guys,

I have about a 1000 word documents and need to extract all the email
addresses from them. Ideally the result would place the filename in column A
and the email address in column B. None of the documents are consistant in
format. I have got as far as listing the file names.

Many thanks.


There are ways of recognizing email addresses within text strings. For
example, depending on your data, you could look for the "@" character. Or
perhaps you could look for a string in which the "@" is followed by one or more
characters; then a ".", then a few more characters.

--ron

Matt Bennette[_2_]

Extracting specific data from inconsistant multiple word docum
 
Hi Ron

I am pretty new to VBA, but was hoping that I could engineer a script that
would produce a spreadsheet with the file name in column A and the Email
address in Column B

This is what I am currently using to get the filenames

Function GetFileList(FileSpec As String) As Variant
' Returns an array of filenames that match FileSpec
' If no matching files are found, it returns False

Dim FileArray() As Variant
Dim FileCount As Integer
Dim FileName As String

On Error GoTo NoFilesFound

FileCount = 0
FileName = Dir(FileSpec)
If FileName = "" Then GoTo NoFilesFound

' Loop until no more matching files are found
Do While FileName < ""
FileCount = FileCount + 1
ReDim Preserve FileArray(1 To FileCount)
FileArray(FileCount) = FileName
FileName = Dir()
Loop
GetFileList = FileArray
Exit Function

' Error handler
NoFilesFound:
GetFileList = False
End Function


Sub test()
Dim p As String, x As Variant

p = "C:\Documents and Settings\Administrator\Desktop\Chefs\*.*"
x = GetFileList(p)

Select Case IsArray(x)
Case True 'files found
MsgBox UBound(x)
Sheets("Sheet1").Range("A:A").Clear
For i = LBound(x) To UBound(x)
Sheets("Sheet1").Cells(i, 1).Value = x(i)
Next i
Case False 'no files found
MsgBox "No matching files"
End Select
End Sub

Which I confess was donated by one of the fab subscribers here

The strings would be of differing lengths, so how would wild card work is
there a specific syntax i.e *?*@*?*.*?*. And what about VBA referencing Word
Commands to perform this

Sorry I'm a complete novice and would really appreciate some help on this

Many thanks

"Ron Rosenfeld" wrote:

On Mon, 22 Dec 2008 01:47:01 -0800, Matt Bennette
wrote:

Hi Guys,

I have about a 1000 word documents and need to extract all the email
addresses from them. Ideally the result would place the filename in column A
and the email address in column B. None of the documents are consistant in
format. I have got as far as listing the file names.

Many thanks.


There are ways of recognizing email addresses within text strings. For
example, depending on your data, you could look for the "@" character. Or
perhaps you could look for a string in which the "@" is followed by one or more
characters; then a ".", then a few more characters.

--ron


Ron Rosenfeld

Extracting specific data from inconsistant multiple word docum
 
On Mon, 22 Dec 2008 08:15:01 -0800, Matt Bennette
wrote:

The strings would be of differing lengths, so how would wild card work is
there a specific syntax i.e *?*@*?*.*?*. And what about VBA referencing Word
Commands to perform this


I don't know about "VBA referencing Word Commands"

The following uses Regular Expressions:

Is there only one email address per file? Or could there be multiple email
addresses in a file.

Here is a routine which, when applied against a text string of unspecified
length, will return the first string that looks like an email address. The
pattern does NOT contain all the rules for validating an email address, but
perhaps this method will get you started. Post back with more questions as
needed.

=====================================
Option Explicit
Public Const str As String = "now is the time for "
'--------------------------------------

Sub ExtrEmail()
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Pattern = "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
If re.test(str) = True Then
Set mc = re.Execute(str)
Debug.Print mc(0).Value
End If
End Sub
==================================
--ron

Matt Bennette[_2_]

Extracting specific data from inconsistant multiple word docum
 
In general, there is only one email address, however I dont understand how I
can apply this code to multiple word documents.

Ideally I need the result to list the file name and the extracted email
address.

"Ron Rosenfeld" wrote:

On Mon, 22 Dec 2008 08:15:01 -0800, Matt Bennette
wrote:

The strings would be of differing lengths, so how would wild card work is
there a specific syntax i.e *?*@*?*.*?*. And what about VBA referencing Word
Commands to perform this


I don't know about "VBA referencing Word Commands"

The following uses Regular Expressions:

Is there only one email address per file? Or could there be multiple email
addresses in a file.

Here is a routine which, when applied against a text string of unspecified
length, will return the first string that looks like an email address. The
pattern does NOT contain all the rules for validating an email address, but
perhaps this method will get you started. Post back with more questions as
needed.

=====================================
Option Explicit
Public Const str As String = "now is the time for "
'--------------------------------------

Sub ExtrEmail()
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Pattern = "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
If re.test(str) = True Then
Set mc = re.Execute(str)
Debug.Print mc(0).Value
End If
End Sub
==================================
--ron



All times are GMT +1. The time now is 09:51 PM.

Powered by vBulletin® Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
ExcelBanter.com