Home |
Search |
Today's Posts |
#1
![]()
Posted to microsoft.public.excel,microsoft.public.excel.programming
|
|||
|
|||
![]()
I recieve data every few weeks (about 65 files, each having up to 7000 or
so) that I need to evaluate. I want to work in Excel., but the data comes in a PDF format. I can not recieve data in anything but PDF (too simple, I know) The fields are fixed len, except the description field (second to last feild) Each record wraps around 2 or three lines I need to get each field into it's own cell. I probably could get it IF I can figure out how to get each record into it's own row.... I have tried to save to TEXT from adobe. then opened it up under word and tried to seperate at the records, but was not sucessful. Here is a sample of one of the files: http://www.johnsonclan.net/sample.pdf Any help would be greatly appreciated! Thanks, Mc |
#2
![]()
Posted to microsoft.public.excel.programming,microsoft.public.excel
|
|||
|
|||
![]()
Hi,
I looked up "PDF to Excel" on the Web and found the following site (among others). I haven't tried it but I hope it gives you a start. http://www.library.mcgill.ca/edrs/se...l#basicexports HTH "Mctabish" wrote: I recieve data every few weeks (about 65 files, each having up to 7000 or so) that I need to evaluate. I want to work in Excel., but the data comes in a PDF format. I can not recieve data in anything but PDF (too simple, I know) The fields are fixed len, except the description field (second to last feild) Each record wraps around 2 or three lines I need to get each field into it's own cell. I probably could get it IF I can figure out how to get each record into it's own row.... I have tried to save to TEXT from adobe. then opened it up under word and tried to seperate at the records, but was not sucessful. Here is a sample of one of the files: http://www.johnsonclan.net/sample.pdf Any help would be greatly appreciated! Thanks, Mc |
#3
![]()
Posted to microsoft.public.excel,microsoft.public.excel.programming
|
|||
|
|||
![]()
we could not save a copy of your file, but at least we could Select Text/
copy and paste into Excel Would this be sufficient for you please? "Mctabish" wrote in message . .. I recieve data every few weeks (about 65 files, each having up to 7000 or so) that I need to evaluate. I want to work in Excel., but the data comes in a PDF format. I can not recieve data in anything but PDF (too simple, I know) The fields are fixed len, except the description field (second to last feild) Each record wraps around 2 or three lines I need to get each field into it's own cell. I probably could get it IF I can figure out how to get each record into it's own row.... I have tried to save to TEXT from adobe. then opened it up under word and tried to seperate at the records, but was not sucessful. Here is a sample of one of the files: http://www.johnsonclan.net/sample.pdf Any help would be greatly appreciated! Thanks, Mc |
#4
![]()
Posted to microsoft.public.excel,microsoft.public.excel.programming
|
|||
|
|||
![]()
Further to our earlier response, we note the layout of the PDF report is not
consistent. Some places where zero entries are made as ".00" whereas some places there is no entries at all. If this inconsistence can be rectified, we believe the report can be converted to Excel with a little help from VBA please. Because MUN SHE CHE MHE tax have no entry, we have to adjust position of the data. But we trust the following conversion is correct. NAME JOHNSON KAREN G CO. 1 YR. 0 C/S# 15 ST.TAX 52.78 CO.TAX 354.03 MUN.TAX FEES 10 INTER 21.35 AD.COST 18 OTHER 15 TOTAL 471.16 ASSES.CL 2 MUN.CODE 2 A.V. 8120 SHE TAX CHE TAX MHE TAX OLD PARCEL NO. 0000000000000000000000 NEW PARCEL NO. 0121001430010320000100 DESCRIPTION LOT etc "Mctabish" wrote in message . .. I recieve data every few weeks (about 65 files, each having up to 7000 or so) that I need to evaluate. I want to work in Excel., but the data comes in a PDF format. I can not recieve data in anything but PDF (too simple, I know) The fields are fixed len, except the description field (second to last feild) Each record wraps around 2 or three lines I need to get each field into it's own cell. I probably could get it IF I can figure out how to get each record into it's own row.... I have tried to save to TEXT from adobe. then opened it up under word and tried to seperate at the records, but was not sucessful. Here is a sample of one of the files: http://www.johnsonclan.net/sample.pdf Any help would be greatly appreciated! Thanks, Mc |
#5
![]()
Posted to microsoft.public.excel,microsoft.public.excel.programming
|
|||
|
|||
![]()
Mctabish wrote...
I recieve data every few weeks (about 65 files, each having up to 7000 or so) that I need to evaluate. I want to work in Excel., but the data comes in a PDF format. I can not recieve data in anything but PDF (too simple, I know) .... Your best choices are either convincing whoever sends you these files to send .XLS or even plain text files rather than .PDF files or use better tools than Excel to deal with these files. The fields are fixed len, except the description field (second to last feild) Each record wraps around 2 or three lines I need to get each field into it's own cell. I probably could get it IF I can figure out how to get each record into it's own row.... I have tried to save to TEXT from adobe. then opened it up under word and tried to seperate at the records, but was not sucessful. .... There's a utility program named pdf2text that used to be available on Simtel. I haven't been able to find it there today, but it may be available from other sites (I haven't looked). If you can't find it, then you could consider GhostScript, which can convert PDF files to other formats. If these files have inconsistent fields in different records but a common record terminator or separator, you scripting languages like Perl or Python to parse the actual fields and use default values for any missing fields. You could also do this in Excel once you had the text from the .PDF files, but it's more difficult in Excel. |
#6
![]()
Posted to microsoft.public.excel,microsoft.public.excel.programming
|
|||
|
|||
![]()
Thanks for the suggestions. Currently, it looks like PDF2TEXT works the best
to get the data over to excel. but... Is there a way to combine all rows in a column that have data in them until the next empty row in that column? This code seems to work OK if I only have 2 rows of data: =IF(LEN(TRIM(C6))0,IF(LEN(TRIM(C7))0,C6 & " | " & C7,""),"") (first test the current row, if it is empty, do nothing... Then test the next row, if it has data, combine the current row, then add the "pipe" symbol as a seperator, the add the next row) But there are varible rows of data for each entry (MOST are 2 rows only, but there are a lot that have 3, and more rows) Thanks, Mc "Mctabish" wrote in message . .. I recieve data every few weeks (about 65 files, each having up to 7000 or so) that I need to evaluate. I want to work in Excel., but the data comes in a PDF format. I can not recieve data in anything but PDF (too simple, I know) The fields are fixed len, except the description field (second to last feild) Each record wraps around 2 or three lines I need to get each field into it's own cell. I probably could get it IF I can figure out how to get each record into it's own row.... I have tried to save to TEXT from adobe. then opened it up under word and tried to seperate at the records, but was not sucessful. Here is a sample of one of the files: http://www.johnsonclan.net/sample.pdf Any help would be greatly appreciated! Thanks, Mc |
#7
![]()
Posted to microsoft.public.excel,microsoft.public.excel.programming
|
|||
|
|||
![]()
"Mctabish" wrote...
Thanks for the suggestions. Currently, it looks like PDF2TEXT works the best to get the data over to excel. but... Good that you found pdf2text. Is there a way to combine all rows in a column that have data in them until the next empty row in that column? I've looked at your PDF file. If the converted text looked like the PDF file, there'd be blank lines between records which you could use as record separators, but they seem to be omitted from the converted text. If that's so, then your PDF file requires *you* manually indicating which lines begin new records. If this is from a governmental entity, and PDF files are all they provide, then you're very likely out of luck. Also the resulting text shows only a single space between all nonblank fields. While it'd be possible to deal with blank SHE/CHE/MHE tax fields, which precede the unmistakable parcel number fields, there's no simple way to deal with blank MUN tax fields. The closest you could come would be parsing out the available fields using formulas like A1: =LEFT(B!B1,FIND(" 1 ",B!B1)-1) B1: =MID(B!$B1,SUMPRODUCT(LEN($A1:A1)+1)+1, FIND(" ",MID(B!$B1&" ",SUMPRODUCT(LEN($A1:A1)+1)+1,1024))-1) Fill B1 right into C1:IV1. This will parse out each word or token in the description field. Then you'd need to construct records from the fields, making decisions about when MUN/SHE/CHE/MHE tax fields were blank. When would these tax fields be blank? If there's no pattern to it, then you have a mostly manual exercise on your hands. You *may* be better off using GhostScript or some other software to convert the PDF files to HTML if the resulting HTML files mostly preserve the PDF file formatting. |
#8
![]()
Posted to microsoft.public.excel,microsoft.public.excel.programming
|
|||
|
|||
![]()
All of the fields "should" be fixed len, except description. I do not care
about the CL CODE after description, so I could use that as part of description (or use right() to trim it off) There is a blank line between each record (I have cleaned off all of the page header info BTW), so what I want to do is to append all records that are between two blank lines, and then I can parse for the len of each field. I am 'trying' to make this happen with a few as possible human steps, as I have to do this every week, on about 60 plus files, some of the files being a few thousand records. You are right, these files ARE from a state goverenment agency, that is why I said that I have to work with the PDF to start with. Oh, BTW, I was mistaken.... I could NOT find PDF2TEST, it is PDF2TXT. Thanks, Mc "Harlan Grove" wrote in message ... "Mctabish" wrote... Thanks for the suggestions. Currently, it looks like PDF2TEXT works the best to get the data over to excel. but... Good that you found pdf2text. Is there a way to combine all rows in a column that have data in them until the next empty row in that column? I've looked at your PDF file. If the converted text looked like the PDF file, there'd be blank lines between records which you could use as record separators, but they seem to be omitted from the converted text. If that's so, then your PDF file requires *you* manually indicating which lines begin new records. If this is from a governmental entity, and PDF files are all they provide, then you're very likely out of luck. Also the resulting text shows only a single space between all nonblank fields. While it'd be possible to deal with blank SHE/CHE/MHE tax fields, which precede the unmistakable parcel number fields, there's no simple way to deal with blank MUN tax fields. The closest you could come would be parsing out the available fields using formulas like A1: =LEFT(B!B1,FIND(" 1 ",B!B1)-1) B1: =MID(B!$B1,SUMPRODUCT(LEN($A1:A1)+1)+1, FIND(" ",MID(B!$B1&" ",SUMPRODUCT(LEN($A1:A1)+1)+1,1024))-1) Fill B1 right into C1:IV1. This will parse out each word or token in the description field. Then you'd need to construct records from the fields, making decisions about when MUN/SHE/CHE/MHE tax fields were blank. When would these tax fields be blank? If there's no pattern to it, then you have a mostly manual exercise on your hands. You *may* be better off using GhostScript or some other software to convert the PDF files to HTML if the resulting HTML files mostly preserve the PDF file formatting. |
#9
![]()
Posted to microsoft.public.excel,microsoft.public.excel.programming
|
|||
|
|||
![]()
Bruce wrote...
All of the fields "should" be fixed len, except description. I do not care about the CL CODE after description, so I could use that as part of description (or use right() to trim it off) There is a blank line between each record (I have cleaned off all of the page header info BTW), so what I want to do is to append all records that are between two blank lines, and then I can parse for the len of each field. I am 'trying' to make this happen with a few as possible human steps, as I have to do this every week, on about 60 plus files, some of the files being a few thousand records. You are right, these files ARE from a state goverenment agency, that is why I said that I have to work with the PDF to start with. OK. Oh, BTW, I was mistaken.... I could NOT find PDF2TEST, it is PDF2TXT. There are a few products named pdf2text. I mean the one from http://alkaline.vestris.com/download/WinNT/pdf2text.zip which is a command line utility. That's what I've used to convert your PDF file into a text file. The result doesn't have blank lines between records, but I can work around that. If every record is preceded by a blank line, then insert a column to the left of the imported text, enter 1 in A1 (if you've stripped out the header lines, then row 1 should contain the first line of the first record). Then enter the following formula in A2. =--(TRIM(B1)="") and fill down as far as needed in col A, which should evaluate to 1 when the line is the first line of a record and 0 for subsequent lines. Then if this worksheet were named A, in another worksheet enter the following. A2: =MATCH(1,A!$A:$A,0) A3: =MATCH(1,INDEX(A!$A:$A,A2+1):INDEX(A!$A:$A,65536), 0)+A2 and fill down as far as needed. Then in B2:O2 you'd need formulas like B2: =TRIM(MID(INDEX(B!$B:$B,$A2),2,27)) C2: =TRIM(MID(INDEX(sample!$B:$B,$A2),30,1)) etc., and in P2:U2 formulas like P2: =TRIM(MID(INDEX(sample!$B:$B,$A2+1),1,11)) Q2: =TRIM(MID(INDEX(sample!$B:$B,$A2+1),13,10)) etc., and in V2 the formula V2: =TRIM(MID(INDEX(sample!$B:$B,$A2+1),93,40) &IF(A2+2<A3,INDEX(sample!$B:$B,$A2+2),"")) |
#10
![]()
Posted to microsoft.public.excel,microsoft.public.excel.programming
|
|||
|
|||
![]()
Thanks Harlan,
PDF2TXT, is win based (but seems to have a command line mode) It DOES leave in the blank lines, and appears to have all spaces etc Sorry for the name change.... I am having to use my son's system... I usually can figure out code f(for the most part) but I have more questions than answers here... I do have a few questions Thanks, Mc =--(TRIM(B1)="") - seems to work OK Then if this worksheet were named A, - renamed worksheet to "A" in another worksheet enter the following. - named worksheet to B A2: =MATCH(1,A!$A:$A,0) - in ws "B" A3: :$A,A2+1):INDEX(A!$A:$A,65536),0)+A2 - in WS "B" and fill down as far as needed. Then in B2:O2 you'd need formulas like B2: =TRIM(MID(INDEX(B!$B:$B,$A2),2,27)) - in WS "B" Also I looked up index, but could not quite make out the syntax (array/referance?) What do 2 and 27 represent? C2: =TRIM(MID(INDEX(sample!$B:$B,$A2),30,1)) - What is ws "SAMPLE" ? What do the 30 and 1 represent? etc., and in P2:U2 formulas like P2: =TRIM(MID(INDEX(sample!$B:$B,$A2+1),1,11)) - Again, if I knew what I was trying to do here (since you say "like" on the previous line, I take it I have to modify , but I am not sure... Q2: =TRIM(MID(INDEX(sample!$B:$B,$A2+1),13,10)) - again, what is the 13 and 10 represent etc., and in V2 the formula V2: =TRIM(MID(INDEX(sample!$B:$B,$A2+1),93,40) &IF(A2+2<A3,INDEX(sample!$B:$B,$A2+2),"")) |
#11
![]()
Posted to microsoft.public.excel,microsoft.public.excel.programming
|
|||
|
|||
![]()
I butchered the formulas, didn't I? Every instance of B! or sample!
should be A!, so the result worksheet formulas should be B2 (and similar for C2:O2): =TRIM(MID(INDEX(A!$B:$B,$A2),2,27)) P2 (and similar for Q2:U2): =TRIM(MID(INDEX(A!$B:$B,$A2+1),1,11)) and finally V2: =TRIM(MID(INDEX(A!$B:$B,$A2+1),93,40)&IF(A2+2<A3,I NDEX(A!$B:$B,$A2+2),"")) |
#12
![]()
Posted to microsoft.public.excel,microsoft.public.excel.programming
|
|||
|
|||
![]()
Thanks for all of the assistance!
I tried what you had suggested (strictly cut and paste - as I am not familiar with the INDEX ) but it did not work for me... Here is what I did, and it is working! Column a - pdf paste - the dat from PDF2TXT Column b- paste of row (_value_ of column "d") I use this to re assemble after sorting on column A to remove the page headers) Column c - =IF(LEN(TRIM(A3))0,1,0) - assigns a 0 or a 1 if the record is blank or not Column d - row() used to keep track of rows for column b and also used in column d to see what the next blank is, from the current row Column e - Test to see how many lines in each "group" =IF(C3=1,VLOOKUP(0,C3:D$570,2,FALSE)-ROW(),"") ; 570 is how many rows I have in may test sample) Column f - combined for E rows- =IF(LEN(TRIM(E5))=0,IF(E6=5,A6 & A7 & A8 & A9 & A10,IF(E6= 4, A6 & A7 & A8 & A9,IF(E6= 3, A6 & A7 & A8,IF(E6= 2, A6 & A7,A6 )))),"") - this assumes that there is a maximum of 5 rows per entry. I saw a max of 4 in my test sample..... I then just needed to parse using the mid() function! Might not be the cleanest or the fastest, but it works! and it is one heck of a lot better that what I have been doing! Now, I need to find a way to combine about 60 or so PDFS into one or two files (just to make this process easier....) I am a happy Camper! Again. Thanks for the assistance! Mc "Harlan Grove" wrote in message oups.com... I butchered the formulas, didn't I? Every instance of B! or sample! should be A!, so the result worksheet formulas should be B2 (and similar for C2:O2): =TRIM(MID(INDEX(A!$B:$B,$A2),2,27)) P2 (and similar for Q2:U2): =TRIM(MID(INDEX(A!$B:$B,$A2+1),1,11)) and finally V2: =TRIM(MID(INDEX(A!$B:$B,$A2+1),93,40)&IF(A2+2<A3,I NDEX(A!$B:$B,$A2+2),"")) |
Reply |
Thread Tools | Search this Thread |
Display Modes | |
|
|
![]() |
||||
Thread | Forum | |||
Sorting challange | Excel Worksheet Functions | |||
Give RELEVANT responses to questions. DO NOT give usless list | Excel Worksheet Functions | |||
challange | Excel Discussion (Misc queries) | |||
Avoiding a circular reference or value error while trying to calculate commission - a challange for me! | Excel Programming | |||
a CHALLANGE using OR in an array formula | Excel Worksheet Functions |