
Pre-processing multiple text files from a pdf using just pdftotext and sed in a bash script, if possible

I am using the Linux command pdftotext -layout *.pdf to extract text from some pdf files, for data mining. The resultant text files all reside in a single folder, but they need some pre-processing before they can be used.
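As far as I know, pdftotext takes one input PDF per invocation (an optional second argument names the output text file), so a glob that matches several PDFs is normally run through a loop. A minimal sketch, assuming all the PDFs sit in one directory; the function name extract_all is made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch: convert every PDF in a directory to a layout-preserving text file.
# extract_all is a hypothetical helper name, not part of pdftotext.
extract_all() {
    local dir="${1:-.}" pdf
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue                      # the glob matched nothing
        pdftotext -layout "$pdf" "${pdf%.pdf}.txt"     # foo.pdf -> foo.txt
    done
}
```

Calling extract_all with a directory path (or with no argument for the current directory) should leave one .txt file next to each .pdf file.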

Issues

Issue 1: The first value in each row of each file is a barcode, which is either a 13-digit GTIN code or a 5-digit PLU code. The problem is that the GTIN codes are followed by only a single space, which is hard to target with a script because the description field on the same row naturally also contains single spaces between words. I need to replace each run of 13 digits plus one space with the same 13 digits plus at least two spaces, so that a later pre-processing stage can replace every run of multiple spaces with a tab character.
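The substitution described in Issue 1 could be sketched with sed as follows. It assumes the barcode starts in the first column of the line, as in the sample rows; widen_gtin is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: widen the single space after a leading 13-digit GTIN to two spaces.
# widen_gtin is a hypothetical helper; it reads stdin and writes stdout.
widen_gtin() {
    # ^([0-9]{13}) anchors on exactly 13 digits at the start of the line;
    # 5-digit PLU rows do not match and pass through unchanged.
    sed -E 's/^([0-9]{13}) /\1  /'
}

printf '%s\n' '9415077026296 Pams BBQ Chicken Rice Crackers 100g   $1.19' | widen_gtin
```

Because the pattern is anchored to the start of the line, single spaces inside the description are left alone.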

Issue 2: The newlines are another problem. There are many blank lines between the data rows: sometimes a single blank line, sometimes two or more. I want to end up with no blank lines between the data rows, with each row terminated by a single newline character.
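Removing the blank lines in Issue 2 is a one-line sed filter. The sketch below also treats whitespace-only lines as blank, which is an assumption about the extracted text; drop_blank_lines is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: delete every empty or whitespace-only line, however many occur in a row.
# drop_blank_lines is a hypothetical helper; it reads stdin and writes stdout.
drop_blank_lines() {
    sed '/^[[:space:]]*$/d'
}

printf 'row one\n\n\nrow two\n \nrow three\n' | drop_blank_lines
```

Since each blank line is deleted on its own, it does not matter whether one, two, or more of them separate two data rows.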

Issue 3: The final resulting files each need to be tab separated value files, for importing into a spreadsheet. Some of the descriptions in the data rows may contain commas, so I am using TSV rather than CSV files. I only need a single tab between each value in the row.
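For Issue 3, collapsing each run of two or more spaces into one tab might look like the sketch below. The tab is produced with printf rather than written as \t in the sed script, since not every sed implementation interprets \t in the replacement; spaces_to_tab is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: turn every run of two or more spaces into a single tab character.
# spaces_to_tab is a hypothetical helper; it reads stdin and writes stdout.
spaces_to_tab() {
    local tab
    tab=$(printf '\t')              # a literal tab, independent of the sed flavour
    sed -E "s/ {2,}/${tab}/g"
}

printf '%s\n' '61424            Yoghurt Raisins kg   $23.90/kg' | spaces_to_tab
```

Single spaces inside the description survive, because the pattern requires at least two consecutive spaces.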

Sample rows

(I have replaced spaces with • and newlines with ¶ characters here for clarity.)

9415077026340•Pams•Sour•Cream•&•Chives•Rice•Crackers•100g•••$1.19¶
¶
¶
9415077026296•Pams•BBQ•Chicken•Rice•Crackers•100g•••$1.19¶
¶
61424••••••••••••Yoghurt•Raisins•kg•••$23.90/kg¶
¶
9415077036349•Pams•Sliced•Peaches•In•Juice•410g•••$1.29¶

Intended result

(I have also replaced tabs with ⇥ characters here for clarity.)

9415077026340⇥Pams•Sour•Cream•&•Chives•Rice•Crackers•100g⇥$1.19¶
9415077026296⇥Pams•BBQ•Chicken•Rice•Crackers•100g⇥$1.19¶
61424⇥Yoghurt•Raisins•kg⇥$23.90/kg¶
9415077036349⇥Pams•Sliced•Peaches•In•Juice•410g⇥$1.29¶

What have I tried?

I am slowly learning more about the various Linux scripting utilities such as sed / grep / awk / tr, etc. There are many solutions posted on Stack Overflow that resolve some of the issues I am facing, but they are disparate, and I find them confusing when I try to string them together in the way I need. Some are "close, but not quite" solutions, such as replacing all double newlines with a single newline, which still leaves an extra row between data rows that I don't need. I have tried several different options that are close to what I need. It would be helpful if someone could propose a solution that uses a single utility, such as sed, to solve all of the issues at once.
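One way the three steps might be combined into a single sed invocation is sketched below. It assumes each barcode starts in the first column and is either 13 or 5 digits long, as described in the issues; to_tsv is a hypothetical helper name, and the code has only been reasoned through against the sample rows, not against real pdftotext output.

```shell
#!/usr/bin/env bash
# Sketch: one sed call covering all three issues.
# to_tsv is a hypothetical helper; it reads stdin and writes stdout.
to_tsv() {
    local tab
    tab=$(printf '\t')
    # 1) drop blank or whitespace-only lines
    # 2) replace the space(s) after a leading 13- or 5-digit barcode with one tab
    # 3) replace every remaining run of two or more spaces with one tab
    sed -E \
        -e '/^[[:space:]]*$/d' \
        -e "s/^([0-9]{13}|[0-9]{5}) +/\1${tab}/" \
        -e "s/ {2,}/${tab}/g"
}

# Demonstration on the sample rows (quoted delimiter keeps the $ signs literal).
to_tsv <<'EOF'
9415077026340 Pams Sour Cream & Chives Rice Crackers 100g   $1.19


9415077026296 Pams BBQ Chicken Rice Crackers 100g   $1.19

61424            Yoghurt Raisins kg   $23.90/kg

9415077036349 Pams Sliced Peaches In Juice 410g   $1.29
EOF
```

Replacing the barcode's trailing space directly with a tab makes the intermediate "two spaces" step unnecessary. To process a whole folder, the function could be called once per file, for example: for f in *.txt; do to_tsv < "$f" > "${f%.txt}.tsv"; done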

Perhaps rquery ( https://github.com/fuyuncat/rquery/releases ) can help you.

[ rquery]$ cat samples/pdf.txt
9415077026340 Pams Sour Cream & Chives Rice Crackers 100g   $1.19


9415077026296 Pams BBQ Chicken Rice Crackers 100g   $1.19

61424            Yoghurt Raisins kg   $23.90/kg

9415077036349 Pams Sliced Peaches In Juice 410g   $1.29

[ rquery]$ tab=`echo -e "\t"`
[ rquery]$ ./rq -q "s replace(regreplace(@raw,'(^[\d]+)([ ]+)','\$1${tab}'),'   ','${tab}') | f strlen(trim(@raw))!=0" samples/pdf.txt
9415077026340   Pams Sour Cream & Chives Rice Crackers 100g     $1.19
9415077026296   Pams BBQ Chicken Rice Crackers 100g     $1.19
61424   Yoghurt Raisins kg      $23.90/kg
9415077036349   Pams Sliced Peaches In Juice 410g       $1.29


 