I am using the Linux command pdftotext -layout *.pdf
to extract text from some PDF files for data mining. The resulting text files all reside in a single folder, but they need some pre-processing before they can be used.
Issue 1: The first value of each row in each file is a barcode, which is either a 13-digit GTIN code or a 5-digit PLU code. The problem is that the GTIN codes are separated from the description by a single space, which is hard to target with a script, because the description field itself naturally contains single spaces between words. Here I need to replace a run of 13 digits plus one space with the same 13 digits plus two (or more) spaces, so that a later pre-processing stage can replace every run of multiple spaces with a tab character.
Issue 2: Another problem I am facing is the newlines. There are many blank lines between data rows: sometimes a single blank line, sometimes two or more. I want to end up with no blank lines at all, with each data row terminated by a single newline character.
Issue 3: The resulting files each need to be tab-separated value (TSV) files, for importing into a spreadsheet. Some of the descriptions may contain commas, which is why I am using TSV rather than CSV. I only need a single tab between the values in each row.
Sample input (I have replaced spaces with • and newlines with ¶ characters here for clarity):
9415077026340•Pams•Sour•Cream•&•Chives•Rice•Crackers•100g•••$1.19¶
¶
¶
9415077026296•Pams•BBQ•Chicken•Rice•Crackers•100g•••$1.19¶
¶
61424••••••••••••Yoghurt•Raisins•kg•••$23.90/kg¶
¶
9415077036349•Pams•Sliced•Peaches•In•Juice•410g•••$1.29¶
Desired output (I have also replaced tabs with ⇥ characters here for clarity):
9415077026340⇥Pams•Sour•Cream•&•Chives•Rice•Crackers•100g⇥$1.19¶
9415077026296⇥Pams•BBQ•Chicken•Rice•Crackers•100g⇥$1.19¶
61424⇥Yoghurt•Raisins•kg⇥$23.90/kg¶
9415077036349⇥Pams•Sliced•Peaches•In•Juice•410g⇥$1.29¶
I am slowly learning more about the various Linux script utilities such as sed, grep, awk, and tr. There are many solutions posted on Stack Overflow that resolve some of the issues I am facing, but they are disparate and confusing when I attempt to string them together in the way that I need. Some are "close, but not quite" solutions, such as replacing all double newlines with a single blank line between data rows; I don't need that extra row. I have tried several different options that are close to what I need. It would be helpful if someone could propose a solution that uses a single utility, such as sed, to solve all of the issues at once.
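For reference, the three issues above can be sketched as a single sed pass. This is a sketch, not a definitive answer: it assumes GNU sed (which understands \t in the replacement text) and that every data row begins with a run of digits (the GTIN or PLU barcode). The input.txt sample below just recreates the rows from the question with plain spaces.

```shell
# Recreate the sample rows from the question (plain spaces instead of •)
cat > input.txt <<'EOF'
9415077026340 Pams Sour Cream & Chives Rice Crackers 100g   $1.19


9415077026296 Pams BBQ Chicken Rice Crackers 100g   $1.19

61424            Yoghurt Raisins kg   $23.90/kg

9415077036349 Pams Sliced Peaches In Juice 410g   $1.29
EOF

# One GNU sed pass over the extracted text:
sed -E '
  # Issue 2: delete blank (or whitespace-only) lines
  /^[[:space:]]*$/d
  # Issue 1: put a tab after the leading barcode, however many spaces follow it
  s/^([0-9]+) +/\1\t/
  # Issue 3: collapse any remaining run of two or more spaces to one tab
  s/ {2,}/\t/g
' input.txt > output.tsv
```

The ordering matters: the barcode rule runs before the multi-space rule, so a GTIN followed by only a single space still gets its tab, while the runs of two or more spaces before the price are collapsed afterwards. This would break if a description ever began with a bare number followed by a space.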
Perhaps rquery ( https://github.com/fuyuncat/rquery/releases ) can help you.
[ rquery]$ cat samples/pdf.txt
9415077026340 Pams Sour Cream & Chives Rice Crackers 100g $1.19
9415077026296 Pams BBQ Chicken Rice Crackers 100g $1.19
61424 Yoghurt Raisins kg $23.90/kg
9415077036349 Pams Sliced Peaches In Juice 410g $1.29
[ rquery]$ tab=`echo -e "\t"`
[ rquery]$ ./rq -q "s replace(regreplace(@raw,'(^[\d]+)([ ]+)','\$1${tab}'),' ','${tab}') | f strlen(trim(@raw))!=0" samples/pdf.txt
9415077026340 Pams Sour Cream & Chives Rice Crackers 100g $1.19
9415077026296 Pams BBQ Chicken Rice Crackers 100g $1.19
61424 Yoghurt Raisins kg $23.90/kg
9415077036349 Pams Sliced Peaches In Juice 410g $1.29
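For comparison, the same filter-and-reformat logic that the rquery command performs (drop blank lines, split off the leading barcode, collapse the remaining whitespace) can be sketched in portable POSIX awk. This is a sketch under the assumption that the price is always the last whitespace-free field on the row; the rows.txt sample is taken from the question.

```shell
# Recreate two sample rows from the question (plain spaces instead of •)
cat > rows.txt <<'EOF'
9415077026340 Pams Sour Cream & Chives Rice Crackers 100g   $1.19

61424            Yoghurt Raisins kg   $23.90/kg
EOF

# Blank lines have no fields (NF == 0) and are skipped. On data rows,
# $1 is the barcode, $NF is the price, and everything in between is
# rejoined with single spaces as the description.
awk 'NF >= 3 {
  desc = $2
  for (i = 3; i < NF; i++) desc = desc " " $i
  print $1 "\t" desc "\t" $NF
}' rows.txt > rows.tsv
```

Because awk splits fields on any run of whitespace, this version does not care whether the barcode is followed by one space (GTIN) or many (PLU).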