
Find and replace pdftotext generated image character in .txt file

I used pdftotext, called from PHP, to create a lot of .txt files from PDFs.

I called it like this, which works perfectly for all the text parts in all the files:

system("pdftotext -raw dir/$pdf_file 2>&1");

THE PROBLEM

However, in the new .txt files, all the images from the PDFs appear as:

  • 'FF' when opening the file in FTP
  • char '%0C' with urlencode in the browser (fopen)
  • an arrow up without urlencode (fopen)
  • ^L when using less on the command line (on CentOS 7), where even sed 's/^L//g' on a single file does not work.

So each of those views shows this weird char in a different way.

THE QUESTION

After trying a lot of code for a week, I am still looking for a way to find and delete this weird image char from all the .txt files.

Is there a solution for this?

Or, what is the smart thing to do here? Should I do this in a PHP script or on the command line? I am kind of lost on this one now.

By convention, FF in plain text output means Form Feed. It is a control code sent to the printer:

↑   decimal 12   column/row 00/12   octal 14   URL-encoded %0C   FF (Ctrl+L, shown as ^L)   FORM FEED (page break)

It marks the end of a page (a page eject), so you should see one at each division between pages.

There is a switch to exclude them, so try:
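To check that the odd character really is a form feed, you can count the 0x0C bytes in one of the generated files; the count should roughly match the number of pages. A minimal sketch, assuming a POSIX shell with tr and wc available (count_formfeeds is a hypothetical helper name, not part of pdftotext):

```shell
# Hypothetical helper: count the form feed bytes (0x0C) in one file.
count_formfeeds() {
    # tr -cd '\f' deletes everything except form feeds; wc -c counts what is left.
    # The final tr strips the padding some wc implementations add.
    tr -cd '\f' < "$1" | wc -c | tr -d ' '
}

# Usage (the path is only an example):
# count_formfeeds dir/some_file.txt
```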

system("pdftotext -raw -nopgbrk dir/$pdf_file 2>&1");
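For the .txt files that already exist, the form feeds can also be stripped afterwards on the command line. The sed attempt in the question most likely failed because a typed ^L is read as the regular expression "an L at the start of a line", not as the control character; with GNU sed, 's/\f//g' should match the form feed itself. Below is a minimal sketch using tr instead, assuming the files sit in one directory (strip_formfeeds is a hypothetical helper name):

```shell
# Hypothetical helper: remove every form feed (0x0C) from the .txt files in a directory.
strip_formfeeds() {
    dir="$1"
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue              # the glob matched nothing
        tmp="$(mktemp)" || return 1
        # tr -d '\f' drops form feed bytes and leaves all other bytes untouched.
        # Writing back with cat keeps the original file's permissions.
        tr -d '\f' < "$f" > "$tmp" && cat "$tmp" > "$f"
        rm -f "$tmp"
    done
}

# Usage (assumes the files live in ./dir, as in the question):
# strip_formfeeds dir
```

Try it on a copy of the directory first, since the files are rewritten in place.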
