
Ghostscript to remove only colored text from a PDF

I am in the process of reading PDF files. I would like to remove any colored text (i.e. leave only black text and images). I tried Ghostscript:

gs -o no-more-texts.pdf -sDEVICE=pdfwrite -dFILTERTEXT   Original.pdf

It successfully deletes all text and leaves the images intact. How can I modify the gs command to remove only colored text (red, blue, ...)?

If there are other modules that are able to do this, I am open to suggestions.

The device which does this doesn't have that capability, so you can't modify the Ghostscript command line to do what you want.

There are three ways you can tackle this in Ghostscript:

  • You could modify the PDF interpreter, which is written in PostScript.
  • You could modify the pdfwrite device, which is written in C.
  • You could modify the filtering device, which is also written in C.

There are some points you need to consider no matter which tool you use. Firstly, what exactly do you mean by 'coloured text' or 'black text and images'?

The PDF specification allows colour to be specified in a wide variety of colour spaces: DeviceGray, DeviceRGB, DeviceCMYK, Lab, CalGray, CalRGB, ICCBased, Separation and DeviceN. In addition there are Indexed colour spaces, which may have a base space of any of the previous spaces, and Pattern colour spaces.

What are you going to consider 'black' in each of those spaces? Obviously DeviceGray is easy: 0 is black and anything else is a shade of gray. But what about RGB? Are you only going to consider 0,0,0 as black? What if it's an ICCBased space?
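To make that question concrete, here is a minimal Python sketch of one possible policy for the three device spaces. The function name `is_black`, the `tol` parameter and the choice to treat any CMYK colour with full K as black are all assumptions made for illustration, not something Ghostscript provides. Other spaces (ICCBased, Separation and so on) are deliberately refused, because deciding them needs the profile or the alternate space.

```python
from typing import Sequence


def is_black(space: str, components: Sequence[float], tol: float = 0.0) -> bool:
    """Return True if the colour counts as 'black' under this sketch's policy.

    `tol` is a hypothetical tolerance: 0.0 accepts only exact black, a small
    positive value also accepts very dark colours.
    """
    if space == "DeviceGray":
        (gray,) = components           # 0 is black, 1 is white
        return gray <= tol
    if space == "DeviceRGB":
        r, g, b = components           # additive: all channels near 0
        return max(r, g, b) <= tol
    if space == "DeviceCMYK":
        c, m, y, k = components        # subtractive: 1 means full ink
        # Policy assumption: full K is black whatever C, M and Y are
        # (this includes "rich black"); a black built from C+M+Y alone
        # is NOT recognised by this rule.
        return k >= 1.0 - tol
    # ICCBased, Lab, Separation, DeviceN, Indexed, Pattern ... cannot be
    # judged from the component values alone.
    raise ValueError(f"no 'black' policy defined for colour space {space!r}")


if __name__ == "__main__":
    print(is_black("DeviceRGB", [0.0, 0.0, 0.0]))
    print(is_black("DeviceRGB", [0.02, 0.02, 0.02], tol=0.05))
    print(is_black("DeviceCMYK", [0.0, 0.0, 0.0, 1.0]))
```

Even this small function forces a decision for every space, which is the point: whichever tool does the filtering needs an explicit rule like this.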

Text can have two colours, a stroke colour and a fill colour, and they can be specified differently. They can even be specified in different colour spaces. You need to think about how you plan to handle that.
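As an illustration of the fill/stroke point, the following Python sketch walks an already-decoded content stream and records which fill and stroke colour is in force each time text is shown. It is a simplified sketch under stated assumptions: the names `TOKEN`, `COLOUR_OPS` and `text_colours` are made up for this example, the tokenizer ignores nested parentheses, inline images and dictionaries, and only the device colour operators (`g`/`G`, `rg`/`RG`, `k`/`K`) plus `q`/`Q` are tracked. The general operators (`cs`, `scn` and friends), the `'` and `"` text operators, and the text rendering mode (`Tr`), which decides whether fill, stroke or both are actually painted, are not handled.

```python
import re

# Simplified tokenizer for a decoded content stream.
TOKEN = re.compile(
    r"\((?:[^()\\]|\\.)*\)"          # literal string (no nested parentheses)
    r"|\[[^\]]*\]"                   # array, e.g. the operand of TJ
    r"|<[0-9A-Fa-f\s]*>"             # hex string
    r"|/[^\s/\[\]()<>]+"             # name, e.g. /F1
    r"|[-+]?(?:\d+\.?\d*|\.\d+)"     # number
    r"|[A-Za-z][A-Za-z0-9]*\*?"      # operator
)

# operator -> (which colour it sets, colour space, number of operands)
COLOUR_OPS = {
    "g": ("fill", "DeviceGray", 1), "G": ("stroke", "DeviceGray", 1),
    "rg": ("fill", "DeviceRGB", 3), "RG": ("stroke", "DeviceRGB", 3),
    "k": ("fill", "DeviceCMYK", 4), "K": ("stroke", "DeviceCMYK", 4),
}


def text_colours(stream: str):
    """Return (text operand, fill colour, stroke colour) for each Tj/TJ."""
    # A page starts with both colours set to black in DeviceGray.
    state = {"fill": ("DeviceGray", (0.0,)), "stroke": ("DeviceGray", (0.0,))}
    saved = []      # graphics state stack for q / Q
    operands = []   # operands collected since the last operator
    shown = []
    for tok in TOKEN.findall(stream):
        if tok[0] in "+-.0123456789":
            operands.append(float(tok))
            continue
        if tok[0] in "([</":
            operands.append(tok)
            continue
        # Anything else is an operator.
        if tok == "q":
            saved.append(dict(state))
        elif tok == "Q" and saved:
            state = saved.pop()
        elif tok in COLOUR_OPS:
            which, space, n = COLOUR_OPS[tok]
            state[which] = (space, tuple(operands[-n:]))
        elif tok in ("Tj", "TJ") and operands:
            shown.append((operands[-1], state["fill"], state["stroke"]))
        operands.clear()
    return shown


if __name__ == "__main__":
    sample = "BT /F1 12 Tf 1 0 0 rg (Red) Tj 0 g 0 0 1 RG (Outlined) Tj ET"
    for text, fill, stroke in text_colours(sample):
        print(text, "fill:", fill, "stroke:", stroke)
```

In the sample, the second string has a black fill but a blue stroke, so whether it counts as 'coloured' depends on how it is rendered and on the rule you choose.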

