
Detect position of watermark in a PDF

I am on Ubuntu.

I have a PDF file whose pages are divided into a grid. Each block of the grid contains the name, age, date of birth and photo of a candidate. Some records carry a "disqualified" watermark.

I need to scrape this PDF and put the disqualified candidates in a separate list. Using pyPdf I was able to extract the individual records, but the result also includes the watermarked candidates.

How can I detect the watermark? If I can get the coordinates of the watermark, how do I match them with a candidate?

I am open to solutions other than Python and pyPdf.

(Actually this is not an answer but merely an analysis too big for a comment.)

I don't know pyPdf (or any Python PDF classes) myself, but here is how the watermark is created for a sample entry; based on this, anyone who knows pyPdf well enough may more easily advise.

The Roundup

Depending on how pyPdf (or other Python PDF classes) allows access to the page content, there are two basic approaches:

  1. If the class returns information on content (text and images) in the order of the page content stream: the watermark image xobject is referenced right before the data of the entry. Thus, any entry preceded by the drawing of an image xobject is marked.

  2. If the information is not returned in the order of the page content stream, coordinate comparison must be used, which in itself is quite straightforward. In that case it may be of interest that the images are inserted while a [0.1 0 0 0.1 0 0] transformation matrix is in effect, whereas the text is drawn with an identity transformation matrix.
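For what it's worth, the first approach can be sketched without any PDF library, assuming the page content stream has already been decompressed into text by whatever tool is at hand. The function name `text_blocks` and the simplified tokenizer are made up for illustration; the tokenizer ignores hex strings, inline images and nested parentheses, and the default `/R17` is simply the xobject name from the sample page.

```python
import re

# Tokens of a (simplified) content stream, tried in this order.
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)"        # literal string, e.g. (Age : 43)
    r"|/[^\s/()\[\]<>]+"           # name, e.g. /R17
    r"|[-+]?(?:\d+\.?\d*|\.\d+)"   # number
    r"|[A-Za-z'\"*]+"              # operator, e.g. cm, Do, BT, Tj, '
)


def text_blocks(stream_text, watermark_name="/R17"):
    """Return (after_watermark, strings) for each BT ... ET block, in order.

    after_watermark is True when `<watermark_name> Do` appeared since the
    previous text block ended. Escape sequences in strings are left as-is.
    """
    blocks = []
    strings = []       # strings shown inside the current BT ... ET block
    operands = []      # operands collected since the last operator
    in_text = False
    marked = False
    for tok in TOKEN.findall(stream_text):
        if tok[0] in "(/+-." or tok[0].isdigit():
            operands.append(tok)
            continue
        # Everything else is treated as an operator.
        if tok == "Do" and operands and operands[-1] == watermark_name:
            marked = True
        elif tok == "BT":
            in_text, strings = True, []
        elif tok == "ET" and in_text:
            blocks.append((marked, strings))
            in_text, marked = False, False
        elif tok in ("Tj", "TJ", "'", '"') and in_text:
            strings.extend(t[1:-1] for t in operands if t.startswith("("))
        operands = []
    return blocks


# Shortened excerpt of the sample entry (assumed already decompressed).
sample = """
q 1045 0 0 495 462.5 6510.5 cm
/R17 Do
Q
q 10 0 0 10 0 0 cm BT
(Sex : Male)Tj
(Age : 43)Tj
ET
Q
q 10 0 0 10 0 0 cm BT
(XVX0001081)Tj
(Ramesh Kumar)Tj
ET
Q
"""
for flagged, strings in text_blocks(sample):
    print(flagged, strings)
```

In the sample, each entry seems to consist of two text blocks (labels first, then values), and only the first one directly follows the `Do`. Grouping blocks into entries is therefore left to the caller and depends on how the rest of the page is laid out.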

The Details

This is entry #200; the other watermarked entry is constructed similarly:

[Image: data set 200 with DELETED watermark]

Watermarking is done by means of an image xobject. Only one image xobject is defined for the page, and both watermarked entries use it:

4 0 obj
<</Type/Page/MediaBox [0 0 595 841]
/Rotate 0/Parent 3 0 R
/Resources<</ProcSet[/PDF /ImageC /ImageI /Text]
    /ColorSpace 18 0 R
    /ExtGState 19 0 R
    /XObject 20 0 R
    /Font 21 0 R
    >>
/Contents 5 0 R
>>
endobj 
20 0 obj
<</R17
17 0 R>>
endobj
17 0 obj
<</Subtype/Image
/ColorSpace 16 0 R
/Width 128
/Height 88
/BitsPerComponent 8
/Filter/FlateDecode/Length 463>>stream 
[...]
endstream
endobj 

In the content stream, this xobject /R17 is drawn right before the data of the entry itself:

q 0.1 0 0 0.1 0 0 cm
[...]
q 1045 0 0 495 462.5 6510.5 cm
/R17 Do
Q
q
10 0 0 10 0 0 cm BT
0.000487366 Tc
/R10 8 Tf
1 0 0 1 86 650.75 Tm
(Sex : Male)Tj
0.000304794 Tc
-64 0 Td
(Age : 43)Tj
-0.000140686 Tc
-1 11.05 Td
(House No :)Tj
-0.00002085 Tc
1 31.95 Td
(Name :)Tj
0.00008575 Tc
/R12 7.15 Tf
25.5 17.8 Td
( 200 )Tj
ET
Q
1547.5 6475 485 535.5 re
S
q
10 0 0 10 0 0 cm BT
-0.000403137 Tc
/R14 8 Tf
1 0 0 1 145.1 708.5 Tm
(XVX0001081)Tj
0.000421651 Tc
/R14 7.05 Tf
-90.35 -14.95 Td
(Ramesh Kumar)Tj
0.000373332 Tc
/R10 7.05 Tf
-33 -12.75 Td
(Father's )Tj
0.000193787 Tc
7.3 TL
(Name)'
0.00037774 Tc
/R14 7.05 Tf
40.25 1.8 Td
(Ram Singh)Tj
0 Tc
2.5 -11.85 Td
(37)Tj
0.00137196 Tc
/R12 7.15 Tf
-5.25 13.35 Td
(:)Tj
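For the coordinate comparison approach, the numbers in this excerpt can be worked through by hand. The sketch below is only an illustration: the helper names (`concat`, `apply`, `image_bbox`, `line_origins`) are made up, the text moves are copied manually from the second text block above, and only line origins are compared, not the full extent of each string. It assumes the usual PDF convention that `cm` premultiplies the current matrix and that an image fills the unit square.

```python
def concat(m, ctm):
    """Return m x ctm for PDF matrices given as [a, b, c, d, e, f]."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return [
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    ]


def apply(ctm, x, y):
    """Map a point through a PDF matrix."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)


def image_bbox(ctm):
    """Bounding box (x0, y0, x1, y1) of the unit square painted by `Do`."""
    corners = [apply(ctm, x, y) for x in (0, 1) for y in (0, 1)]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


def line_origins(moves, ctm):
    """Track line start positions; Tm is absolute, other moves are relative."""
    x = y = 0.0
    out = []
    for op, (tx, ty), shown in moves:
        if op == "Tm":
            x, y = tx, ty          # translation part only; Tm here is 1 0 0 1 tx ty
        else:
            x, y = x + tx, y + ty
        out.append((shown, apply(ctm, x, y)))
    return out


page = [0.1, 0, 0, 0.1, 0, 0]                                   # outer cm
watermark = concat([1045, 0, 0, 495, 462.5, 6510.5], page)      # cm before /R17 Do
text = concat([10, 0, 0, 10, 0, 0], page)                       # cm before BT

# Moves of the second text block, copied by hand from the stream above.
moves = [
    ("Tm", (145.1, 708.5), "XVX0001081"),
    ("Td", (-90.35, -14.95), "Ramesh Kumar"),
    ("Td", (-33, -12.75), "Father's "),
    ("'", (0, -7.3), "Name"),      # ' moves down by the 7.3 TL leading
    ("Td", (40.25, 1.8), "Ram Singh"),
    ("Td", (2.5, -11.85), "37"),
]

box = image_bbox(watermark)
inside = [s for s, (px, py) in line_origins(moves, text)
          if box[0] <= px <= box[2] and box[1] <= py <= box[3]]
print(box)
print(inside)
```

The watermark box comes out at roughly (46.25, 651.05) to (150.75, 700.55) in default user space, and the origins of "Ramesh Kumar", "Ram Singh" and "37" fall inside it, while the labels and the ID start outside. So testing whether the name's origin lies inside a watermark box looks like a workable way to match, at least for this entry.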
