Extracting from specific areas using pdfclown

I am trying to highlight text in a PDF with two columns, but the extractor reads the text row by row across both columns, so the queried text doesn't get matched. Is there some function in pdfclown that would let me extract the first half of the page (i.e., the first column) and then the second one, perhaps by selecting areas?

Thanks.

Since you mention text extraction with PDF Clown, I assume you are using the TextExtractor class of that library.

This class offers several setters that help restrict the parsing area:

public void setAreas(List<Rectangle2D> value);
public void setAreaTolerance(double value);
public void setAreaMode(AreaModeEnum value);

setAreas lets you set the page areas to extract text from. setAreaTolerance lets you add some tolerance to these areas (essentially enlarging them by this value in all directions). setAreaMode controls whether a string must be fully contained by an area (Containment) or merely needs to intersect it (Intersection) to be included in the scan results.
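For the two-column case in the question, the areas could simply be the left and right halves of the page. The following is a minimal sketch using only java.awt.geom, not PDF Clown itself: the class ColumnAreas and its methods split, grow and matches are hypothetical helpers, the page size (612 x 792 points) is an assumption, and the matching logic only models the tolerance and mode semantics as described above rather than reproducing the library's implementation.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDF Clown.
final class ColumnAreas {

    /** Splits a page box into equal-width column rectangles, left to right. */
    static List<Rectangle2D> split(Rectangle2D pageBox, int columns) {
        List<Rectangle2D> areas = new ArrayList<>();
        double width = pageBox.getWidth() / columns;
        for (int i = 0; i < columns; i++) {
            areas.add(new Rectangle2D.Double(
                pageBox.getX() + i * width, pageBox.getY(),
                width, pageBox.getHeight()));
        }
        return areas;
    }

    /** Enlarges an area by the tolerance in all directions. */
    static Rectangle2D grow(Rectangle2D area, double tolerance) {
        return new Rectangle2D.Double(
            area.getX() - tolerance, area.getY() - tolerance,
            area.getWidth() + 2 * tolerance, area.getHeight() + 2 * tolerance);
    }

    /** Models Containment (true) versus Intersection (false) against the grown area. */
    static boolean matches(Rectangle2D area, Rectangle2D textBox,
                           double tolerance, boolean containment) {
        Rectangle2D grown = grow(area, tolerance);
        return containment ? grown.contains(textBox) : grown.intersects(textBox);
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter in points.
        Rectangle2D page = new Rectangle2D.Double(0, 0, 612, 792);
        List<Rectangle2D> columns = split(page, 2);

        // A text box that straddles the column boundary at x = 306.
        Rectangle2D word = new Rectangle2D.Double(300, 100, 20, 12);

        System.out.println(matches(columns.get(0), word, 0, true));   // false
        System.out.println(matches(columns.get(0), word, 0, false));  // true
        System.out.println(matches(columns.get(0), word, 15, true));  // true
    }
}
```

With PDF Clown, a list like the one returned by split would presumably be what gets handed to setAreas, with the tolerance and mode set through the other two setters.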

How these settings are applied can be seen in the TextExtractor method

public Map<Rectangle2D,List<ITextString>> filter(
    List<? extends ITextString> textStrings,
    Rectangle2D... areas
);

which filters the list of all text strings on the page against the given areas.
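To illustrate why filtering by area helps with the row-wise ordering problem, here is a rough stand-alone sketch of the idea. It does not call PDF Clown: AreaFilterSketch, Fragment and readColumnwise are hypothetical names, Fragment is only a stand-in for a positioned text string, and the grouping uses plain containment with no tolerance.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of area filtering, not the PDF Clown implementation.
final class AreaFilterSketch {

    /** Stand-in for a text string with its bounding box on the page. */
    static final class Fragment {
        final String text;
        final Rectangle2D box;

        Fragment(String text, Rectangle2D box) {
            this.text = text;
            this.box = box;
        }
    }

    /** Groups fragments under each area that fully contains them, keeping area order. */
    static Map<Rectangle2D, List<String>> filter(List<Fragment> fragments,
                                                 Rectangle2D... areas) {
        Map<Rectangle2D, List<String>> result = new LinkedHashMap<>();
        for (Rectangle2D area : areas) {
            List<String> hits = new ArrayList<>();
            for (Fragment fragment : fragments) {
                if (area.contains(fragment.box)) {
                    hits.add(fragment.text);
                }
            }
            result.put(area, hits);
        }
        return result;
    }

    /** Concatenates the text area by area, i.e. column by column. */
    static String readColumnwise(List<Fragment> fragments, Rectangle2D... areas) {
        StringBuilder out = new StringBuilder();
        for (List<String> hits : filter(fragments, areas).values()) {
            for (String text : hits) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        return out.toString();
    }
}
```

Given fragments that arrive in row-wise order (left, right, left, right), passing the left-column rectangle first and the right-column rectangle second yields all left-column text before any right-column text, which is the order a search over a two-column page needs.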
