Extracting from specific areas using pdfclown
I am trying to highlight text in a PDF with two columns, but the problem is that the extractor extracts the text row-wise, so the queried text doesn't get matched. I was wondering whether there is some function in pdfclown that can help me extract the first half of the page (i.e., the first column) and then the second one, probably by selecting the areas.

Thanks.
As you talk about text extraction with PDF Clown, I assume you are using the TextExtractor class of that library.

This class offers several attributes that help restrict the parsing area:
public void setAreas(List<Rectangle2D> value);
public void setAreaTolerance(double value);
public void setAreaMode(AreaModeEnum value);
setAreas allows you to set the page areas to extract text from, setAreaTolerance allows you to add some tolerance to these areas (essentially enlarging the areas by this value in all directions), and setAreaMode is used to control whether a string must be contained by the area (Containment) or merely needs to intersect the area (Intersection) to be included in the scan results.
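To make the area semantics concrete for the two-column case, here is a minimal Java sketch that uses only java.awt.geom and does not call PDF Clown. It assumes a US Letter page (612 x 792 points) split down the middle. ColumnAreas, columns, grow and matches are hypothetical names, and the way tolerance and mode are modelled simply follows the description above rather than the library's source.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

final class ColumnAreas {
    /** Splits a page of the given size into a left and a right column rectangle. */
    static List<Rectangle2D> columns(double pageWidth, double pageHeight) {
        double half = pageWidth / 2;
        Rectangle2D left = new Rectangle2D.Double(0, 0, half, pageHeight);
        Rectangle2D right = new Rectangle2D.Double(half, 0, half, pageHeight);
        return Arrays.asList(left, right);
    }

    /** Enlarges an area by the tolerance in all directions (assumed model). */
    static Rectangle2D grow(Rectangle2D area, double tolerance) {
        return new Rectangle2D.Double(
                area.getX() - tolerance,
                area.getY() - tolerance,
                area.getWidth() + 2 * tolerance,
                area.getHeight() + 2 * tolerance);
    }

    /**
     * Containment: the string box must lie fully inside the (enlarged) area.
     * Intersection: any overlap with the (enlarged) area is enough.
     */
    static boolean matches(Rectangle2D area, Rectangle2D box,
                           double tolerance, boolean containment) {
        Rectangle2D grown = grow(area, tolerance);
        return containment ? grown.contains(box) : grown.intersects(box);
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter in points.
        List<Rectangle2D> cols = columns(612, 792);
        Rectangle2D left = cols.get(0);

        // A string box that pokes 4 points past the left column's edge (x = 306).
        Rectangle2D box = new Rectangle2D.Double(300, 100, 10, 12);

        System.out.println(matches(left, box, 0, true));   // false: not fully inside
        System.out.println(matches(left, box, 5, true));   // true: tolerance covers the overhang
        System.out.println(matches(left, box, 0, false));  // true: overlap is enough
    }
}
```

A list like the one returned by columns is the kind of value you would pass to setAreas. Which mode and tolerance work best depends on how tight the gutter between the columns is in your document.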
How these attributes work can be seen in the TextExtractor method
public Map<Rectangle2D,List<ITextString>> filter(
List<? extends ITextString> textStrings,
Rectangle2D... areas
);
which filters the list of all text strings on the page.
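The effect of such a per-area filter on a two-column page can be illustrated with a simplified stand-in. This is not PDF Clown's implementation: TextBox is a hypothetical substitute for ITextString, the sketch always uses intersection and ignores tolerance, and the class and method names are made up for illustration. It only shows why grouping by area turns row-wise output into column-wise output.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AreaFilterSketch {
    /** Hypothetical positioned text string (stand-in for ITextString). */
    static final class TextBox {
        final String text;
        final Rectangle2D box;

        TextBox(String text, Rectangle2D box) {
            this.text = text;
            this.box = box;
        }
    }

    /** Collects, for each area in the given order, the strings whose boxes intersect it. */
    static Map<Rectangle2D, List<TextBox>> filter(List<TextBox> strings, Rectangle2D... areas) {
        Map<Rectangle2D, List<TextBox>> result = new LinkedHashMap<>();
        for (Rectangle2D area : areas) {
            List<TextBox> hits = new ArrayList<>();
            for (TextBox s : strings) {
                if (area.intersects(s.box)) {
                    hits.add(s);
                }
            }
            result.put(area, hits);
        }
        return result;
    }

    /** Joins the text area by area, i.e. column by column. */
    static String readByArea(Map<Rectangle2D, List<TextBox>> grouped) {
        StringBuilder sb = new StringBuilder();
        for (List<TextBox> column : grouped.values()) {
            for (TextBox s : column) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(s.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Strings in the row-wise order an unrestricted extraction would give.
        List<TextBox> rowWise = Arrays.asList(
                new TextBox("Left 1", new Rectangle2D.Double(50, 100, 100, 12)),
                new TextBox("Right 1", new Rectangle2D.Double(350, 100, 100, 12)),
                new TextBox("Left 2", new Rectangle2D.Double(50, 120, 100, 12)),
                new TextBox("Right 2", new Rectangle2D.Double(350, 120, 100, 12)));

        Rectangle2D left = new Rectangle2D.Double(0, 0, 306, 792);
        Rectangle2D right = new Rectangle2D.Double(306, 0, 306, 792);

        // Prints: Left 1 Left 2 Right 1 Right 2
        System.out.println(readByArea(filter(rowWise, left, right)));
    }
}
```

With the text grouped this way, a query that spans several lines of one column can be matched against that column's text alone.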