
Extract text from PDF at bookmark

I need to extract the text from a PDF right at the place where a bookmark is.

PDFBox extracts the whole page that the bookmark points to, as explained here.

But I need to extract only the text starting at the bookmark.

I believe iText can handle this.

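import java.awt.geom.Rectangle2D;
import com.itextpdf.text.pdf.parser.*;   // assuming iText 5

// getRectFromBookmark is a helper you have to write yourself: it turns the
// bookmark's destination into a rectangle in page coordinates.
Rectangle2D bookmarkRect = getRectFromBookmark(someBookmarkThingy);

// Only text located inside bookmarkRect reaches the extraction strategy.
FilteredTextRenderListener filter =
  new FilteredTextRenderListener( new LocationTextExtractionStrategy(),
                                  new RegionTextRenderFilter( bookmarkRect ));

// reader is an open PdfReader; pageNum is the 1-based page the bookmark points to.
String bookmarkText = PdfTextExtractor.getTextFromPage(reader, pageNum, filter);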

someBookmarkThingy will probably be a PdfDictionary of the bookmark in question.
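Before you can build a rectangle you have to locate the bookmark itself. Here is a rough sketch of that lookup, assuming the outline has already been read into nested maps with "Title" and "Kids" keys (which, as far as I know, is the shape iText 5's SimpleBookmark.getBookmark(reader) returns). BookmarkFinder and findByTitle are made-up names, and the code only touches plain Java collections, so treat it as an illustration rather than iText's own API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BookmarkFinder {
    /**
     * Depth-first search through an outline tree for the first entry whose
     * "Title" equals the given title. Returns null when nothing matches.
     * Assumption: child entries are stored as a List under the "Kids" key.
     */
    @SuppressWarnings("unchecked")
    static Map<String, Object> findByTitle(List<? extends Map<String, Object>> bookmarks,
                                           String title) {
        if (bookmarks == null) {
            return null;
        }
        for (Map<String, Object> entry : bookmarks) {
            if (title.equals(entry.get("Title"))) {
                return entry;
            }
            Object kids = entry.get("Kids");
            if (kids instanceof List) {
                Map<String, Object> hit = findByTitle((List<Map<String, Object>>) kids, title);
                if (hit != null) {
                    return hit;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hand-built outline standing in for what a PDF library would return.
        Map<String, Object> section = new HashMap<>();
        section.put("Title", "Section 1.1");
        List<Map<String, Object>> kids = new ArrayList<>();
        kids.add(section);

        Map<String, Object> chapter = new HashMap<>();
        chapter.put("Title", "Chapter 1");
        chapter.put("Kids", kids);
        List<Map<String, Object>> outline = new ArrayList<>();
        outline.add(chapter);

        System.out.println(findByTitle(outline, "Section 1.1"));
    }
}
```

The entry you get back still has to be turned into a rectangle, which depends on what kind of destination the bookmark carries.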

WARNING: Bookmarks can actually hold just about any action. They typically hold one of several varieties of GoTo* action.

GoTo actions can specify a rectangle, an upper-left corner and zoom factor, just a page, and quite a few other variants. Anything defining a zoom setting will be affected by the size of the window the PDF is being displayed in. That includes all of them except the one that explicitly defines a bounding box for the new view. You'll have to make an educated guess on what a typical window size is and do your conversions from there.
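To make the "educated guess" concrete, here is a minimal sketch for the corner-plus-zoom case. It assumes you pick a window size yourself, expressed in PDF points at 100% zoom, and that a zoom of 0 (or a missing zoom) means "keep the current zoom", which the sketch treats as 1.0. XyzViewport and viewFor are hypothetical names, and the window dimensions in the example are invented.

```java
import java.awt.geom.Rectangle2D;

class XyzViewport {
    /**
     * Estimates the page area visible after jumping to an upper-left corner
     * (left, top) at the given zoom, for an assumed window size.
     * PDF user space has its origin at the bottom left, so the view extends
     * downward from top, i.e. toward smaller y values.
     */
    static Rectangle2D viewFor(double left, double top, double zoom,
                               double windowWidthPts, double windowHeightPts) {
        // Assumption: zoom <= 0 or NaN means "unchanged"; fall back to 100%.
        double z = zoom > 0 ? zoom : 1.0;
        double width = windowWidthPts / z;
        double height = windowHeightPts / z;
        return new Rectangle2D.Double(left, top - height, width, height);
    }

    public static void main(String[] args) {
        // Guessing a 600 x 400 pt window; at 200% zoom it shows a 300 x 200 pt area.
        System.out.println(viewFor(72, 700, 2.0, 600, 400));
    }
}
```

The resulting rectangle is only as good as the window guess, so it is worth trying a few sizes against a real document before trusting it.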

You're probably going to need to read the PDF Specification, particularly section 12.6.4.2 "Go-To Actions". Hmph. What you really need will be the section on Destinations, 12.3.2. Page destinations can be defined thusly:

  • [pageRef /XYZ left top zoom]
  • [pageRef /Fit]
  • [pageRef /FitH top]
  • [pageRef /FitV left]
  • [pageRef /FitR left bottom right top]
  • [pageRef /FitB]
  • [pageRef /FitBH top]
  • [pageRef /FitBV left]

Have fun!
