Scraping two-column PDFs

Question

I am trying to scrape the text of hundreds of PDFs for a project.

The PDFs have title pages, headers, footers and two columns. I tried the packages pdftools and tabulizer. However, each has its advantages and disadvantages:

The pdf_text() function from pdftools reads the PDFs correctly, apart from a few encoding issues that can be fixed manually, but it does not take the two-column structure into account. It returns a character vector with one element per page.
By contrast, the extract_text() function from tabulizer handles the two-column structure nicely but in many cases produces incorrect results (example below). By default it returns a character vector with a single element containing the text of the entire PDF document.
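Since pdftools reads the characters correctly, one possible route is to rebuild the column order from word positions. pdftools also offers pdf_data(), which returns one data frame per page with the columns x, y and text for every word. The following is only a minimal sketch in base R: reorder_two_columns() is a hypothetical helper, and it assumes that the gutter sits at the horizontal centre of the page and that words on the same line share the same y value. Title pages and headers or footers that span both columns would need separate handling.

```r
# `words` mimics one page of pdftools::pdf_data(): one row per word with
# its left edge `x`, its top edge `y`, and the word itself in `text`.
reorder_two_columns <- function(words, page_width) {
  # Assumption: the gutter between the columns is at the page centre.
  split_at <- page_width / 2
  is_right <- words$x >= split_at

  read_column <- function(col) {
    if (nrow(col) == 0) return("")
    # Sort top to bottom, then left to right within a line.
    col <- col[order(col$y, col$x), ]
    # Assumption: words on one line have identical y coordinates.
    lines <- tapply(col$text, col$y, paste, collapse = " ")
    paste(lines, collapse = "\n")
  }

  # Read the whole left column first, then the whole right column.
  paste(read_column(words[!is_right, ]),
        read_column(words[is_right, ]),
        sep = "\n")
}

# Synthetic page: two lines in each column of a 600 pt wide page.
words <- data.frame(
  x    = c(50, 90, 320, 360, 50, 320),
  y    = c(100, 100, 100, 100, 115, 115),
  text = c("left", "one", "right", "one", "left-two", "right-two"),
  stringsAsFactors = FALSE
)
page_text <- reorder_two_columns(words, page_width = 600)
cat(page_text, "\n")
```

With real files, the input would come from pdftools::pdf_data(file)[[i]] and the width from pdftools::pdf_pagesize(file)$width[i]; real pages will likely need some tolerance when grouping y values into lines.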

Based on another post on Stack Overflow, I built the following function on top of tabulizer, since it handles the two-column structure of the PDFs and outputs a vector in which each page is stored in a separate element:

get_text <- function(url) {
  # Get the number of pages in the PDF
  p <- tabulizer::get_n_pages(url)
  # Extract the text; passing `pages` explicitly yields one element per page
  txt <- tabulizer::extract_text(url, pages = seq_len(p))
  # Output: character vector containing all pages
  txt
}

While it works fine in general, some PDFs are not read correctly. For example,

get_text(url = "https://aplikace.mvcr.cz/sbirka-zakonu/ViewFile.aspx?type=c&id=3592")

Instead of the correct words and numbers (which contain Czech letters), something like "\001\002\r\n\b\a\004 \006\t\n\r\n%.\005 \t\031\033 *." is displayed. This does not happen for all PDFs, however. Note also that pdftools reads this file correctly (while ignoring the two columns).
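With hundreds of files, it may help to flag the affected pages automatically. The sketch below uses base R only and measures the share of control characters on each page (ignoring tab, line feed and carriage return). The function name control_share() and the 5% threshold are assumptions chosen for illustration, not part of either package.

```r
# Share of control characters (code points below 32, except tab, LF, CR)
# in each element of a character vector, e.g. one element per page.
control_share <- function(x) {
  vapply(x, function(page) {
    if (is.na(page)) return(NA_real_)
    codes <- utf8ToInt(enc2utf8(page))
    if (length(codes) == 0) return(0)
    bad <- codes < 32 & !(codes %in% c(9L, 10L, 13L))
    mean(bad)
  }, numeric(1), USE.NAMES = FALSE)
}

# The garbled output quoted above versus a readable Czech snippet.
garbled  <- "\001\002\r\n\b\a\004 \006\t\n\r\n%.\005 \t\031\033 *."
readable <- "Sb\u00edrka z\u00e1kon\u016f"

shares <- control_share(c(garbled, readable))
# Hypothetical threshold: flag a page if more than 5% is control codes.
needs_fallback <- shares > 0.05
print(needs_fallback)
```

Pages flagged this way could then be re-read with pdftools::pdf_text(), which handles this particular file correctly, at the cost of losing the column order.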

Can anybody help me with this problem or explain why it occurs?

Thank you very much in advance!

Answer 1

I encountered this problem with some PDFs. One solution I used was to convert the codes back to their true values with stringr. Here is an example:

# Map each octal control code to the character it stands for in the affected PDFs
convert_Special_Coding_Numbers <- function(text)
{
  text <- stringr::str_replace_all(string = text, pattern = "\\003", "")
  text <- stringr::str_replace_all(string = text, pattern = "\\025", "2")
  text <- stringr::str_replace_all(string = text, pattern = "\\030", "5")
  text <- stringr::str_replace_all(string = text, pattern = "\\026", "3")
  text <- stringr::str_replace_all(string = text, pattern = "\\034", "9")
  text <- stringr::str_replace_all(string = text, pattern = "\\017", ",")
  text <- stringr::str_replace_all(string = text, pattern = "\\023", "0")
  text <- stringr::str_replace_all(string = text, pattern = "\\027", "4")
  return(text)
}
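To build a mapping like the one above for a new document, it can help to first list which control codes actually occur and how often. The following is a small base R sketch; count_control_codes() is a hypothetical helper and prints the codes in the same octal notation that the replacement patterns use.

```r
# Tabulate the control characters (except tab, LF, CR) found in `text`,
# labelled as octal escapes such as \025, most frequent first.
count_control_codes <- function(text) {
  codes <- unlist(lapply(text[!is.na(text)], utf8ToInt))
  codes <- codes[!is.na(codes) & codes < 32 & !(codes %in% c(9L, 10L, 13L))]
  counts <- table(sprintf("\\%03o", codes))
  sort(counts, decreasing = TRUE)
}

# Example input: the code \025 appears twice, \030 once.
code_counts <- count_control_codes("\025\025\030 abc")
print(code_counts)
```

Which character each code stands for still has to be worked out by comparing the output with the visible text of the PDF.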
