Scraping two-column PDF

I am trying to scrape the text of hundreds of PDFs for a project.

The PDFs have title pages, headers, footers and two columns. I tried the packages pdftools and tabulizer. However, each has its advantages and disadvantages:

  • The pdf_text() function from pdftools reads the PDFs correctly, with only a few encoding issues that can be fixed manually, but it does not take the two-column structure into account. Moreover, it produces a character vector with as many elements as there are pages.
  • In contrast, the extract_text() function from tabulizer handles the two-column structure nicely but produces incorrect results in many cases (example below). Moreover, it produces a character vector with only one element containing the text of the entire PDF document.
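One way to combine the strengths of both readers might be to keep pdftools for decoding and reorder its words by position. The following is only a minimal sketch: it assumes the per-page data frames returned by pdftools::pdf_data() (columns x, y and text), and both the function name reorder_two_columns and the gutter position split_x are hypothetical and have to be chosen for the actual layout. Title pages, headers and footers that span both columns are not handled, and words on one line are assumed to share the same y value.

```r
# Reorder the words of one page so the left column is read before the right one.
# `words`:   data frame with columns x, y, text (one row per word), in the
#            shape produced per page by pdftools::pdf_data().
# `split_x`: assumed horizontal position of the gutter between the columns.
reorder_two_columns <- function(words, split_x) {
  is_right <- words$x >= split_x
  # Left column first, then top to bottom, then left to right within a line
  ord <- order(is_right, words$y, words$x)
  words <- words[ord, ]
  # Words with the same column and the same y are treated as one line
  key <- paste(is_right[ord], words$y)
  groups <- split(words$text, factor(key, levels = unique(key)))
  lines <- vapply(groups, paste, character(1), collapse = " ")
  paste(unname(lines), collapse = "\n")
}

# Synthetic page: two lines in the left column, two in the right column
demo <- data.frame(
  x = c(300, 10, 50, 300, 10),
  y = c(10, 10, 10, 20, 20),
  text = c("R1", "L1a", "L1b", "R2", "L2"),
  stringsAsFactors = FALSE
)
cat(reorder_two_columns(demo, split_x = 200), "\n")
```

With a real file this could be applied page by page, for example vapply(pdftools::pdf_data(file), reorder_two_columns, character(1), split_x = 300), where 300 is only a placeholder for the measured gutter position.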

Based on another post on Stack Overflow, I built the following function. It is based on tabulizer, since that package handles the two-column structure of the PDFs, and it outputs a vector with each page stored in a separate element:

get_text <- function(url) {
  # Get the number of pages in the PDF
  p <- tabulizer::get_n_pages(url)
  # Extract the text of every page
  txt <- tabulizer::extract_text(url, pages = seq_len(p))
  # Output: character vector containing all pages
  return(txt)
}

While it works fine in general, some PDFs are not read correctly. For example:

get_text(url = "https://aplikace.mvcr.cz/sbirka-zakonu/ViewFile.aspx?type=c&id=3592")

Instead of the correct words and numbers (which contain Czech letters), something like "\001\002\r\n\b\a\004 \006\t\n\r\n%.\005 \t\031\033 *." is displayed. This does not happen for all PDFs, though. Note also that pdftools reads this file correctly (while ignoring the two columns).
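When processing hundreds of files, it may help to flag such garbled output automatically instead of inspecting it by eye. The sketch below is an assumption-laden heuristic, not part of either package: the helper names control_char_share and looks_garbled and the 5 % threshold are hypothetical. It simply measures how many characters are control codes other than tab, line feed and carriage return.

```r
# Share of characters that are control codes (below 32), not counting
# tab (9), line feed (10) and carriage return (13).
control_char_share <- function(txt) {
  codes <- utf8ToInt(paste(txt, collapse = ""))
  if (length(codes) == 0) return(0)
  is_ctrl <- codes < 32 & !(codes %in% c(9, 10, 13))
  mean(is_ctrl)
}

# Flag text as garbled when the share exceeds an assumed threshold.
looks_garbled <- function(txt, threshold = 0.05) {
  isTRUE(control_char_share(txt) > threshold)
}

looks_garbled("\001\002\b\a\004")   # control codes only
looks_garbled("plain text\n")       # ordinary text
```

A flagged file could then be re-read with pdftools::pdf_text(), which according to the observation above decodes this document correctly.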

Can anybody help me with this problem or explain why it occurs?

Thank you very much in advance!

I encountered this problem with some PDFs. One solution I used was to convert the numbers back to their true values with stringr. Here is an example:

# Replace the control codes that appear in place of digits and commas
# with the characters they stand for.
convert_Special_Coding_Numbers <- function(text)
{
  # \003 carries no visible character here, so it is dropped
  text <- stringr::str_replace_all(string = text, pattern = "\\003", "")
  text <- stringr::str_replace_all(string = text, pattern = "\\025", "2")
  text <- stringr::str_replace_all(string = text, pattern = "\\030", "5")
  text <- stringr::str_replace_all(string = text, pattern = "\\026", "3")
  text <- stringr::str_replace_all(string = text, pattern = "\\034", "9")
  text <- stringr::str_replace_all(string = text, pattern = "\\017", ",")
  text <- stringr::str_replace_all(string = text, pattern = "\\023", "0")
  text <- stringr::str_replace_all(string = text, pattern = "\\027", "4")
  return(text)
}
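The same table can also be written without stringr. The following base R sketch is only an alternative formulation of the mapping above; the function name convert_codes_base is hypothetical, and the table covers exactly the codes listed in the answer, nothing more.

```r
# Base R variant of the mapping above.
# chartr() translates characters one to one; \003 is removed separately
# because chartr() cannot delete characters.
convert_codes_base <- function(text) {
  text <- gsub("\003", "", text, fixed = TRUE)
  # Codes in `old` line up position by position with the characters in `new`
  chartr(old = "\017\023\025\026\027\030\034", new = ",023459", text)
}

convert_codes_base("\025\023\025\027")
```

Every code in this table lies exactly 29 below the ASCII code of its replacement (for example \023 is 19 and "0" is 48). That may hint at a constant shift, but it is only an observation about this one table, so the codes for any missing digits should be checked against the actual PDF before they are added.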
