When reading in PDF text using readtext, is there a way to ensure that readtext respects columns?

I have a PDF document formatted in landscape with three columns of text, which I am attempting to read into R using readtext(). When it reads the text in, rather than reading down each column in order, it reads across the columns along the same line of text.

To describe it simply, if the first line of each column was just a string of numbers from 1-10 and the second was a string from 11-20, then readtext() reads it in as "1234567891012345678910" rather than as "1234567891011121314..." etc.

Is there a way to specify that readtext() follows columns during the import?

Best, Daniel

The (current) answer is no. readtext uses the pdftools package to read PDFs, and pdftools does not recognize the separate columns. This has something to do with poppler, the library being used to read the PDFs. See also issue 4 on GitHub. The layout information is sort of available in pdf_data, but it is not easy to retrieve.
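As a rough illustration of what retrieving the columns from pdf_data might involve, here is a minimal sketch. pdftools::pdf_data() returns one data frame per page with the position of each word (x, y, width, height) alongside its text. The helper read_columns() below is hypothetical, not part of readtext or pdftools, and it assumes the columns are equal-width bands across the page, which may not hold for a real document.

```r
# Hypothetical helper: reorder the words of ONE page (a data frame shaped like
# an element of pdftools::pdf_data() output) so the text reads down each
# column in turn instead of across the page.
# ASSUMPTION: the page splits into `n_cols` equal-width bands between 0 and
# `page_width`; real layouts with margins or uneven columns need other breaks.
read_columns <- function(words, n_cols = 3,
                         page_width = max(words$x + words$width)) {
  # Band boundaries, e.g. 4 break points for 3 columns.
  breaks <- seq(0, page_width, length.out = n_cols + 1)
  # Assign each word to a band by the left edge of its bounding box.
  band <- findInterval(words$x, breaks,
                       rightmost.closed = TRUE, all.inside = TRUE)
  # Sort by column, then top to bottom, then left to right.
  ord <- order(band, words$y, words$x)
  # Paste the words of each column into one string.
  vapply(split(words$text[ord], band[ord]), paste, character(1),
         collapse = " ")
}

# Usage (file name is a placeholder):
# page <- pdftools::pdf_data("document.pdf")[[1]]
# read_columns(page, n_cols = 3)
```

Words on the same visual line can have slightly different y values, so a real document may need y rounded to some tolerance before sorting. The column boundaries would probably also have to be tuned by inspecting the x positions on the page.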
