
When reading in PDF text using readtext, is there a way to ensure that readtext respects columns?

Asked 2019-12-14 14:44:28. Tags: r, quanteda, read-text

I have a PDF document formatted in landscape with three columns of text, which I am attempting to read into R using readtext(). When it reads the text in, rather than reading down each column in order, it reads across the columns along the same line of text.

To describe it simply: if the first line of each column were just a string of numbers from 1-10 and the second line a string from 11-20, then readtext() reads it in as "1234567891012345678910" rather than as "1234567891011121314..." etc.

Is there a way to specify that readtext() follows columns in my importing process?

Best, Daniel

1 answer

The (current) answer is no. readtext uses the pdftools package to read PDFs, and pdftools does not recognize the separate columns. This has something to do with poppler, the library that is used to read the PDFs. See also issue 4 on GitHub. The column information is sort of available in pdf_data, which reports a position for each word on the page, but it is not easy to retrieve.
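As a rough illustration of what retrieving the columns from pdf_data might involve, here is a minimal sketch, not something readtext or pdftools does for you. It assumes the page has equal-width columns and that words on the same line share the same y value, which real PDFs do not always satisfy. The function name `read_by_column`, its arguments, and the file name in the commented usage are hypothetical.

```r
# Sketch: rebuild a column-wise reading order from word positions.
# `words` is a data frame shaped like one page of pdftools::pdf_data():
# it needs columns x and y (position of each word) and text.
# `page_width` and `n_cols` are assumptions supplied by the caller.
read_by_column <- function(words, page_width, n_cols = 3) {
  col_width <- page_width / n_cols
  # Assign each word to a column by its left edge (equal-width columns assumed).
  col_id <- pmin(floor(words$x / col_width) + 1, n_cols)
  # Order words by column, then top to bottom, then left to right.
  ord <- order(col_id, words$y, words$x)
  # Group the ordered words by column and join each group into one string.
  text_by_col <- split(words$text[ord], col_id[ord])
  unname(vapply(text_by_col, paste, character(1), collapse = " "))
}

# Toy page: three columns, two lines each, listed in across-the-page order.
toy <- data.frame(
  x = c(10, 110, 210, 10, 110, 210),
  y = c(10, 10, 10, 20, 20, 20),
  text = c("a1", "b1", "c1", "a2", "b2", "c2"),
  stringsAsFactors = FALSE
)
result <- read_by_column(toy, page_width = 300)
print(result)

# Hypothetical usage on a real file (requires the pdftools package):
# pages <- pdftools::pdf_data("my_file.pdf")
# sizes <- pdftools::pdf_pagesize("my_file.pdf")
# read_by_column(pages[[1]], page_width = sizes$width[1])
```

The function returns one string per column, so the pieces can be pasted together in column order before building a corpus. Pages whose columns have unequal widths, or whose lines use mixed font sizes, would need column boundaries and line grouping tuned by hand.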




