When reading in pdf text using readtext is there a way to ensure that readtext respects columns?

Original 2019-12-14 14:44:28 9 1 r/ quanteda/ read-text

The problem is that I have a PDF document formatted in landscape with three columns of text which I am attempting to read into R using readtext(). When it reads the text in, rather than reading down each column in order, it is reading between columns across the same line of text.

To describe it simply, if the first line of each column was just a string of numbers from 1-10 and the second was a string from 11-20 then readtext() reads it in as "1234567891012345678910" rather than as "1234567891011121314..." etc.

Is there a way to specify that readtext() follows columns in my importing process?

Best, Daniel

1 answer

The (current) answer is no. readtext uses the pdftools package to read PDFs, and pdftools does not recognize separate columns. This stems from poppler, the library pdftools relies on to read PDFs. See also issue 4 on GitHub. The layout information is partly available through pdf_data, but it is not easy to retrieve.
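As a rough illustration of how the positions reported by pdf_data could be used, here is a minimal sketch. It assumes the columns are equally wide and that words on the same line have nearly the same y value. The function reorder_by_columns and its arguments are hypothetical names, not part of readtext or pdftools, and the demo data frame only imitates the x, y and text columns that pdf_data returns for one page.

```r
# Sketch: rebuild column-wise reading order from word positions.
# `words` is assumed to look like one page of pdftools::pdf_data() output:
# a data frame with at least x, y (word position) and text.
# reorder_by_columns() is a hypothetical helper, not a library function.
reorder_by_columns <- function(words, page_width, n_cols = 3, y_tol = 2) {
  # Assign each word to a column by cutting the page into equal-width bands
  col_width <- page_width / n_cols
  col_id <- pmin(floor(words$x / col_width) + 1, n_cols)
  # Bin y coordinates so words on (nearly) the same baseline share a line id
  line_id <- round(words$y / y_tol)
  # Read column by column, then top to bottom, then left to right
  ord <- order(col_id, line_id, words$x)
  key <- paste(col_id[ord], line_id[ord])
  lines <- tapply(words$text[ord], factor(key, levels = unique(key)),
                  paste, collapse = " ")
  paste(lines, collapse = "\n")
}

# Made-up word positions: three columns, two lines each
demo <- data.frame(
  x = c(10, 40, 210, 240, 410, 10, 210, 410),
  y = c(20, 20, 20, 20, 20, 36, 36, 36),
  text = c("a1", "a2", "b1", "b2", "c1", "a3", "b3", "c3"),
  stringsAsFactors = FALSE
)
result <- reorder_by_columns(demo, page_width = 600)
cat(result, "\n")

# With a real file (path is a placeholder), something like:
#   pages <- pdftools::pdf_data("file.pdf")
#   sizes <- pdftools::pdf_pagesize("file.pdf")
#   reorder_by_columns(pages[[1]], page_width = sizes$width[1])
```

Equal-width bands are the simplest choice. A document with uneven columns or wide gutters would need column boundaries derived from the gaps in the x values instead.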


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


Answer 1 (accepted), 2019-12-14 15:21:36
