tabula: extract a table from a PDF and remove line breaks

I have a table with wrapped text in a PDF file:

[image]

I used tabula to extract the table from the PDF file:

import tabula

file1 = "path_to_pdf_file"
table = tabula.read_pdf(file1, pages=1, lattice=True)
table[0]

However, the result looks like this:

[image]

Is there a way to make tabula treat a line break or wrapped text inside a table cell as part of the same row, rather than as extra rows?

Using tabula, the end result should look like this:

[image]

You need to add a parameter. Replace

file1 = "path_to_pdf_file"
table = tabula.read_pdf(file1, pages=1)
table[0]

with

file1 = "path_to_pdf_file"
table = tabula.read_pdf(file1, pages=1, lattice=True)
table[0]

All of this is according to the documentation.

Here is an example:

See the article "https://effectivehealthcare.ahrq.gov/sites/default/files/pdf/methods-guidance-tests-bias_methods.pdf"

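import tabula

file1 = r"C:\Users\s-degossondevarennes\.......\Desktop\methods-guidance-tests-bias_methods.pdf"
# lattice=True tells tabula to use the table's ruling lines to find cell boundaries
table = tabula.read_pdf(file1, pages=3, lattice=True)

# read_pdf returns a list of DataFrames; take the first table found on the page
df = table[0]
# drop the columns that are not needed for this demonstration
df = df.drop(['Unnamed: 1', 'Unnamed: 2', 'Description', 'Unnamed: 3'], axis=1)
df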

returns:

     Unnamed: 0  \
0                                    NaN   
1                        Spectrum effect   
2                           Context bias   
3                         Selection bias   
4                                    NaN   
5            Variation in test execution   
6           Variation in test technology   
7                      Treatment paradox   
8               Disease progression bias   
9                                    NaN   
10     Inappropriate reference\rstandard   
11        Differential verification bias   
12             Partial verification bias   
13                                   NaN   
14                           Review bias   
15                  Clinical review bias   
16                    Incorporation bias   
17                  Observer variability   
18                                   NaN   
19    Handling of indeterminate\rresults   
20  Arbitrary choice of threshold\rvalue   

                            Source of Systematic Bias  
0                                          Population  
1   Tests may perform differently in various sampl...  
2   Prevalence of the target condition varies acco...  
3   The selection process determines the compositi...  
4                Test Protocol: Materials and Methods  
5   A sufficient description of the execution of i...  
6   When the characteristics of a medical test cha...  
7   Occurs when treatment is started on the basis ...  
8   Occurs when the index test is performed an unu...  
9       Reference Standard and Verification Procedure  
10  Errors of imperfect reference standard bias th...  
11  Part of the index test results is verified by ...  
12  Only a selected sample of patients who underwe...  
13                                     Interpretation  
14  Interpretation of the index test or reference ...  
15  Availability of clinical data such as age, sex...  
16  The result of the index test is used to establ...  
17  The reproducibility of test results is one det...  
18                                           Analysis  
19  A medical test can produce an uninterpretable ...  
20  The selection of the threshold value for the i...  

The three dots in the Source of Systematic Bias column only mean that pandas shortened the text for display. Everything that was in each cell, line breaks included, is treated as a single cell (item), not as multiple cells. Further proof of that:

df.iloc[2,1]

returns the cell content:

'Prevalence of the target condition varies according to setting and may affect\restimates of test performance. Interpreters may consider test results to be\rpositive more frequently in settings with higher disease prevalence, which may\ralso affect estimates of test performance.'
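Since the wrapped text comes back as one cell with `\r` characters inside it, the remaining line breaks can be removed with pandas after extraction. The following is only a minimal sketch: the helper names `collapse_whitespace` and `remove_cell_line_breaks` are made up for this illustration, and the small `sample` frame stands in for the DataFrame returned by tabula.

```python
import pandas as pd


def collapse_whitespace(value):
    # Replace every run of whitespace (carriage returns and newlines
    # included) in a string with a single space.
    if isinstance(value, str):
        return " ".join(value.split())
    return value  # leave NaN, None and numbers untouched


def remove_cell_line_breaks(df):
    # Hypothetical helper: returns a cleaned copy, the input is not modified.
    cleaned = df.copy()
    for col in cleaned.columns:
        cleaned[col] = cleaned[col].map(collapse_whitespace)
    # Column headers can contain line breaks too, so clean them as well.
    cleaned.columns = [collapse_whitespace(c) for c in cleaned.columns]
    return cleaned


# Stand-in for table[0]; the strings mimic the extracted cells shown above.
sample = pd.DataFrame({
    "Unnamed: 0": ["Inappropriate reference\rstandard", None],
    "Source of Systematic Bias": [
        "may affect\restimates of test performance.",
        "Population",
    ],
})

cleaned = remove_cell_line_breaks(sample)
print(cleaned.iloc[0, 0])
print(cleaned.iloc[0, 1])
```

Note that this collapses all whitespace runs, so intentional double spaces inside a cell would also be reduced to one.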

There must be something particular about your PDF. If it's available online, share the link and I'll take a look.
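If the table in your PDF has no ruling lines between cells, lattice mode may not be able to keep wrapped text together, and the extra rows may have to be merged afterwards. Below is a rough sketch of one possible heuristic, assuming that a wrapped line shows up as a row whose first column is empty. The function `merge_wrapped_rows` and the sample data are hypothetical, not part of tabula. The heuristic would give wrong results for tables where an empty first cell is legitimate, such as the section header rows in the example above.

```python
import pandas as pd


def merge_wrapped_rows(df, key_col):
    # Hypothetical helper: a row with an empty key_col is assumed to be the
    # continuation of the nearest row above it and is folded into that row.
    merged = []
    for record in df.to_dict("records"):
        is_continuation = bool(merged) and pd.isna(record[key_col])
        if not is_continuation:
            merged.append(dict(record))
            continue
        previous = merged[-1]
        for col, value in record.items():
            if pd.isna(value):
                continue  # nothing to append for this column
            if pd.isna(previous[col]):
                previous[col] = value
            else:
                previous[col] = f"{previous[col]} {value}"
    return pd.DataFrame(merged, columns=df.columns)


# Made-up data imitating a table whose wrapped text was split into rows.
raw = pd.DataFrame({
    "Name": ["Spectrum effect", None, "Context bias"],
    "Description": [
        "Tests may perform",
        "differently in samples.",
        "Prevalence varies.",
    ],
})

merged_df = merge_wrapped_rows(raw, key_col="Name")
print(merged_df)
```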
