
How to read a PDF file line by line and create a CSV

Here is my PDF. I found THIS and used it to scrape my PDF.

6 BEDROOMS
NameAddressUnitSizeKeyRentSq FtMove in DateNotesTenant
Prop #
Texan 261009 West 26th3076x3$4,6952,1368/15/14$1,000 Bonus (1) Park -     

It's pretty mixed up. Or is it because the PDF is formatted in a way that is unreadable? I thought there was a way I could scrape each row and create a CSV with the columns by iteration or something.

For example, populate a CSV with columns like this:

T26 | Texan 26          | 1009 West 26th | 307      | 6x3 | ... 
e075| Texan North Campus| 5117 N Lamar   |See below | 6x3 |...

Is there a way around this?

The code snippet you used has produced practically unusable data, so I don't think that is the way to go. Scraping from a PDF is generally rather difficult. However, take a look at pdftables.com: they provide an API for scraping tables from PDF documents, which I've found works in the majority of cases. It's your best chance at this, I'd say.

You can use Camelot (a Python library) to write a script that extracts tabular data from your PDF and exports it to a CSV. You can check out the documentation at http://camelot-py.readthedocs.io. It would be helpful if you could post a link to your PDF. Here's a generic code example:

>>> import camelot
>>> tables = camelot.read_pdf('file.pdf')
>>> type(tables[0].df)
<class 'pandas.core.frame.DataFrame'>
>>> tables[0].to_csv('file.csv')

Disclaimer: I'm the author of the library.
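
For the row-by-row approach the question asks about, here is a minimal sketch using only Python's standard `csv` module. It assumes the rows have already been extracted as lists of cell strings by whatever tool works for the PDF (with Camelot, something like `tables[0].df.values.tolist()` should give that shape). The `rows_to_csv` helper and the `HEADER` column names are hypothetical, taken from the first five columns of the desired output in the question, so adjust them to the real table.

```python
import csv
import io

# Assumed column names, based on the first five columns of the desired
# output shown in the question; the real PDF may need different ones.
HEADER = ["Prop #", "Name", "Address", "Unit", "Size"]


def rows_to_csv(rows, out, header=HEADER):
    """Write already-extracted table rows to a CSV stream, one row at a time."""
    writer = csv.writer(out)
    writer.writerow(header)
    for row in rows:
        # Strip stray whitespace that PDF extraction tends to leave in cells.
        cells = [str(cell).strip() for cell in row]
        # Pad or truncate so every row has as many cells as the header.
        cells = (cells + [""] * len(header))[: len(header)]
        writer.writerow(cells)


# Sample rows copied from the question (first five columns only).
rows = [
    ["T26", "Texan 26          ", "1009 West 26th", "307", "6x3"],
    ["e075", "Texan North Campus", "5117 N Lamar", "See below", "6x3"],
]

buffer = io.StringIO()
rows_to_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open('file.csv', 'w', newline='')` as `out`; the `newline=''` argument keeps the `csv` module from emitting blank lines between rows on Windows.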
