
Converting a PDF file consisting of tables into a text document containing tables in Python

I have a PDF file that consists of general tables containing names, addresses, phone numbers, and fax numbers. What I want is:


1) Read this file, get the content of each row, and put it in a database, i.e. get the name from the corresponding name column of the PDF file and store it in the database, and so on for address, phone, etc.
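Once the rows have been separated, the database step itself is straightforward. Below is a minimal sketch using Python's built-in sqlite3 module. The table name `hospitals`, the column names, and the helper `store_rows` are assumptions made for illustration, and the two sample rows were separated by hand from the extracted output quoted later in this question (their phone and fax values are left empty because they are not recoverable from that output).

```python
import sqlite3


def store_rows(conn, rows):
    """Create the table if needed and insert one record per table row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hospitals ("
        "serial_no INTEGER PRIMARY KEY, name TEXT, address TEXT, "
        "phone TEXT, fax TEXT)"
    )
    # Placeholders (?) let sqlite3 handle quoting, so commas or quotes
    # inside names and addresses cannot break the statement.
    conn.executemany(
        "INSERT INTO hospitals (serial_no, name, address, phone, fax) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()


# Illustrative rows, hand-separated from the extracted text.
rows = [
    (65, "Rockland Hospital",
     "B-33-34, Qutab Institutional Area, New Delhi", None, None),
    (66, "National Heart Institute",
     "49, Community Centre, East of Kailash", None, None),
]

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
store_rows(conn, rows)
names = [r[0] for r in conn.execute(
    "SELECT name FROM hospitals ORDER BY serial_no")]
print(names)
```

An in-memory database keeps the sketch self-contained; the hard part remains producing clean rows from the PDF in the first place.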

The main problem is that whenever I read the PDF file and convert it into a text file (I don't know any other way to use the data directly without first converting it to text), the text output is completely messed up: the format and spacing are not preserved. Please suggest a new way to do this, or what can be changed in the following code:

import pyPdf

def getPDFContent(path):
    text = ""
    # Load the PDF into pyPdf
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    # Iterate over the pages
    for i in range(0, pdf.getNumPages()):
        # Extract the text of this page and append it to the result
        content = pdf.getPage(i).extractText() + "\n\n"
        text += content
        # Split on "Fax" as a rough attempt at separating records
        tokens = content.split("Fax")
        print len(tokens)
        for t in tokens:
            print t  # general check
    # Write the extracted text to the output file
    f = open("C:\\Doctor's Data\\delhi\\hospital_delhi1.txt", "w")
    f.write(text.encode("utf-8"))
    f.close()
    return text


getPDFContent("C:\\Doctor's Data\\delhi\\hospital_delhi1.pdf")

In addition, my output (messed up) is:

S.NONAME OF THE HOSPITAL/CLINIC ADDRESS OF THE HOSPITAL/CLINIC PHONE NO. FAX NOLIST OF HOSPITALS AT DELHI59Walia Nursing HomeG.60, Laxmi Nagar, Shakarpur, DelhiDr.ASDave - 2224858560Metro Heart InstituteSector A, Faridabad :226358961Ayushman HospitalSector-XII, Dwarka, New Delhi42811114/15/16/18 : 28081723, 4553700163Mohan Eye Institute11-B, Ganga Ram Hospital Marg, New Delhi-6064Shroff Eye CentreKasturba Gandhi Marg, New DelhiReimbursement on CGHS rates without credit basis65Rockland HospitalB-33-34, Qutab Institutional Area, New Delhi66National Heart Institute49, Community Centre, East of Kailash

Have a look at some existing Python packages:
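Independent of any particular package, the underlying idea is to keep each word's position on the page instead of concatenating the text, which is what produced the run-together output above. The following is a minimal sketch of that idea, not the method of any specific library: it assumes the extractor can supply words as hypothetical (x, y, text) tuples, with y measured from the top of the page, and that the x offset where each column starts is known. The function name `words_to_rows`, the coordinates, and the shortened address cells are all made up for illustration.

```python
from bisect import bisect_right


def words_to_rows(words, column_starts, y_tolerance=2.0):
    """Rebuild table rows from (x, y, text) word positions.

    words: iterable of (x, y, text); x is the left edge of the word,
           y its vertical position measured from the top of the page.
    column_starts: sorted x offsets at which each column begins.
    """
    # Group words into lines: a word joins the current line when its y
    # is within y_tolerance of that line's first word.
    lines = []  # list of (y, [(x, text), ...])
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))

    rows = []
    for _, items in lines:
        cells = [[] for _ in column_starts]
        for x, text in sorted(items):
            # Pick the right-most column that starts at or before x;
            # words left of the first column fall into column 0.
            col = max(bisect_right(column_starts, x) - 1, 0)
            cells[col].append(text)
        rows.append([" ".join(cell) for cell in cells])
    return rows


# Hypothetical word positions for two rows of the hospital table.
words = [
    (10, 100.0, "65"), (40, 100.0, "Rockland"), (95, 100.5, "Hospital"),
    (200, 100.0, "New"), (230, 100.0, "Delhi"),
    (10, 120.0, "66"), (40, 120.0, "National"), (95, 120.0, "Heart"),
    (140, 120.0, "Institute"), (200, 120.0, "East"), (230, 120.0, "of"),
    (245, 120.0, "Kailash"),
]
table = words_to_rows(words, column_starts=[0, 30, 180])
for row in table:
    print(row)
```

Each resulting row is a list of cell strings, one per column, which can then be written out as delimited text or inserted into a database. Real tables with wrapped cells or merged columns would need more careful handling than this sketch provides.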
