
Get text to CSV format using Python

I am able to extract the data from a PDF as text, but now I need the data in CSV format with the table structure preserved.

I tried to get the table structure, but it didn't work. I'm also able to generate the output as JSON. Is there a way to get the result into a tabular CSV format? Any input is appreciated.

[Image: PDF table]

Below is the code I have used.

import boto3
import time

# Document
s3BucketName = "textractanalysisexample"
documentName = "sheet_example.pdf"

def startJob(s3BucketName, objectName):
   # Start an asynchronous Textract text-detection job for an S3 object
   client = boto3.client('textract')
   response = client.start_document_text_detection(
       DocumentLocation={
           'S3Object': {
               'Bucket': s3BucketName,
               'Name': objectName
           }
       })

   return response["JobId"]
   
def isJobComplete(jobId):
   # For production use cases, use SNS based notification 
   # Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
   time.sleep(5)
   client = boto3.client('textract')
   response = client.get_document_text_detection(JobId=jobId)
   status = response["JobStatus"]
   #print("Job status: {}".format(status))

   while(status == "IN_PROGRESS"):
       time.sleep(5)
       response = client.get_document_text_detection(JobId=jobId)
       status = response["JobStatus"]
       #print("Job status: {}".format(status))

   return status
   
def getJobResults(jobId):

   pages = []

   client = boto3.client('textract')
   response = client.get_document_text_detection(JobId=jobId)
   
   pages.append(response)
   print("Resultset page received: {}".format(len(pages)))
   nextToken = None
   if('NextToken' in response):
       nextToken = response['NextToken']

   while(nextToken):

       response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)

       pages.append(response)
       #print("Resultset page received: {}".format(len(pages)))
       nextToken = None
       if('NextToken' in response):
           nextToken = response['NextToken']

   return pages

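def lambda_handler(event, context):

   jobId = startJob(s3BucketName, documentName)
   #print("Started job with id: {}".format(jobId))

   # isJobComplete returns the final status string, which is always truthy,
   # so compare it explicitly instead of testing it directly
   if isJobComplete(jobId) != "SUCCEEDED":
       print("Textract job did not succeed")
       return

   response = getJobResults(jobId)

   # Print detected text
   for resultPage in response:
       for item in resultPage["Blocks"]:
           if item["BlockType"] == "LINE":
               print(item["Text"])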
     


You can import the csv module to write to a CSV file like so:

import csv

with open('my_pdf.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split(",") for line in stripped if line)
    # newline='' lets the csv module control line endings (avoids blank rows on Windows)
    with open('my_pdf.csv', 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('title', 'intro'))
        writer.writerows(lines)

You can just put in the rows you need, and this splits your data into comma-separated values. You can find more information on the CSV writer (and the csv module in general) in the Python docs.
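Splitting lines on commas only works if the extracted text already contains commas. If the goal is to keep the table structure from the PDF, one option is to run Textract's table analysis (`start_document_analysis` with `FeatureTypes=["TABLES"]`, then `get_document_analysis`) instead of plain text detection, and rebuild the rows from the returned `CELL` blocks. The following is only a rough sketch under some assumptions: the helper names `table_blocks_to_rows` and `rows_to_csv` are made up for illustration, the `sample_blocks` list is hand-written rather than real Textract output, and the code assumes the document contains a single table.

```python
import csv
import io


def table_blocks_to_rows(blocks):
    """Rebuild table rows from Textract-style CELL and WORD blocks.

    Hypothetical helper: assumes all CELL blocks belong to one table.
    """
    by_id = {block["Id"]: block for block in blocks}
    cells = {}
    for block in blocks:
        if block["BlockType"] != "CELL":
            continue
        words = []
        # A CELL points at its WORD blocks through CHILD relationships
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for child_id in rel["Ids"]:
                child = by_id.get(child_id)
                if child is not None and child["BlockType"] == "WORD":
                    words.append(child["Text"])
        # RowIndex and ColumnIndex are 1-based
        cells[(block["RowIndex"], block["ColumnIndex"])] = " ".join(words)

    if not cells:
        return []
    n_rows = max(row for row, _ in cells)
    n_cols = max(col for _, col in cells)
    # Missing cells become empty strings so every row has the same width
    return [
        [cells.get((row, col), "") for col in range(1, n_cols + 1)]
        for row in range(1, n_rows + 1)
    ]


def rows_to_csv(rows):
    """Serialize rows to CSV text with the standard csv writer."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Hand-written stand-in for the "Blocks" list of a Textract response
sample_blocks = [
    {"Id": "w1", "BlockType": "WORD", "Text": "Name"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Qty"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Green"},
    {"Id": "w4", "BlockType": "WORD", "Text": "apple"},
    {"Id": "w5", "BlockType": "WORD", "Text": "3"},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w5"]}]},
]

rows = table_blocks_to_rows(sample_blocks)
csv_text = rows_to_csv(rows)
print(csv_text)
```

With real results you would collect the `Blocks` from every page of the response before calling the helper, and group cells per `TABLE` block if the document has more than one table.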

