Getting text into CSV format using Python

Question

I am able to get the data from a PDF into text. But now I need to get the data in CSV format with the table structure.

I tried to get the table structure, but it didn't work. Also, I'm able to generate the result as JSON. Is there a way to get the result into a tabular CSV format? Any input?

Below is the code I have used.

import boto3
import time

# Document
s3BucketName = "textractanalysisexample"
documentName = "sheet_example.pdf"

def startJob(s3BucketName, objectName):
   response = None
   client = boto3.client('textract')
   response = client.start_document_text_detection(
   DocumentLocation={
       'S3Object': {
           'Bucket': s3BucketName,
           'Name': objectName
       }
   })
   
   return response["JobId"]
   
def isJobComplete(jobId):
   # For production use cases, use SNS based notification 
   # Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
   time.sleep(5)
   client = boto3.client('textract')
   response = client.get_document_text_detection(JobId=jobId)
   status = response["JobStatus"]
   #print("Job status: {}".format(status))

   while(status == "IN_PROGRESS"):
       time.sleep(5)
       response = client.get_document_text_detection(JobId=jobId)
       status = response["JobStatus"]
       #print("Job status: {}".format(status))

   return status
   
def getJobResults(jobId):

   pages = []

   client = boto3.client('textract')
   response = client.get_document_text_detection(JobId=jobId)
   
   pages.append(response)
   print("Resultset page received: {}".format(len(pages)))
   nextToken = None
   if('NextToken' in response):
       nextToken = response['NextToken']

   while(nextToken):

       response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)

       pages.append(response)
       #print("Resultset page received: {}".format(len(pages)))
       nextToken = None
       if('NextToken' in response):
           nextToken = response['NextToken']

   return pages

def lambda_handler(event, context):

   jobId = startJob(s3BucketName, documentName)
   #print("Started job with id: {}".format(jobId))

   # isJobComplete returns the final status string, so compare it explicitly;
   # any non-empty string (including "FAILED") is truthy.
   response = []
   if isJobComplete(jobId) == "SUCCEEDED":
       response = getJobResults(jobId)

   # Print detected text
   for resultPage in response:
       for item in resultPage["Blocks"]:
           if item["BlockType"] == "LINE":
               print(item["Text"])

Answer 1

You can import the csv module to write to a CSV file like so:

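import csv

# Read the extracted text and split each non-empty line on commas
with open('my_pdf.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split(",") for line in stripped if line)
    # newline='' stops the csv module from writing blank rows on Windows
    with open('my_pdf.csv', 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('title', 'intro'))  # header row; use your own column names
        writer.writerows(lines)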

You can just put in the rows you need, and this splits your data into comma-separated values. You can find more information on the CSV writer (and the csv module in general) in the Python docs.
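Note that splitting lines on commas only helps if the extracted text already contains commas. If the goal is the table structure itself, plain text detection only returns LINE and WORD blocks. As far as I know, Textract reports tables when the job is started with `start_document_analysis(..., FeatureTypes=["TABLES"])` and the results are read with `get_document_analysis`. Below is a minimal sketch of turning such blocks into CSV rows. The helper names `table_rows` and `tables_to_csv` and the `sample_blocks` data are made up for illustration, and the sketch assumes you have already concatenated the `Blocks` lists from all result pages into one list. Merged cells and checkbox (selection) elements are not handled.

```python
import csv
import io


def table_rows(blocks):
    """Yield one list of rows (each a list of cell strings) per TABLE block."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # Collect the Ids listed under CHILD relationships, if there are any.
        ids = []
        for rel in block.get("Relationships") or []:
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return ids

    for table in blocks:
        if table["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(table):
            cell = by_id.get(cell_id)
            if cell is None or cell["BlockType"] != "CELL":
                continue
            # A cell's text is the WORD blocks it points to, joined by spaces.
            words = [by_id[i]["Text"] for i in child_ids(cell)
                     if i in by_id and by_id[i]["BlockType"] == "WORD"]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        if not cells:
            continue
        n_rows = max(r for r, _ in cells)
        n_cols = max(c for _, c in cells)
        # RowIndex and ColumnIndex are 1-based; missing cells become "".
        yield [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
               for r in range(1, n_rows + 1)]


def tables_to_csv(blocks):
    """Return all tables as one CSV string, separated by a blank line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i, rows in enumerate(table_rows(blocks)):
        if i:
            out.write("\n")
        writer.writerows(rows)
    return out.getvalue()


# Hand-written sample in the shape of Textract blocks (not real output).
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Qty"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Blue"},
    {"Id": "w4", "BlockType": "WORD", "Text": "pen"},
]

if __name__ == "__main__":
    print(tables_to_csv(sample_blocks), end="")
```

In the question's code this would mean building one list with something like `blocks = [b for page in response for b in page["Blocks"]]` and writing `tables_to_csv(blocks)` to a file, after switching the start and get calls to the document analysis variants.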

Solution 1
0 2020-07-28 21:58:28
