How to extract PDF content into a .txt file with Google Docs?
How can I use Google Docs programmatically to extract the text from a PDF file? I know there are other options, but I am curious whether Google Docs can be used for this purpose.
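When you retrieve PDF data as text with Python, you can do it with Drive API v3, but it takes 2 steps: upload the PDF so that Drive converts it to a Google Docs document, then export that document as plain text.
This sample uses the Python Quickstart. The details are at https://developers.google.com/drive/v3/web/quickstart/python . Please read "Step 1: Turn on the Drive API" and "Step 2: Install the Google Client Library". I apologize if you already know them.
To use the sample script below, modify the Quickstart as follows.
Add the following imports to the Quickstart.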
import io
from apiclient.http import MediaFileUpload, MediaIoBaseDownload
Change the scope to the following.
SCOPES = 'https://www.googleapis.com/auth/drive'
Replace the Quickstart's main() with the following.
The sample script converts a PDF file to a TXT file. Images in the PDF file cannot be carried over into the TXT file, though.
def main():
    credentials = get_credentials()
    http = credentials.authorize(httplib2.Http())
    service = discovery.build('drive', 'v3', http=http)

    pdffile = 'sample.pdf'  # PDF file
    txtfile = 'sample.txt'  # Text file

    # Step 1: upload the PDF and have Drive convert it to a Google Docs document.
    mime = 'application/vnd.google-apps.document'
    res = service.files().create(
        body={
            'name': pdffile,
            'mimeType': mime
        },
        media_body=MediaFileUpload(pdffile, mimetype=mime, resumable=True)
    ).execute()

    # Step 2: export the converted document as plain text.
    dl = MediaIoBaseDownload(
        io.FileIO(txtfile, 'wb'),
        service.files().export_media(fileId=res['id'], mimeType="text/plain")
    )
    done = False
    while done is False:
        status, done = dl.next_chunk()
    print("Done.")


if __name__ == '__main__':
    main()
I apologize if I have misunderstood your question.
The full script, with the changes added to the Quickstart:
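The download loop in main() discards the status value returned by next_chunk(). As a minimal sketch, the loop could be wrapped in a helper that reports progress. The names drain, FakeProgress and FakeDownloader are hypothetical and not part of the Quickstart; the fake classes only stand in for MediaIoBaseDownload so the sketch runs without Drive credentials.

```python
def drain(downloader, report=print):
    """Call next_chunk() until the download finishes; return the chunk count."""
    chunks = 0
    done = False
    while not done:
        status, done = downloader.next_chunk()
        chunks += 1
        # status may be None; otherwise progress() is a fraction from 0 to 1.
        if status is not None:
            report("Downloaded {:.0%}".format(status.progress()))
    return chunks


class FakeProgress:
    """Hypothetical stand-in for the progress object returned by next_chunk()."""

    def __init__(self, fraction):
        self._fraction = fraction

    def progress(self):
        return self._fraction


class FakeDownloader:
    """Hypothetical stand-in for MediaIoBaseDownload, used only for this demo."""

    def __init__(self, fractions):
        self._fractions = list(fractions)

    def next_chunk(self):
        fraction = self._fractions.pop(0)
        # The download is done once no fractions remain.
        return FakeProgress(fraction), not self._fractions


if __name__ == '__main__':
    drain(FakeDownloader([0.5, 1.0]))
```

In the sample script, the while loop could then be replaced by drain(dl), assuming dl is the MediaIoBaseDownload object created in main().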
from __future__ import print_function
import httplib2
import os
import io
from apiclient import discovery
from oauth2client import client
from oauth2client import tools
from oauth2client.file import Storage
from apiclient.http import MediaFileUpload, MediaIoBaseDownload

try:
    import argparse
    flags = argparse.ArgumentParser(parents=[tools.argparser]).parse_args()
except ImportError:
    flags = None

# If modifying these scopes, delete your previously saved credentials
# at ~/.credentials/drive-python-quickstart.json
SCOPES = 'https://www.googleapis.com/auth/drive'
CLIENT_SECRET_FILE = 'client_secret.json'
APPLICATION_NAME = 'Drive API Python Quickstart'


def get_credentials():
    """Gets valid user credentials from storage.

    If nothing has been stored, or if the stored credentials are invalid,
    the OAuth2 flow is completed to obtain the new credentials.

    Returns:
        Credentials, the obtained credential.
    """
    credential_path = os.path.join("./", 'drive-python-quickstart.json')
    store = Storage(credential_path)
    credentials = store.get()
    if not credentials or credentials.invalid:
        flow = client.flow_from_clientsecrets(CLIENT_SECRET_FILE, SCOPES)
        flow.user_agent = APPLICATION_NAME
        if flags:
            credentials = tools.run_flow(flow, store, flags)
        else:  # Needed only for compatibility with Python 2.6
            credentials = tools.run(flow, store)
        print('Storing credentials to ' + credential_path)
    return credentials


def main():
    credentials = get_credentials()
    http = credentials.authorize(httplib2.Http())
    service = discovery.build('drive', 'v3', http=http)

    pdffile = '../Downloads/sample.pdf'  # PDF file
    txtfile = '../Downloads/sample.txt'  # Text file

    # Step 1: upload the PDF and have Drive convert it to a Google Docs document.
    mime = 'application/vnd.google-apps.document'
    res = service.files().create(
        body={
            'name': pdffile,
            'mimeType': mime
        },
        media_body=MediaFileUpload(pdffile, mimetype=mime, resumable=True)
    ).execute()

    # Step 2: export the converted document as plain text.
    dl = MediaIoBaseDownload(
        io.FileIO(txtfile, 'wb'),
        service.files().export_media(fileId=res['id'], mimeType="text/plain")
    )
    done = False
    while done is False:
        status, done = dl.next_chunk()
    print("Done.")


if __name__ == '__main__':
    main()
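In the full script, pdffile holds a relative path, so the 'name' field sent to Drive is the whole path string. The following is a possible sketch, not part of the original answer, that derives a shorter document name and the output path from the PDF path. The helper names upload_body and text_path are hypothetical.

```python
from pathlib import Path

# MIME type that asks Drive to store the upload as a Google Docs document.
GOOGLE_DOC_MIME = 'application/vnd.google-apps.document'


def upload_body(pdf_path):
    """Build the metadata dict for files().create(), named after the PDF's stem."""
    return {'name': Path(pdf_path).stem, 'mimeType': GOOGLE_DOC_MIME}


def text_path(pdf_path):
    """Return the path of the .txt file to write next to the PDF."""
    return Path(pdf_path).with_suffix('.txt')


if __name__ == '__main__':
    print(upload_body('../Downloads/sample.pdf'))
    print(text_path('../Downloads/sample.pdf'))
```

With these helpers, main() could pass body=upload_body(pdffile) to files().create() and open str(text_path(pdffile)) for the download, assuming the output file should sit beside the PDF.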