How can I use Google Docs to extract PDF content into a .txt file?

Question

How can I programmatically extract text from a PDF file using Google Docs? I know there are other options, but I am curious whether Google Docs can be used for this purpose.

Answer 1

To retrieve the contents of a PDF as text with Python, you can use Drive API v3. It takes two steps:

Upload the PDF file as a Google Document
Download the Google Document as a TXT file

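This sample is based on the Python Quickstart. Details are at https://developers.google.com/drive/v3/web/quickstart/python . Please read "Step 1: Turn on the Drive API" and "Step 2: Install the Google Client Library". If you already know them, I apologize.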

To use the sample script below, please modify the Quickstart as follows.

1. Add imports

Please add the following imports to the Quickstart.

import io
from apiclient.http import MediaFileUpload, MediaIoBaseDownload

2. Change the scope

Please change the scope to the following.

SCOPES = 'https://www.googleapis.com/auth/drive'
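
The Quickstart caches the authorized credentials on disk, so a changed scope only takes effect after the cached file is deleted and the OAuth flow runs again. A minimal sketch of that cleanup, assuming the credentials are stored in `drive-python-quickstart.json` in the working directory (the path used by the full script in this answer); the helper name `clear_cached_credentials` is hypothetical and not part of the Quickstart:

```python
import os


def clear_cached_credentials(path='drive-python-quickstart.json'):
    """Delete the cached credential file, if any.

    Returns True when a file was removed and False when nothing was there.
    """
    if os.path.exists(path):
        os.remove(path)
        return True
    return False
```

Calling it once before running the script forces `get_credentials()` to request the new scope.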

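3. Change `main()`

Please change the Quickstart's main() to the following.

Sample script

The sample script converts a PDF file to a TXT file. However, images in the PDF file cannot be carried over into the TXT file.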

def main():
    # Authorize with the credentials obtained by the Quickstart helper.
    credentials = get_credentials()
    http = credentials.authorize(httplib2.Http())
    service = discovery.build('drive', 'v3', http=http)

    pdffile = 'sample.pdf' # PDF file
    txtfile = 'sample.txt' # Text file

    # Step 1: upload the PDF and ask Drive to convert it to a Google Document.
    mime = 'application/vnd.google-apps.document'
    res = service.files().create(
        body={
            'name': pdffile,
            'mimeType': mime
        },
        media_body=MediaFileUpload(pdffile, mimetype=mime, resumable=True)
    ).execute()

    # Step 2: export the Google Document as plain text into txtfile.
    dl = MediaIoBaseDownload(
        io.FileIO(txtfile, 'wb'),
        service.files().export_media(fileId=res['id'], mimeType="text/plain")
    )
    done = False
    while not done:
        status, done = dl.next_chunk()
    print("Done.")


if __name__ == '__main__':
    main()
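
`next_chunk()` returns a `(status, done)` pair, and the loop in `main()` discards `status`. If progress output is wanted, the loop could be factored out as sketched here. `download_all` is a hypothetical helper name; it assumes only that the downloader exposes `next_chunk()` and that `status`, when it is not `None`, has a `progress()` method returning a fraction between 0 and 1, as `MediaDownloadProgress` does in google-api-python-client. For exported files the total size may be unknown, so the reported percentage may not be informative.

```python
def download_all(downloader, report=print):
    """Call next_chunk() until the download is complete.

    Returns the number of chunks that were fetched.
    """
    chunks = 0
    done = False
    while not done:
        status, done = downloader.next_chunk()
        chunks += 1
        if status is not None:
            # progress() is a fraction; show it as a whole percentage.
            report('Downloaded %d%%' % int(status.progress() * 100))
    return chunks
```

In `main()` this would replace the `while` loop with `download_all(dl)`.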

If I misunderstood your question, I apologize.

The complete script, with the changes applied to the Quickstart:

from __future__ import print_function
import httplib2
import os
import io

from apiclient import discovery
from oauth2client import client
from oauth2client import tools
from oauth2client.file import Storage
from apiclient.http import MediaFileUpload, MediaIoBaseDownload

try:
    import argparse
    flags = argparse.ArgumentParser(parents=[tools.argparser]).parse_args()
except ImportError:
    flags = None

# If modifying these scopes, delete your previously saved credentials
# at ~/.credentials/drive-python-quickstart.json
SCOPES = 'https://www.googleapis.com/auth/drive'
CLIENT_SECRET_FILE = 'client_secret.json'
APPLICATION_NAME = 'Drive API Python Quickstart'


def get_credentials():
    """Gets valid user credentials from storage.

    If nothing has been stored, or if the stored credentials are invalid,
    the OAuth2 flow is completed to obtain the new credentials.

    Returns:
        Credentials, the obtained credential.
    """
    credential_path = os.path.join("./", 'drive-python-quickstart.json')
    store = Storage(credential_path)
    credentials = store.get()
    if not credentials or credentials.invalid:
        flow = client.flow_from_clientsecrets(CLIENT_SECRET_FILE, SCOPES)
        flow.user_agent = APPLICATION_NAME
        if flags:
            credentials = tools.run_flow(flow, store, flags)
        else:  # Needed only for compatibility with Python 2.6
            credentials = tools.run(flow, store)
        print('Storing credentials to ' + credential_path)
    return credentials


def main():
    credentials = get_credentials()
    http = credentials.authorize(httplib2.Http())
    service = discovery.build('drive', 'v3', http=http)

    pdffile = '../Downloads/sample.pdf'  # PDF file
    txtfile = '../Downloads/sample.txt'  # Text file

    mime = 'application/vnd.google-apps.document'
    res = service.files().create(
        body={
            'name': pdffile,
            'mimeType': mime
        },
        media_body=MediaFileUpload(pdffile, mimetype=mime, resumable=True)
    ).execute()

    dl = MediaIoBaseDownload(
        io.FileIO(txtfile, 'wb'),
        service.files().export_media(fileId=res['id'], mimeType="text/plain")
    )
    done = False
    while done is False:
        status, done = dl.next_chunk()
    print("Done.")


if __name__ == '__main__':
    main()
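
The full script passes the whole relative path (`'../Downloads/sample.pdf'`) as the Drive file `name` and spells out the output path by hand. One possible tidy-up, sketched with two hypothetical helpers (`upload_metadata` and `text_path_for` are not part of the Drive client library), derives both values from the PDF path:

```python
import os

GOOGLE_DOC_MIME = 'application/vnd.google-apps.document'


def upload_metadata(pdf_path):
    """Build the `body` dict for files().create() from a local PDF path."""
    return {
        'name': os.path.basename(pdf_path),  # file name only, no directories
        'mimeType': GOOGLE_DOC_MIME,         # ask Drive to convert to a Google Doc
    }


def text_path_for(pdf_path):
    """Return the PDF path with its extension replaced by .txt."""
    root, _ext = os.path.splitext(pdf_path)
    return root + '.txt'
```

With these, `body=upload_metadata(pdffile)` and `txtfile = text_path_for(pdffile)` would replace the literals in `main()`.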

Solution 1
4 Accepted 2017-05-01 05:21:00
