How do I convert bytes from a PDF into a string in Python?

Question

I am trying to convert the bytes I get from book_download_page = requests.get(link) followed by content = book_download_page.content into a string.

What I have tried:

content = book_download_page.content.decode('utf-8')

The error I get:

'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Edit: You can try this link for downloading

Thank you!

Answer 1

PDF contents are made up of tokens, see here:

Adobe PDF Reference

You can parse PDFs and extract text with tools like PoDoFo in C++ and PDFBox in Java, and there is also a PDF text stripper in Python.
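This is also why decode('utf-8') fails in the question: a PDF is a binary file, not UTF-8 text. Many PDF writers place a comment line of bytes above 127 right after the %PDF- header so that tools treat the file as binary, and those bytes are not valid UTF-8. A minimal sketch, assuming a made-up `sample` header (the real file behind the link may start differently) and two hypothetical helpers, `is_pdf` and `try_decode`:

```python
# Hypothetical first bytes of a PDF file; the real download may differ.
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def is_pdf(data: bytes) -> bool:
    """Return True if the data starts with the PDF header marker."""
    return data.startswith(b"%PDF-")


def try_decode(data: bytes):
    """Try UTF-8 decoding; return the text, or the UnicodeDecodeError raised."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc


result = try_decode(sample)
print(is_pdf(sample))  # the header check passes
print(result)          # the decode attempt fails on the binary comment line
```

Byte 0xe2 sits at offset 10 in this sample, directly after the header line and the % that opens the comment, which matches the position reported in the error.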

import pdfbox

pdf_ref = pdfbox.PDFBox()
pdf_ref.extract_text('directory/originalPDF.pdf')   # writes the extracted text to directory/originalPDF.txt

This is a simple example paraphrased from python-pdfbox, which is also useful if you want to extract other things such as images.
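Since extract_text takes a file path, the downloaded bytes have to be written to disk unchanged first, in binary form rather than decoded. A rough sketch, assuming `content` holds book_download_page.content from the question; `save_pdf` is a hypothetical helper name, and the header check is only a simple guard against the server returning something else (for example an HTML error page):

```python
from pathlib import Path


def save_pdf(content: bytes, path: str) -> Path:
    """Write raw PDF bytes to disk unchanged and return the target path."""
    # Simple sanity check: PDF files normally begin with the %PDF- marker.
    if not content.startswith(b"%PDF-"):
        raise ValueError("content does not look like a PDF")
    target = Path(path)
    target.write_bytes(content)  # binary write, no text decoding involved
    return target


# Intended use with the names from the question (not executed here):
#   pdf_path = save_pdf(book_download_page.content, "book.pdf")
#   pdf_ref.extract_text(str(pdf_path))
```

The resulting string then comes from reading the .txt file the extractor writes, not from decoding the PDF bytes directly.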

Answer 1
0 2020-06-25 03:46:35
