How to convert bytes from PDF to string in Python?

I am trying to convert the bytes I get from book_download_page = requests.get(link) followed by content = book_download_page.content into a string.

What I have tried:

content = book_download_page.content.decode('utf-8')

The error I get:

'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Edit: You can try this link for downloading.

Thank you!

PDF contents are made up of tokens; see here:

Adobe PDF Reference
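
As a rough illustration of why the decode('utf-8') call in the question fails: a PDF file is binary data, and many PDF writers place a comment line of high-bit bytes right after the %PDF- header to mark the file as binary. The sample bytes below are an assumed, typical file start, not the actual file from the question, and try_decode is a hypothetical helper name.

```python
# Assumed, typical start of a PDF file: the version header, then a
# comment line made of bytes >= 0x80 that marks the file as binary.
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def try_decode(data: bytes) -> str:
    """Return the decoded text, or a description of the decode failure."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded.
        return (
            f"cannot decode byte {data[exc.start]:#x} "
            f"at position {exc.start}: {exc.reason}"
        )


print(try_decode(sample))
```

The bytes are not UTF-8 text in the first place, so choosing a different codec would not produce readable content either; the text has to be extracted by a PDF parser.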

You can parse PDFs and extract text with tools like PoDoFo in C++ and PDFBox in Java, and there is also a PDF text stripper in Python.

import pdfbox

pdf_ref = pdfbox.PDFBox()
pdf_ref.extract_text('directory/originalPDF.pdf')   # Result .txt will be in directory/originalPDF.txt

This is a simple example paraphrased from python-pdfbox, which is also useful if you want to convert other things, such as images.
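
Because the extractor above works on a file path, the downloaded bytes need to be written to disk unchanged rather than decoded. A minimal sketch of that step follows; save_pdf is a hypothetical helper name, and the header check assumes the %PDF- marker sits at the very start of the data, which also catches the case where the server returned an HTML page instead of a PDF.

```python
from pathlib import Path


def save_pdf(content: bytes, path: str) -> Path:
    """Write raw PDF bytes to disk without decoding them."""
    # Assumption: a valid download starts with the PDF header marker.
    if not content.startswith(b"%PDF-"):
        raise ValueError("content does not start with a PDF header")
    target = Path(path)
    # write_bytes opens the file in binary mode, so no codec is involved.
    target.write_bytes(content)
    return target
```

With the question's variables, this could be called as save_pdf(book_download_page.content, 'directory/originalPDF.pdf'), and the resulting path then passed to extract_text.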
