How to convert bytes from PDF to string in Python?
I am trying to convert the bytes that I get from
book_download_page = requests.get(link)
and then
content = book_download_page.content
into a string.
What I have tried:
content = book_download_page.content.decode('utf-8')
The error I get:
'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
Edit: You can try this link for downloading
Thank you!
A PDF is a binary format whose contents are made up of tokens, so the downloaded bytes are not plain text that can be decoded in one step. See the Adobe PDF Reference.
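To illustrate why the decode call fails, here is a minimal sketch. It assumes the download starts the way many PDFs do: a header line followed by a comment line of high-bit bytes that marks the file as binary. The `sample` bytes and the helper names `is_pdf` and `try_decode` are made up for illustration and stand in for `book_download_page.content`.

```python
# Stand-in for book_download_page.content: the first bytes of a typical PDF.
# The header line is followed by a comment line of bytes >= 0x80, which is
# not valid UTF-8.
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def is_pdf(content: bytes) -> bool:
    """Return True if the bytes start with the PDF header marker."""
    return content.startswith(b"%PDF-")


def try_decode(content: bytes):
    """Return the decoded text, or the UnicodeDecodeError if decoding fails."""
    try:
        return content.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc


result = try_decode(sample)
print(is_pdf(sample))  # whether the payload carries a PDF header
print(result)          # the decoded text or the decoding error
```

If the real file begins like this sample, the failing byte sits right after the header, which would match the position reported in the question. Checking for the `%PDF-` header first is a cheap way to confirm that the response is a PDF before handing it to a parser.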
You can parse PDFs and extract text with tools like PoDoFo in C++ and PDFBox in Java, and there is also a PDF text stripper in Python.
import pdfbox
pdf_ref = pdfbox.PDFBox()
pdf_ref.extract_text('directory/originalPDF.pdf') # Result .txt will be in directory/originalPDF.txt
This is a simple example paraphrased from python-pdfbox, which is also useful if you want to extract other things such as images.
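Since the example above reads the PDF from a path, the downloaded bytes need to be written to disk first. A minimal sketch, assuming the bytes come from `book_download_page.content`; the helper name `save_pdf` and the stand-in bytes are hypothetical, and the demo writes into a temporary directory:

```python
import tempfile
from pathlib import Path


def save_pdf(content: bytes, path: str) -> Path:
    """Write downloaded PDF bytes to disk without decoding them."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)  # binary write, the bytes are stored unchanged
    return target


# Demo with stand-in bytes instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_pdf(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n", f"{tmp}/directory/originalPDF.pdf")
    round_trip = saved.read_bytes()
    print(saved.name, len(round_trip))
```

With a real download, the call would look like `save_pdf(book_download_page.content, 'directory/originalPDF.pdf')`, and the resulting path can then be passed to `extract_text`.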