Extract bold and underlined text from a .pdf

Question

I need to extract text from a PDF, but the PDF contains some bold and underlined text. I tried PyPDF2, but I get an error when trying to read PDFs that contain formatted text.

    import PyPDF2

    # Open in binary mode; PdfFileReader is the legacy (pre-3.0) PyPDF2 API.
    pdf_file = open('Downloads/th.pdf', 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.getPage(0)  # first page
    page_content = page.extractText()
    print(page_content)

Output

    ˘ˇˆˆ˝˛˚˜ ˜˚!˘˘ˇˆ˙˛˝˚˜˚ !ˆ"#$ˆ%&'˛"˝#$%˝˚'(˚˛)˛˝*+!-.$ˆ˚˛˚˛˘/˛˛0˛122/ 
    ˘˛˘˚˘˚2ˆ$".#$ˆ%˘˛˛$ˆ$%#$ˆ%˛˛˛˛˝˝(0/ 0$%˙˚˙3#"$˘--4˛0˚! 
    ˆ"#$ˆ%56272˛ˇ5'˛6222˛'4˘8(9˛(˜˚˛&˙˙˙˙˙

Answer 1

I was using Python 3.6 and the PyPDF2 module:

Get and install Python 3.
Install the PyPDF2 module using pip. Run in a terminal (or CMD/PowerShell on Windows): pip install PyPDF2
Run this code in the Python console, as in the tutorial, to read the PDF file and extract the text:
```
import PyPDF2

pdfFileObj = open('meetingminutes.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
pageObj.extractText()
```
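
The snippet above returns plain text only, so it does not tell you which parts are bold. Here is a rough sketch of one way to pick out bold runs. It assumes you can use pypdf (the successor to PyPDF2) and its `visitor_text` callback, and it assumes bold fonts have "Bold" somewhere in their name, which is common but not guaranteed. The helper names `is_bold_font` and `extract_bold_text` are made up for illustration.

```python
def is_bold_font(base_font):
    """Heuristic (an assumption): a font is bold if its name mentions 'bold'."""
    return base_font is not None and "bold" in str(base_font).lower()


def extract_bold_text(pdf_path, page_index=0):
    """Collect the text fragments on one page that are set in a bold-named font."""
    # Imported here so is_bold_font can be used without pypdf installed.
    from pypdf import PdfReader

    bold_parts = []

    def visitor(text, cm, tm, font_dict, font_size):
        # pypdf passes the font dictionary for each text fragment;
        # it may be None when the font is unknown.
        base_font = font_dict.get("/BaseFont") if font_dict else None
        if text and is_bold_font(base_font):
            bold_parts.append(text)

    reader = PdfReader(pdf_path)
    reader.pages[page_index].extract_text(visitor_text=visitor)
    return "".join(bold_parts)
```

You would call it as `print(extract_bold_text('Downloads/th.pdf'))`. Underlines are a different story: they are typically drawn as separate line graphics rather than stored as a font property, so this approach would not detect them.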

Answer 1
2 2019-01-17 12:36:22

从 .pdf 中提取粗体和带下划线的文本

问题描述

1 个解决方案

解决方案1 2 2019-01-17 12:36:22

解决方案1
2 2019-01-17 12:36:22