How to extract bold text from a PDF using Python?

Question

The list below provides examples of items and services that should not be billed separately. Please note that the list is not all inclusive.

1. Surgical rooms and services – To include surgical suites, major and minor, treatment rooms, endoscopy labs, cardiac cath labs, X-ray.

2. Facility Basic Charges - pulmonary and cardiology procedural rooms. The hospital's charge for surgical suites and services shall include the entire above listed nursing personnel services, supplies, and equipment.

I want output like:

Surgical rooms and services
Facility Basic Charges

The first sentence is also bold, but we need to omit it. We only need to extract the bold text that belongs to the numbered items.

Answer 1

You can do it using this code:

import pdfplumber

with pdfplumber.open('test.pdf') as pdf:
    page = pdf.pages[0]  # first page only
    # Keep only character objects whose font name contains "Bold".
    bold_page = page.filter(lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"])
    print(bold_page.extract_text())

It uses the pdfplumber library, so for more information you can check its documentation.
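The `"Bold" in obj["fontname"]` test only works if the PDF's bold font actually has "Bold" in its name, so it can help to look at which font names a page uses first. The following is a minimal sketch, not part of the original answer: `is_bold` and `font_usage` are hypothetical helpers, and `sample_chars` is made-up data shaped like the entries of pdfplumber's `page.chars` (dicts with `text` and `fontname` keys). It also assumes embedded fonts may carry a subset prefix such as `ABCDEF+`, which is stripped before a case-insensitive check.

```python
from collections import Counter


def is_bold(fontname):
    # Drop a subset prefix like "ABCDEF+" so its letters cannot match by accident,
    # then look for "bold" without regard to case.
    base = fontname.split("+", 1)[-1]
    return "bold" in base.lower()


def font_usage(chars):
    # Count how many characters use each font name.
    return Counter(c["fontname"] for c in chars)


# Hypothetical sample data; with a real file this would be pdf.pages[0].chars.
sample_chars = [
    {"text": "1", "fontname": "ABCDEF+Arial-BoldMT"},
    {"text": ".", "fontname": "ABCDEF+Arial-BoldMT"},
    {"text": "T", "fontname": "GHIJKL+ArialMT"},
]

usage = font_usage(sample_chars)
bold_fonts = sorted(name for name in usage if is_bold(name))
print(usage)
print(bold_fonts)
```

If the printed font names show no "Bold" at all (some fonts use names such as "Black" or "Heavy"), the filter condition would need to be adapted to whatever the document really uses.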

Answer 2

Use this code:

import re
import pdfplumber

demo = []
with pdfplumber.open('HCSC IL Inpatient_Outpatient Unbundling Policy- Facility.pdf') as pdf:
    # Slice instead of indexing up to 50 and catching IndexError on shorter files.
    for page in pdf.pages[:50]:
        # Keep only character objects whose font name contains "Bold".
        bold_page = page.filter(lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"])
        bold_text = bold_page.extract_text() or ""
        # Collect the bold lines that start with a number, e.g. "1. ".
        demo.extend(re.findall(r'^\d+\.\s.*$', bold_text, flags=re.MULTILINE))

print(demo)
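To get exactly the output asked for in the question (the headings without their numbers, and without the bold first sentence), the bold text can be post-processed line by line. This is a rough sketch under a few assumptions: the item number is part of the bold text, each item starts on its own line, and `numbered_headings` and `sample` are hypothetical names and data used only for illustration.

```python
import re

# A line such as "1. Surgical rooms and services"; the heading is captured without the number.
NUMBERED = re.compile(r"^\s*\d+\.\s+(.+?)\s*$")


def numbered_headings(bold_text):
    headings = []
    for line in bold_text.splitlines():
        match = NUMBERED.match(line)
        if match:
            # Drop a trailing dash in case it was bold as well.
            headings.append(match.group(1).rstrip(" -–"))
    return headings


# Hypothetical bold text, as it might come back from extract_text() on the filtered page.
sample = (
    "The list below provides examples of items and services\n"
    "1. Surgical rooms and services –\n"
    "2. Facility Basic Charges"
)

for heading in numbered_headings(sample):
    print(heading)
```

Lines that do not start with a number, such as the bold introductory sentence, are simply skipped.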

2 answers

Answer 1
0 2022-01-31 20:59:34

Answer 2
0 Accepted 2022-02-01 17:36:08
