How to extract bold text from a PDF using Python?

The list below provides examples of items and services that should not be billed separately. Please note that the list is not all inclusive.

1. Surgical rooms and services – To include surgical suites, major and minor, treatment rooms, endoscopy labs, cardiac cath labs, X-ray.

2. Facility Basic Charges - pulmonary and cardiology procedural rooms. The hospital's charge for surgical suites and services shall include the entire above listed nursing personnel services, supplies, and equipment

I want output like:

  1. Surgical rooms and services
  2. Facility Basic Charges

The first sentence is also bold, but it should be omitted. I only need to extract the bold text that begins with a number.

You can do it using this code:

import pdfplumber

with pdfplumber.open('test.pdf') as pdf:
    page = pdf.pages[0]
    # Keep only the characters whose font name contains "Bold".
    bold_page = page.filter(lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"])
    print(bold_page.extract_text())

It uses the pdfplumber library, so for more information you can check its documentation.
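The `"Bold" in obj["fontname"]` test only works if the PDF's bold font actually has "Bold" in its name, which is not guaranteed. Below is a minimal sketch for checking which font names a page uses before relying on the filter. The helper names `is_bold_char`, `count_fonts` and `report_fonts` are hypothetical, and the case-insensitive match is an assumption. Fonts named "Black", "Heavy" or similar would need their own rule.

```python
from collections import Counter


def is_bold_char(obj):
    """Return True for character objects whose font name suggests a bold face."""
    fontname = obj.get("fontname") or ""
    return obj.get("object_type") == "char" and "bold" in fontname.lower()


def count_fonts(chars):
    """Count how many character objects use each font name."""
    return Counter(c["fontname"] for c in chars if "fontname" in c)


def report_fonts(path, page_index=0):
    """Open a PDF with pdfplumber and count the fonts used on one page."""
    import pdfplumber  # imported here so the helpers above work without it

    with pdfplumber.open(path) as pdf:
        return count_fonts(pdf.pages[page_index].chars)
```

For example, `print(report_fonts('test.pdf'))` lists the font names on the first page, and `page.filter(is_bold_char)` can then stand in for the lambda if the case-insensitive match suits the document.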

Use this code:

import re
import pdfplumber

demo = []
with pdfplumber.open('HCSC IL Inpatient_Outpatient Unbundling Policy- Facility.pdf') as pdf:
    # Iterate over the pages directly instead of guessing a page count.
    for page in pdf.pages:
        # Keep only the characters whose font name contains "Bold".
        bold_page = page.filter(lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"])
        text = bold_page.extract_text() or ""
        # Keep only the bold lines that start with a number, e.g. "1. ...".
        demo.extend(re.findall(r'^\d+\.\s.*$', text, flags=re.MULTILINE))
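The numbered-line filtering can be tried on its own, without a PDF, by running it on a string that stands in for the extracted bold text. This is only a sketch: the function name `numbered_headings` and the sample string are made up for illustration, and it assumes each numbered heading sits on its own line in the extracted text.

```python
import re

# A line that starts with digits, a dot, and whitespace, e.g. "1. Title".
NUMBERED_LINE = re.compile(r"^[ \t]*(\d+)\.[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def numbered_headings(bold_text):
    """Return the numbered lines found in bold_text, dropping all other lines."""
    return [f"{number}. {title}" for number, title in NUMBERED_LINE.findall(bold_text)]


# Stand-in for the bold text of one page: the unnumbered first sentence
# is bold too, but it does not match the pattern and is left out.
sample = (
    "The list below provides examples of items and services "
    "that should not be billed separately.\n"
    "1. Surgical rooms and services\n"
    "2. Facility Basic Charges"
)

headings = numbered_headings(sample)
print("\n".join(headings))
```

Matching one line at a time with `re.MULTILINE` keeps each heading as a separate list item, which is easier to post-process than a single string.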
