Is there a way to extract the sentence after bold text in Python?

Question

I have extracted some bold text from a PDF in Python, which works fine. But I also want to extract the sentence, or more than one sentence, that follows the bold text, e.g. "Blue sky is what we see when we look up".

I can extract the "Blue sky" part, but I'm not able to extract the "is what we see when we look up" part.

import pdfplumber

with pdfplumber.open('C:/Users/somefile.pdf') as pdf:
    for i in range(12, 15):
        page = pdf.pages[i]
        # Keep only the characters whose font name contains "Bold"
        bold_only = page.filter(lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"])
        print(bold_only.extract_text())

Answer 1

Filtering the page content by a formatting criterion (for example, keeping only the bold characters) and calling .extract_text() on the result gives you a Python string.

Note that all of the bold text on the page is returned, not only the first section found. This is why the code below uses only the first few characters of the bold string (here 8): a prefix short enough that you can be sure it appears in the page text as continuous text.

Calling .extract_text() on the unfiltered page gives you the text of the entire page as a Python string, so you have two Python strings to work with. As an example, the code below uses the dot . as the delimiter in order to extract the entire sentence "Blue sky is what we see when we look up."

Using the code below you should be able to extract that sentence:

import pdfplumber

str_pdf_file = 'C:/Users/somefile.pdf'  # path to your PDF

with pdfplumber.open(str_pdf_file, password='') as pdf:
    first_page = pdf.pages[0]
    # Full page text and bold-only text, both as plain Python strings
    first_page_entire_text = first_page.extract_text()
    first_page_bold_text = first_page.filter(lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"]).extract_text()
    # Locate the first 8 bold characters in the full page text (find() gives -1 if absent)
    indx_bold = first_page_entire_text.find(first_page_bold_text[0:8])
    # Offset of the next dot, counted from indx_bold
    indx_dot = first_page_entire_text[indx_bold:].find('.')
    sentence_starting_with_bold = first_page_entire_text[indx_bold: indx_bold + indx_dot + 1]
    print(sentence_starting_with_bold)
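
Since the question also asks for more than one sentence after the bold text, here is a minimal sketch of the same find-and-slice idea as a plain string helper. The names sentences_after_bold, n_sentences and prefix_len are made up for this illustration and are not part of pdfplumber. It assumes that a dot ends a sentence, so abbreviations and decimal numbers will cut it short, and it returns None when the bold prefix is not found instead of slicing from a -1 index.

```python
def sentences_after_bold(full_text, bold_text, n_sentences=1, prefix_len=8):
    """Return up to n_sentences sentences of full_text, starting at the bold text.

    Hypothetical helper: returns None when the bold prefix cannot be located.
    """
    prefix = bold_text[:prefix_len]
    if not prefix:
        return None
    start = full_text.find(prefix)
    if start == -1:
        return None
    end = start
    for _ in range(n_sentences):
        dot = full_text.find(".", end)
        if dot == -1:
            # No further dot: return everything up to the end of the text
            return full_text[start:]
        end = dot + 1
    return full_text[start:end]


# Stand-in strings; with pdfplumber these would come from the two
# .extract_text() calls shown above.
page_text = "Intro line. Blue sky is what we see when we look up. It is often cloudy. End."
bold_text = "Blue sky"

print(sentences_after_bold(page_text, bold_text))
# Blue sky is what we see when we look up.
print(sentences_after_bold(page_text, bold_text, n_sentences=2))
# Blue sky is what we see when we look up. It is often cloudy.
```

With a real PDF you would pass first_page_entire_text and first_page_bold_text from the code above as the first two arguments.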
