
Is there a way to extract sentences after bold text in Python?

I have extracted some bold text from a PDF in Python, which works fine. But I also want to extract the sentence, or more than one sentence, that follows the bold text, e.g. "Blue sky is what we see when we look up".

I can extract the "Blue sky" part, but I'm not able to extract the "is what we see when we look up" part.

import pdfplumber

with pdfplumber.open('C:/Users/somefile.pdf') as pdf:
    for i in range(12, 15):
        page = pdf.pages[i]
        # Keep only the characters whose font name contains "Bold"
        bold_page = page.filter(lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"])
        print(bold_page.extract_text())

When you filter the page content by a formatting criterion (for example, keeping only the bold characters) and call .extract_text() on the result, pdfplumber returns a Python string.

Note that all of the bold text on the page is returned, not only the first section found. This is why, when searching the page text, the code in this answer uses only the first characters of the bold string (first_page_bold_text[0:8]): a short prefix that is sure to appear in the page text as continuous text.
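If the individual bold sections are needed separately rather than as one combined string, one possible approach is to walk the page's characters and group consecutive bold ones. The following is only a sketch: bold_runs is a hypothetical helper, and the sample list is hand-made stand-in data shaped like the dictionaries in pdfplumber's page.chars (keys "text" and "fontname"). It assumes the characters arrive in reading order and that spaces inside a bold phrase are stored as bold characters, which not every PDF guarantees.

```python
def bold_runs(chars):
    """Group consecutive bold characters into separate strings.

    `chars` is expected to look like pdfplumber's page.chars:
    a list of dicts with at least "text" and "fontname" keys.
    """
    runs, current = [], []
    for ch in chars:
        if "Bold" in ch.get("fontname", ""):
            current.append(ch["text"])
        elif current:
            # A non-bold character ends the current bold run.
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


def make_chars(text, fontname):
    """Build stand-in char dicts for the demo (hypothetical test data)."""
    return [{"text": c, "fontname": fontname} for c in text]


sample_chars = (
    make_chars("Blue sky", "Arial-Bold")
    + make_chars(" is what we see. ", "Arial")
    + make_chars("Green grass", "Arial-Bold")
    + make_chars(" is below.", "Arial")
)

print(bold_runs(sample_chars))
```

With a real file, the list would come from a page object, e.g. bold_runs(pdf.pages[0].chars). Each returned string can then be searched for in the page text on its own.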

With .extract_text() you can also get the text of the entire page as a Python string, so you have two Python strings to work with: the full page text and the bold text. As an example, the code below uses the dot . as the delimiter in order to extract the entire sentence "Blue sky is what we see when we look up."

Using the code below you should be able to extract that sentence:

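import pdfplumber

str_pdf_file = 'C:/Users/somefile.pdf'  # path to the PDF (taken from the question)

with pdfplumber.open(str_pdf_file, password='') as pdf:
    first_page = pdf.pages[0]
    # Full page text and bold-only text, both as plain Python strings
    first_page_entire_text = first_page.extract_text()
    first_page_bold_text   = first_page.filter(lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"]).extract_text()
    # Locate the bold text via its first 8 characters
    indx_bold = first_page_entire_text.find(first_page_bold_text[0:8])
    if indx_bold == -1:
        print("Bold text not found in the page text")
    else:
        # Offset of the next '.' relative to the start of the bold text
        indx_dot  = first_page_entire_text[indx_bold:].find('.')
        sentence_starting_with_bold = first_page_entire_text[indx_bold: indx_bold + indx_dot + 1]
        print(sentence_starting_with_bold)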
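The string handling above can also be separated from the PDF reading, which makes it easier to try out on plain strings. The following is a minimal sketch, not part of pdfplumber: sentence_after_bold, prefix_len and delimiter are hypothetical names chosen for illustration, and the sample strings stand in for the output of .extract_text(). It returns None when the bold prefix is not found, and the rest of the text when no delimiter follows.

```python
def sentence_after_bold(full_text, bold_text, prefix_len=8, delimiter="."):
    """Return the text from the bold match up to and including the next delimiter.

    Returns None if the first `prefix_len` characters of `bold_text`
    cannot be found in `full_text`.
    """
    prefix = bold_text[:prefix_len]
    if not prefix:
        return None
    start = full_text.find(prefix)
    if start == -1:
        return None
    end = full_text.find(delimiter, start)
    if end == -1:
        # No delimiter after the bold text: return everything that is left.
        return full_text[start:]
    return full_text[start:end + len(delimiter)]


# Stand-in strings; with a real PDF these would come from .extract_text().
page_text = "Intro line.\nBlue sky is what we see when we look up. More text."
bold_text = "Blue sky"

print(sentence_after_bold(page_text, bold_text))
```

To get more than one sentence, the same idea can be repeated by searching for the delimiter again starting from the position just after the previous one.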
