
How to also extract sentence before and after sentence with keyword or substring?

I would like to create a function that extracts the sentence containing a keyword or substring of interest, along with the sentences before and after it. If possible, I would like to specify how many sentences to extract. The function should not raise an error if the key sentence is the first sentence.

In the example below, I created a function that extracts a single sentence. How can I expand it to include more sentences?


data = [[0, 'Johannes Gensfleisch zur Laden zum Gutenberg was a German inventor, printer, publisher, and goldsmith who introduced printing to Europe with his mechanical movable-type printing press. His work started the Printing Revolution in Europe and is regarded as a milestone of the second millennium, ushering in the modern period of human history. It played a key role in the development of the Renaissance, Reformation, Age of Enlightenment, and Scientific Revolution, as well as laying the material basis for the modern knowledge-based economy and the spread of learning to the masses.'], 
[1, 'While not the first to use movable type in the world,[a] Gutenberg was the first European to do so. His many contributions to printing include the invention of a process for mass-producing movable type; the use of oil-based ink for printing books;[7] adjustable molds;[8] mechanical movable type; and the use of a wooden printing press similar to the agricultural screw presses of the period.[9] His truly epochal invention was the combination of these elements into a practical system that allowed the mass production of printed books and was economically viable for printers and readers alike. Gutenbergs method for making type is traditionally considered to have included a type metal alloy and a hand mould for casting type. The alloy was a mixture of lead, tin, and antimony that melted at a relatively low temperature for faster and more economical casting, cast well, and created a durable type.'], 
[2, 'The use of movable type was a marked improvement on the handwritten manuscript, which was the existing method of book production in Europe, and upon woodblock printing, and revolutionized European book-making. Gutenbergs printing technology spread rapidly throughout Europe and later the world. His major work, the Gutenberg Bible (also known as the 42-line Bible), was the first printed version of the Bible and has been acclaimed for its high aesthetic and technical quality. In Renaissance Europe, the arrival of mechanical movable type printing introduced the era of mass communication which permanently altered the structure of society. The relatively unrestricted circulation of information—including revolutionary ideas—transcended borders, captured the masses in the Reformation, and threatened the power of political and religious authorities; the sharp increase in literacy broke the monopoly of the literate elite on education and learning and bolstered the emerging middle class. Across Europe, the increasing cultural self-awareness of its people led to the rise of proto-nationalism, accelerated by the flowering of the European vernacular languages to the detriment of Latins status as lingua franca. In the 19th century, the replacement of the hand-operated Gutenberg-style press by steam-powered rotary presses allowed printing on an industrial scale, while Western-style printing was adopted all over the world, becoming practically the sole medium for modern bulk printing. ']]

# Imports needed by the rest of this snippet
import re
import pandas as pd

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['text_number', 'text'])

def extract_key_sentences(text,word_list):
    joined_word_list = '|'.join(word_list)
    print(joined_word_list)
    sentence = re.findall(r'([^.]*?' + joined_word_list + r'[^.]*\.)', text)
    return sentence

tools_list=['printing press','paper','ink','woodblock','molds','method']
df['key_sentence']=df['text'].apply(lambda x : extract_key_sentences(str(x),tools_list))
df['key_sentence'].head()

With the current approach, it seems that the full sentence is not being extracted: see row 3, where the text starts with 'method of book production'.
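The truncation most likely comes from regex alternation precedence: `|` binds more loosely than concatenation, so in the joined pattern only the first keyword carries the `[^.]*?` prefix and only the last one carries the `[^.]*\.` suffix. A minimal sketch of a grouped pattern follows; the function name `extract_key_sentences_fixed` and the shortened sample text are illustrative assumptions, not part of the original code.

```python
import re

def extract_key_sentences_fixed(text, word_list):
    # Escape each keyword and wrap the alternation in a non-capturing group,
    # so the sentence prefix and suffix apply to every alternative.
    alternation = '|'.join(map(re.escape, word_list))
    pattern = r'[^.]*?(?:' + alternation + r')[^.]*\.'
    # Strip the whitespace left over from the previous sentence boundary.
    return [s.strip() for s in re.findall(pattern, text)]

sample = ('The use of movable type was a marked improvement on the handwritten '
          'manuscript, which was the existing method of book production in Europe. '
          'Gutenbergs printing technology spread rapidly throughout Europe.')
result = extract_key_sentences_fixed(sample, ['woodblock', 'method'])
print(result)  # one full sentence, starting with 'The use of movable type'
```

This still treats every period as a sentence boundary and matches keywords as plain substrings, so abbreviations, decimals, or a short keyword such as 'ink' inside a longer word can give unexpected results.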

IIUC, you can split your strings and explode to have one sentence per row, identify the sentences matching the keywords, and use a groupby.rolling.max to propagate the match to the neighboring sentences.

Then aggregate back as a single string (optional):

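import re

word_list = ['printing press', 'paper', 'ink', 'woodblock', 'molds', 'method']
# Escape each keyword so regex metacharacters are matched literally
joined_word_list = '|'.join(map(re.escape, word_list))

N = 3  # number of sentences to keep before and after each match

df[f'{N}_around'] = (df['text']
 # split after each period, then one sentence per row (original index is kept)
 .str.split(r'(?<=\.)\s*').explode()
 .loc[lambda d:
      # flag matching sentences, then spread the flag to N neighbours per text
      d.str.contains(joined_word_list)
       .groupby(level=0).rolling(2*N+1, center=True, min_periods=1).max()
       .droplevel(1).astype(bool)
     ]
 # join the kept sentences back into one string per original row
 .groupby(level=0).agg(' '.join)
)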

In this particular case, for the third row, it matches only the first sentence and propagates to keep the following 3, dropping the remaining 4.
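To see how the centered rolling max spreads a match to its neighbors, here is a toy sketch on a hand-made flag series. The series `hits` and `first`, and the choice of N = 1, are made up for illustration; the cast to int is only there to keep the arithmetic explicit.

```python
import pandas as pd

def spread(flags, n):
    # A centered window of 2*n+1 rows is True if any row inside it is True;
    # min_periods=1 lets the shorter windows at the edges still produce a value.
    return (flags.astype(int)
                 .rolling(2 * n + 1, center=True, min_periods=1)
                 .max()
                 .astype(bool))

# Hypothetical flags: one boolean per sentence, True where a keyword matched.
hits = pd.Series([False, False, False, True, False, False, False])
print(spread(hits, 1).tolist())
# [False, False, True, True, True, False, False]

# A match in the very first sentence does not raise an error.
first = pd.Series([True, False, False, False])
print(spread(first, 1).tolist())
# [True, True, False, False]
```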

output:

   text_number                                               text  \
0            0  Johannes Gensfleisch zur Laden zum Gutenberg w...   
1            1  While not the first to use movable type in the...   
2            2  The use of movable type was a marked improveme...   

                                            3_around  
0  Johannes Gensfleisch zur Laden zum Gutenberg w...  
1  While not the first to use movable type in the...  
2  The use of movable type was a marked improveme...  
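If a plain function is preferred over the pandas chain, the same idea can be sketched without rolling windows. This is an alternative under stated assumptions, not the method used above: the name `extract_with_context` and its `n_before` / `n_after` parameters are hypothetical, and the sentence splitter is deliberately naive (it breaks after any period followed by whitespace).

```python
import re

def extract_with_context(text, keywords, n_before=1, n_after=1):
    """Return sentences containing a keyword plus neighbouring sentences."""
    # Naive splitter: break after each period that is followed by whitespace.
    sentences = [s for s in re.split(r'(?<=\.)\s+', text.strip()) if s]
    pattern = re.compile('|'.join(map(re.escape, keywords)))
    keep = set()
    for i, sentence in enumerate(sentences):
        if pattern.search(sentence):
            # Clamp the window so a match in the first or last sentence is safe.
            start = max(0, i - n_before)
            stop = min(len(sentences), i + n_after + 1)
            keep.update(range(start, stop))
    return ' '.join(sentences[i] for i in sorted(keep))

text = 'Alpha one. Beta ink two. Gamma three. Delta four.'
print(extract_with_context(text, ['ink'], n_before=1, n_after=1))
# Alpha one. Beta ink two. Gamma three.
print(extract_with_context('Ink first. Second. Third.', ['Ink'], n_before=2, n_after=1))
# Ink first. Second.
```

It could then be applied per row, for example with `df['text'].apply(extract_with_context, keywords=word_list, n_before=2, n_after=2)`. Unlike the symmetric rolling window, this variant allows a different number of sentences before and after the match.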
