
How to extract a specific part from text

I have a string containing many words, and I need to extract a specific part of it. The details are below.

Suppose I have the following string:

x = "I am amartya ccccc amartya xxxxx amartya yyyyy amartya mohan tagore bvfvhbvbv amartya vfvbvbvfhv amartya"

Now I want to extract the content between amartya and tagore, and it should be exactly 'mohan'; that is, which occurrence of amartya is used matters. I have used a regexp, but it gave me the following content: "ccccc amartya xxxxx amartya yyyyy amartya mohan", whereas I want only 'mohan' as my output.

This regular expression works for your specific example:

import re
# group(2) captures the text between the chosen "amartya" and "tagore"
r = re.search(r"(amartya)(?!.*amartya.*tagore)(.*)(tagore)", x)
r.group(2).strip()  # 'mohan'

It basically says: match a pattern that starts with "amartya" and ends with "tagore", where the starting "amartya" is not followed by another "amartya" that itself has a "tagore" after it. For this string, that leaves only the last "amartya" before "tagore" as a valid starting point.

The second group is the (.*), which captures everything between that "amartya" and "tagore".
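As a rough sketch of an alternative (not part of the answer above), the same result can be obtained by refusing to step over another "amartya" while scanning toward "tagore". The pattern and the variable names below are assumptions made for illustration:

```python
import re

x = ("I am amartya ccccc amartya xxxxx amartya yyyyy amartya mohan tagore "
     "bvfvhbvbv amartya vfvbvbvfhv amartya")

# Hypothetical alternative pattern: each "." is consumed only if "amartya"
# does not start at that position, so the captured text can never contain
# another "amartya".
pattern = re.compile(r"amartya((?:(?!amartya).)*?)tagore")

m = pattern.search(x)
between = m.group(1).strip() if m else None
print(between)
```

Unlike the greedy (.*) version, this variant stops at the first "tagore" it reaches, which may or may not be what you want if the text contains several.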

From the docs (re):

(?!...)

Matches if ... doesn't match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by 'Asimov'.
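The documentation's example can be tried directly. A minimal check, where the two test strings are made up for illustration:

```python
import re

# Pattern taken from the docs example quoted above.
pattern = re.compile(r"Isaac (?!Asimov)")

# "Isaac " is followed by "Newton", so the lookahead allows the match.
newton = pattern.search("Isaac Newton")
# "Isaac " is followed by "Asimov", so the lookahead rejects the match.
asimov = pattern.search("Isaac Asimov")

print(newton is not None)
print(asimov is not None)
```

Note that the lookahead consumes nothing: when the pattern matches, the matched text is just 'Isaac '.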

Hope that helps.

In this case you could first split at "tagore", then split the first piece at "amartya" and take the last piece of the string:

x = "I am amartya ccccc amartya xxxxx amartya yyyyy amartya mohan tagore bvfvhbvbv amartya vfvbvbvfhv amartya"

x1 = x.split('tagore')[0]
print(x1)
#I am amartya ccccc amartya xxxxx amartya yyyyy amartya mohan 
x2 = x1.split('amartya')[-1]
print(x2.strip(" "))
#mohan
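The two splits above could be wrapped in a small helper. This is only a sketch: the function name between_last and its behaviour of returning None when a marker is missing are assumptions, not part of the answer.

```python
def between_last(text, start, end):
    """Return the text between the first `end` and the closest `start` before it.

    Hypothetical helper; returns None if either marker is missing.
    """
    # Keep only what comes before the first occurrence of `end`.
    head, found_end, _ = text.partition(end)
    if not found_end:
        return None
    # Within that prefix, keep only what follows the last `start`.
    _, found_start, tail = head.rpartition(start)
    if not found_start:
        return None
    return tail.strip()


x = ("I am amartya ccccc amartya xxxxx amartya yyyyy amartya mohan tagore "
     "bvfvhbvbv amartya vfvbvbvfhv amartya")
print(between_last(x, "amartya", "tagore"))
```

Using partition and rpartition instead of split avoids building lists of every piece when only one piece is needed.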
