Extract words/sentences that occur before a keyword from a string - Python

Question

I have a string like this:

my_str ='·in this match, dated may 1, 2013 (the "the match") is between brooklyn centenniel, resident of detroit, michigan ("champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").'

Now, I want to extract the current champion and the underdog using the keywords champion and underdog.

What is really challenging here is that both contenders' names appear before the keyword inside parentheses. I want to use a regular expression to extract the information.

Here is what I tried:

import re

champion = re.findall(r'("champion"[^.]*.)', my_str)
print(champion)

>> ['"champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").']


underdog = re.findall(r'("underdog"[^.]*.)', my_str)
print(underdog)

>>['"underdog").']

However, I need the champion result to be:

brooklyn centenniel, resident of detroit, michigan

and the underdog to be:

kamil kubaru, the challenger from alexandria, virginia

How can I do this using a regular expression? (I have been searching for a way to go back a couple of words from the keyword to get the result I want, but no luck yet.) Any help or suggestion would be appreciated.

Answer 1

You can use named capture groups to capture the desired results:

between\s+(?P<champion>.*?)\s+\("champion"\)\s+and\s+(?P<underdog>.*?)\s+\("underdog"\)

between\s+(?P<champion>.*?)\s+\("champion"\) matches the chunk from between to ("champion") and captures the text in between as the named group champion.
After that, \s+and\s+(?P<underdog>.*?)\s+\("underdog"\) matches the chunk up to ("underdog") and captures the text before it as the named group underdog.
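The same pattern can be easier to read when written with re.VERBOSE, which lets each piece carry its own comment. This is only a sketch of the pattern described above; the names PATTERN and match are illustrative and not part of the original answer.

```python
import re

# Same regex as above, split across lines. In verbose mode, whitespace in the
# pattern is ignored and "#" starts a comment.
PATTERN = re.compile(r'''
    between\s+              # literal "between" and the whitespace after it
    (?P<champion>.*?)       # lazily capture the description of the champion
    \s+\("champion"\)       # stop at the ("champion") label
    \s+and\s+               # the "and" joining the two contenders
    (?P<underdog>.*?)       # lazily capture the description of the underdog
    \s+\("underdog"\)       # stop at the ("underdog") label
''', re.VERBOSE)

my_str = ('·in this match, dated may 1, 2013 (the "the match") is between '
          'brooklyn centenniel, resident of detroit, michigan ("champion") '
          'and kamil kubaru, the challenger from alexandria, virginia '
          '("underdog").')

match = PATTERN.search(my_str)
print(match.group('champion'))
print(match.group('underdog'))
```

The lazy quantifier .*? matters here: it makes each group stop at the first label that follows it instead of running on to a later one.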

Example:

In [26]: my_str ='·in this match, dated may 1, 2013 (the "the match") is between brooklyn centenniel, resident of detroit, michigan ("champion") and kamil kubaru, the challenger from alexandria, virginia 
    ...: ("underdog").'

In [27]: out = re.search(r'between\s+(?P<champion>.*?)\s+\("champion"\)\s+and\s+(?P<underdog>.*?)\s+\("underdog"\)', my_str)

In [28]: out.groupdict()
Out[28]: 
{'champion': 'brooklyn centenniel, resident of detroit, michigan',
 'underdog': 'kamil kubaru, the challenger from alexandria, virginia'}
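One thing to keep in mind when reusing this outside an interactive session: re.search returns None when the text does not match, so calling groupdict() on the result directly would raise AttributeError. A minimal sketch that guards against that is below; the function name extract_contenders and the constant CONTENDERS are hypothetical names chosen for illustration.

```python
import re

# Regex from the answer, precompiled. Adjacent raw strings are concatenated.
CONTENDERS = re.compile(
    r'between\s+(?P<champion>.*?)\s+\("champion"\)'
    r'\s+and\s+(?P<underdog>.*?)\s+\("underdog"\)'
)


def extract_contenders(text):
    """Return {'champion': ..., 'underdog': ...}, or None if nothing matches."""
    match = CONTENDERS.search(text)
    return match.groupdict() if match else None


my_str = ('·in this match, dated may 1, 2013 (the "the match") is between '
          'brooklyn centenniel, resident of detroit, michigan ("champion") '
          'and kamil kubaru, the challenger from alexandria, virginia '
          '("underdog").')

print(extract_contenders(my_str))
print(extract_contenders('a sentence with no contenders'))
```

Note that the pattern assumes the sentence keeps the shape "between X ("champion") and Y ("underdog")"; a differently worded sentence would simply return None.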

Answer 2

There will be a better answer than this, and I don't know regex at all, but I'm bored, so here's my 2 cents.

Here's how I would go about it:

words = my_str.split()
index = words.index('("champion")')
champion = words[index - 6:index]
champion = " ".join(champion)

For the underdog, you will have to change the 6 to a 7, and '("champion")' to '("underdog").' (the trailing period is part of that word after splitting).

Not sure if this will solve your problem in general, but for this particular string, this worked when I tested it.

You could also use str.strip() to remove punctuation if that trailing period on underdog is a problem.
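Putting the pieces of this answer together, a rough sketch for both contenders might look like the following. The helper name words_before is made up for illustration, and the word counts 6 and 7 are specific to this one string, so this only works when the number of words before each label is known in advance.

```python
def words_before(text, token, count):
    """Return the `count` whitespace-separated words that precede `token`."""
    # Strip periods from each word so '("underdog").' can be looked up
    # as '("underdog")', the same way as the champion label.
    words = [word.strip('.') for word in text.split()]
    index = words.index(token)  # raises ValueError if the token is absent
    return ' '.join(words[max(index - count, 0):index])


my_str = ('·in this match, dated may 1, 2013 (the "the match") is between '
          'brooklyn centenniel, resident of detroit, michigan ("champion") '
          'and kamil kubaru, the challenger from alexandria, virginia '
          '("underdog").')

champion = words_before(my_str, '("champion")', 6)
underdog = words_before(my_str, '("underdog")', 7)
print(champion)
print(underdog)
```

Compared with the regex in Answer 1, this avoids regular expressions entirely but breaks as soon as a name or place has a different number of words.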
