I am trying to extract names from a block of text. Since only a few names can ever occur, it is easy to construct a list of names in advance, and I would like to match them against the text. For example, I have the following list:
names = [ "Wim Duisenberg", "Jean-Claude Trichet", "Mario Draghi", "Christine Lagarde"]
And the following block of text, which is scraped with Beautiful Soup:
print(textauthors)
<h2 class="ecb-pressContentSubtitle">Mario Draghi, President of the ECB, <br/>Vítor Constâncio, Vice-President of the ECB, <br/>Frankfurt am Main, 20 October 2016</h2>
I tried the following solution (based on this answer on Stack Overflow):
def exact_Match(textauthors, names):
    b = r'(\s|^|$)'
    res = return re.match(b + word + b, phrase, flags=re.IGNORECASE)
print(res)
It gives me a syntax error and I am not sure how to solve it. I apologize in advance if this has already been answered somewhere on Stack Overflow; I am a Python beginner and I am not sure how to even search for the right question. When I search for name matching, I see answers that use nltk, which is not appropriate for my case because I want an exact match, and when I search for matching based on string text I can't find an answer that works for me.
This will give you authors from textauthors:
import re
textauthors = '<h2 class="ecb-pressContentSubtitle">Mario Draghi, President of the ECB, <br/>Vítor Constâncio, Vice-President of the ECB, <br/>Frankfurt am Main, 20 October 2016</h2>'
regex = r">(?P<name>[^\s]+\s[^\s]+),"
matches = re.findall(regex, textauthors)
print(matches) # ['Mario Draghi', 'Vítor Constâncio']
That is, of course, if you need to extract the authors from your textauthors.
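If the goal is instead to check which names from the predefined list occur in the text, as the question describes, a minimal sketch could look like the following. The helper name find_known_names is hypothetical, and the sketch assumes textauthors has been converted to a plain string (for example with str(...)) rather than left as a Beautiful Soup object. It uses re.search rather than re.match, because re.match only matches at the very start of the string.

```python
import re

names = ["Wim Duisenberg", "Jean-Claude Trichet", "Mario Draghi", "Christine Lagarde"]

# Assumed to be a plain string; the same sample as in the question.
textauthors = (
    '<h2 class="ecb-pressContentSubtitle">Mario Draghi, President of the ECB, '
    '<br/>Vítor Constâncio, Vice-President of the ECB, '
    '<br/>Frankfurt am Main, 20 October 2016</h2>'
)


def find_known_names(text, known_names):
    """Return the entries of known_names that occur in text as whole names."""
    found = []
    for name in known_names:
        # re.escape keeps characters such as "-" literal; the lookarounds
        # reject matches that are glued to other word characters.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found


print(find_known_names(textauthors, names))  # ['Mario Draghi']
```

Vítor Constâncio is not reported here because he is not in the names list; add him to the list if he should be matched as well.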