
How to extract multiple names from the same string in Python

I am working on scraping data and parsing out the names within a string. For example, I'm working with strings that look similar to the following:

Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County

and

Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco

Is there code that takes texts like these and transforms them into a dataset?

The resulting data should look like this:

   Name           Affiliation
Sharif Amlani   UC Davis Health
Joe Biden       UC San Francisco
Elton John      Public Health Director for Davis County
Winston Bishop  UC San Francisco
Usain Bolt      UC San Francisco

Thanks

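Here is example code for this sample text, assuming it is already laid out in fixed-width columns (the name in the first 16 characters, the affiliation after that):

import pandas as pd

text = "\
Sharif Amlani   UC Davis Health\n\
Joe Biden       UC San Francisco\n\
Elton John      Public Health Director for Davis County\n\
Winston Bishop  UC San Francisco\n\
Usain Bolt      UC San Francisco"

lines = text.split('\n')
# the name occupies the first 16 characters; the rest of the line is the affiliation
df = pd.DataFrame(
    [[line[:16].strip(), line[16:].strip()] for line in lines],
    columns=['Name', 'Affiliation'],
)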

If your strings are always in the format "name from place and name from place", you can do it as:

import pandas as pd

# your consistently formatted string
s = "Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco"

rows = []  # collects one [name, affiliation] pair per person
for row in s.split(' and '):  # each row looks like "name from affiliation"
    # split on the first ' from ' only and strip stray whitespace
    rows.append([part.strip() for part in row.split(' from ', 1)])

# then create the DataFrame
df = pd.DataFrame(data=rows, columns=['Name', 'Affiliation'])

# the names still carry their 'Dr. ' prefix; remove it if you want bare names
df['Name'] = df['Name'].str.replace(r'^Dr\.\s*', '', regex=True)

In scraping, it all comes down to pattern matching, which can be very painful when the strings are not consistently formatted. Unfortunately, that seems to be the case here, so I'd advise taking it on a case-by-case basis.

One pattern I can observe is that, with one exception (Usain Bolt), every name starts with 'Dr.'. You can use this to extract the names with a regular expression:

import re

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"

regex = r'((Dr\.)( [A-Z][a-z]+)+)' # each match is a tuple of three groups; the dot is escaped so it only matches a literal '.'

names = [match[0] for match in re.findall(regex, text)] # keep the first group, which is the full name including 'Dr.'

You can apply this to other strings, but the limitation, as I mentioned above, is that it will only capture names starting with 'Dr.'. You can use a similar strategy for Affiliations as well. Notice that a ',' separates names and affiliations so we can use this.

import re

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"

affiliations = [term.strip() for term in text.split(',') if 'Dr.' not in term] # split the text on commas, drop the pieces that contain 'Dr.', and strip the surrounding whitespace

Again, you'll have to tailor your solution to the specific text, but hopefully, this can assist in your thinking about the problem. Finally, you can combine the results into a data frame using pandas:

import pandas as pd

data = pd.DataFrame(list(zip(names, affiliations)), columns = ['Name', 'Affiliation'])
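
One caveat with building the two lists separately: if a name or an affiliation is missed, zip will silently pair the wrong items. A minimal sketch of an alternative, assuming every person in the comma-separated string is written as "Dr. name, affiliation", captures each pair in a single pass. The names pair_re and pairs are just illustrative.

```python
import re

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"

# One pattern captures a name and the affiliation that follows it:
# 'Dr.', the name up to the next comma, then the affiliation up to the next comma.
pair_re = re.compile(r'Dr\.\s+([^,]+),\s*([^,]+)')

pairs = [(m.group(1).strip(), m.group(2).strip()) for m in pair_re.finditer(text)]

for name, affiliation in pairs:
    print(name, '|', affiliation)
```

The list of tuples can be passed to pd.DataFrame(pairs, columns=['Name', 'Affiliation']) in the same way as above. This still only finds people introduced with 'Dr.', and it assumes affiliations contain no commas.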

You could do a regex match and create a df. Showing the approach for one sample string here:

import re
import pandas as pd

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"
text = text.replace(', and', ',')

# [\w\s]+ cannot match the '.' in 'Dr.', so each captured name starts right after the title
r = re.findall(r"([\w\s]+),([\w\s]+)", text)
df = pd.DataFrame(r)
df.columns = ("Name", "Affiliation")
print(df)

Output:

           Name                               Affiliation
0   Sharif Amlani                           UC Davis Health
1       Joe Biden                          UC San Francisco
2      Elton John   Public Health Director for Davis County


 