I am working on scraping data and parsing out the names within a string. For example, I'm working with strings that look similar to the following:
Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County
and
Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco
Is there code to take texts like these and transform them into a dataset?
The data should look like this:
Name            Affiliation
Sharif Amlani   UC Davis Health
Joe Biden       UC San Francisco
Elton John      Public Health Director for Davis County
Winston Bishop  UC San Francisco
Usain Bolt      UC San Francisco
Thanks
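Here is example code for this sample text, assuming the name column is padded to a fixed width of 16 characters:
import pandas as pd

# fixed-width text: the name occupies the first 16 characters of each line
text = "\
Sharif Amlani   UC Davis Health\n\
Joe Biden       UC San Francisco\n\
Elton John      Public Health Director for Davis County\n\
Winston Bishop  UC San Francisco\n\
Usain Bolt      UC San Francisco"
lines = text.split('\n')
# slice each line at column 16 and strip the padding
df = pd.DataFrame([[line[0:16].strip(), line[16:].strip()] for line in lines],
                  columns=['Name', 'Affiliation'])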
If your strings are always in the format "name from place and name from place", you can do it like this:
import pandas as pd

# your consistently formatted string
s = "Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco"

rows = []  # collects one [name, affiliation] pair per person
for row in s.split(' and '):  # each row looks like "name from affiliation"
    # split on ' from ' (with spaces, so words that merely contain "from" survive)
    # and strip stray whitespace from both parts
    rows.append([part.strip() for part in row.split(' from ')])

# then create the DataFrame
df = pd.DataFrame(data=rows, columns=['Name', 'Affiliation'])
# the names still carry the "Dr. " title; to drop it:
# df['Name'] = df['Name'].str.replace(r'^Dr\.\s*', '', regex=True)
In scraping it all comes down to pattern matching. It can be very painful if the strings are not consistently formatted, which unfortunately seems to be the case here, so I'd advise taking it on a case-by-case basis.
One pattern I can observe is that, with one exception, all names start with 'Dr.'. You can use this to extract the names with a regular expression.
import re
text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"
regex = r'((Dr\.)( [A-Z][a-z]+)+)'  # three capture groups, so findall returns one 3-tuple per match
names = [match[0] for match in re.findall(regex, text)]  # the first group is the whole name, including the 'Dr.' title
You can apply this to other strings, but the limitation, as I mentioned above, is that it will only capture names starting with 'Dr.'. You can use a similar strategy for Affiliations as well. Notice that a ',' separates names and affiliations so we can use this.
import re
text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"
affiliations = [term.strip() for term in text.split(',') if 'Dr.' not in term]  # split on commas, drop the pieces that contain 'Dr.', and strip the leftover spaces
Again, you'll have to tailor your solution to the specific text, but hopefully, this can assist in your thinking about the problem. Finally, you can combine the results into a data frame using pandas:
import pandas as pd
data = pd.DataFrame(list(zip(names, affiliations)), columns = ['Name', 'Affiliation'])
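The 'Dr.' limitation mentioned above can be loosened by making the title optional and capturing the name and affiliation in one pattern. The following is only a sketch under stated assumptions: `PAIR` and `parse_comma_pairs` are hypothetical names, the input is assumed to alternate strictly between name and affiliation, the last pair is assumed to be introduced by ", and ", and names are assumed to be two or more capitalised words.

```python
import re

# Optional "Dr. " title, a name of two or more capitalised words,
# a comma, then the affiliation up to the next comma.
PAIR = re.compile(
    r"(?:Dr\.\s+)?(?P<name>[A-Z][\w'-]*(?: [A-Z][\w'-]*)+),\s*(?P<affiliation>[^,]+)"
)

def parse_comma_pairs(text):
    """Return (name, affiliation) tuples from 'name, place, name, place' text."""
    # Assumption: the final pair is introduced by ", and ".
    text = text.replace(", and ", ", ")
    return [(m["name"], m["affiliation"].strip()) for m in PAIR.finditer(text)]

text = ("Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, "
        "and Dr. Elton John, Public Health Director for Davis County")
for name, affiliation in parse_comma_pairs(text):
    print(name, "|", affiliation)
```

The returned list can be passed to pd.DataFrame with columns=['Name', 'Affiliation'] in the same way as the zipped lists. Because the pairing relies on strict alternation, an affiliation that itself contains a comma would throw the result off.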
You could do a regex match and create a df. Showing the sample approach for one string here:
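import re
import pandas as pd

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"
text = text.replace(', and', ',')
# each match is a (name, affiliation) tuple; 'Dr.' is skipped because '.' is not matched by [\w\s]
r = [(name.strip(), aff.strip()) for name, aff in re.findall(r"([\w\s]+),([\w\s]+)", text)]
df = pd.DataFrame(r, columns=["Name", "Affiliation"])
print(df)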
Output:
Name Affiliation
0 Sharif Amlani UC Davis Health
1 Joe Biden UC San Francisco
2 Elton John Public Health Director for Davis County
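To cover both sample strings from the question in one pass, the two formats can be told apart by whether the text contains " from ". This is a rough sketch, not a general solution: `clean_name` and `parse_people` are hypothetical helper names, and it assumes that every chunk of a "from" string contains exactly one " from " and that comma-separated strings alternate strictly between name and affiliation.

```python
import re

# Leading "and " and/or "Dr. " in front of a name.
TITLE = re.compile(r"^(?:and\s+)?(?:Dr\.\s+)?")

def clean_name(raw):
    """Strip whitespace, a leading 'and', and a leading 'Dr.' title."""
    return TITLE.sub("", raw.strip())

def parse_people(text):
    """Return (name, affiliation) tuples from either sample format."""
    if " from " in text:
        # Format: "name from place and name from place"
        chunks = re.split(r"\s+and\s+", text)
        pairs = [chunk.split(" from ", 1) for chunk in chunks]
    else:
        # Format: "name, place, name, place, and name, place"
        parts = [part.strip() for part in text.split(",")]
        pairs = list(zip(parts[0::2], parts[1::2]))
    return [(clean_name(name), place.strip()) for name, place in pairs]

samples = [
    "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, "
    "and Dr. Elton John, Public Health Director for Davis County",
    "Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco",
]
rows = [pair for sample in samples for pair in parse_people(sample)]
for name, place in rows:
    print(name, "|", place)
```

Here rows holds all five (name, affiliation) pairs from the question, and pd.DataFrame(rows, columns=["Name", "Affiliation"]) would turn them into the requested table. An affiliation containing " and " or " from " would break the first branch, so real scraped text will likely need extra cases.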