How to extract multiple names from the same string in Python

I am scraping data and need to parse out the names within a string. For example, I'm working with strings that look similar to the following:

Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County

and

Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco

Is there code to take texts like these and transform them into a dataset?

The resulting data should look like this:

   Name           Affiliation
Sharif Amlani   UC Davis Health
Joe Biden       UC San Francisco
Elton John      Public Health Director for Davis County
Winston Bishop  UC San Francisco
Usain Bolt      UC San Francisco

Thanks

Here is example code for this sample text:

import pandas as pd

# sample text: the name occupies the first 16 characters, the affiliation the rest
text = "\
Sharif Amlani   UC Davis Health\n\
Joe Biden       UC San Francisco\n\
Elton John      Public Health Director for Davis County\n\
Winston Bishop  UC San Francisco\n\
Usain Bolt      UC San Francisco"

lines = text.split('\n')
# slice each line at column 16 and build a single two-column DataFrame
df = pd.DataFrame(
    [[line[:16].strip(), line[16:].strip()] for line in lines],
    columns=['Name', 'Affiliation'],
)

If your strings are always in the format "name from place and name from place", you can do it like this:

import pandas as pd

# your consistently formatted string
s = "Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco" 

rows = []  # collects one [name, affiliation] pair per person
for row in s.split(' and '):  # each row looks like "name from affiliation"
    # split on ' from ' and strip whitespace from both parts
    rows.append([part.strip() for part in row.split(' from ')])

# then create the DataFrame (names keep any 'Dr.' prefix)
df = pd.DataFrame(data=rows, columns=['Name', 'Affiliation'])
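A slightly more defensive variant of the same idea, as a minimal sketch: it splits on the standalone word "and", splits each piece on the first " from " only, and drops an optional leading "Dr." so the names match the desired output. The function name `parse_from_format` is hypothetical, and the sketch assumes that no affiliation itself contains the word "and".

```python
import re

def parse_from_format(s):
    """Parse 'Name from Place and Name from Place' into (name, affiliation) tuples."""
    rows = []
    # split on "and" surrounded by whitespace, so words containing "and" stay intact
    for chunk in re.split(r"\s+and\s+", s):
        # partition splits on the first " from " only
        name, sep, affiliation = chunk.partition(" from ")
        if not sep:
            continue  # chunk does not follow the expected pattern; skip it
        # drop an optional leading "Dr." title
        name = re.sub(r"^Dr\.\s*", "", name.strip())
        rows.append((name, affiliation.strip()))
    return rows

s = "Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco"
rows = parse_from_format(s)
print(rows)
```

The resulting list of tuples can be passed to pd.DataFrame(rows, columns=['Name', 'Affiliation']) as in the answer above.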

In scraping, it all comes down to pattern matching. It can be very painful if the strings are not consistently formatted, and unfortunately that seems to be the case here. So I'd advise handling it on a case-by-case basis.

One pattern I can observe is that, with one exception, all names start with 'Dr.'. You can use this to extract the names with a regular expression.

import re

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"

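# 'Dr.' followed by one or more capitalized words; the dot is escaped so it matches a literal '.'
regex = r'((Dr\.)( [A-Z][a-z]+)+)'  # findall returns a tuple of three groups per match

# keep only the first group of each match, which is the full name including 'Dr.'
names = [match[0] for match in re.findall(regex, text)]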

You can apply this to other strings, but the limitation, as mentioned above, is that it only captures names starting with 'Dr.'. You can use a similar strategy for the affiliations. Notice that a ',' separates names and affiliations, so we can use that.

import re

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"

# split the text on commas, drop the terms that contain 'Dr.' and trim the surrounding whitespace
affiliations = [term.strip() for term in text.split(',') if 'Dr.' not in term]

Again, you'll have to tailor your solution to the specific text, but hopefully this helps you think about the problem. Finally, you can combine the results into a data frame using pandas:

import pandas as pd

data = pd.DataFrame(list(zip(names, affiliations)), columns = ['Name', 'Affiliation'])
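One caveat with zipping two separately filtered lists: if a name lacks 'Dr.', the lists no longer line up and zip silently drops or mispairs entries. A minimal sketch of an alternative pairs the comma-separated terms by position instead. The function name `parse_comma_format` is hypothetical, and the sketch assumes the terms strictly alternate name, affiliation and that no affiliation contains a comma.

```python
import re

def parse_comma_format(text):
    """Pair comma-separated terms as (name, affiliation) by position."""
    # split on commas and trim surrounding whitespace
    terms = [t.strip() for t in text.split(",")]
    rows = []
    # even positions hold names, odd positions hold affiliations
    for name, affiliation in zip(terms[0::2], terms[1::2]):
        # remove an optional leading "and" and an optional "Dr." title
        name = re.sub(r"^(and\s+)?(Dr\.\s*)?", "", name)
        rows.append((name, affiliation))
    return rows

text = ("Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, "
        "and Dr. Elton John, Public Health Director for Davis County")
rows = parse_comma_format(text)
print(rows)
```

Because the pairing is positional, this also works for names without a title, as long as the alternation holds.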

You could do a regex match and create a DataFrame. Here is the approach for one sample string:

import re
import pandas as pd

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"
text = text.replace(', and', ',')

# each match is a (name, affiliation) tuple; 'Dr.' is skipped because [\w\s] does not match '.'
r = re.findall(r"([\w\s]+),([\w\s]+)", text)
df = pd.DataFrame(r)
df.columns = ("Name", "Affiliation")
print(df)

Output:

           Name                               Affiliation
0   Sharif Amlani                           UC Davis Health
1       Joe Biden                          UC San Francisco
2      Elton John   Public Health Director for Davis County
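To go from one string to a whole scraped collection, the same pattern can be applied in a loop. The following is a minimal sketch under the same assumptions (comma-separated name and affiliation pairs, titles ending in a period); the function name `extract_pairs` is hypothetical. It also trims the leading whitespace that the captured groups otherwise carry.

```python
import re

# same pattern as above: a run of word/space characters on each side of a comma
PAIR = re.compile(r"([\w\s]+),([\w\s]+)")

def extract_pairs(texts):
    """Collect (name, affiliation) tuples from an iterable of strings."""
    rows = []
    for text in texts:
        # turn the final ", and" into a plain comma separator
        text = text.replace(", and", ",")
        for name, affiliation in PAIR.findall(text):
            rows.append((name.strip(), affiliation.strip()))
    return rows

texts = [
    "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, "
    "and Dr. Elton John, Public Health Director for Davis County",
]
rows = extract_pairs(texts)
print(rows)
```

The rows can then be handed to pd.DataFrame(rows, columns=["Name", "Affiliation"]) to build the dataset in one step.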
