How to extract multiple names from the same string in Python

I am scraping data and need to parse names out of strings. For example, I am working with strings like the following:

Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County

Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco

Is there code that can turn these strings into a dataset?

The data should end up looking like this:

   Name           Affiliation
Sharif Amlani   UC Davis Health
Joe Biden       UC San Francisco
Elton John      Public Health Director for Davis County
Winston Bishop  UC San Francisco
Usain Bolt      UC San Francisco

Thanks.

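Here is sample code for this example text, assuming it has already been laid out in fixed-width columns:

import pandas as pd

text = "\
Sharif Amlani   UC Davis Health\n\
Joe Biden       UC San Francisco\n\
Elton John      Public Health Director for Davis County\n\
Winston Bishop  UC San Francisco\n\
Usain Bolt      UC San Francisco"

lines = text.split('\n')
# the name occupies the first 16 characters of each line; the rest is the affiliation
df = pd.DataFrame([[line[0:16].strip(), line[16:].strip()] for line in lines],
                  columns=['Name', 'Affiliation'])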

If your strings always follow the format "name from place and name from place", you can do this:

import pandas as pd

# your consistently formatted string
s = "Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco"

rows = []  # collects one [name, affiliation] pair per person
for row in s.split(' and '):  # each row looks like "name from affiliation"
    # split on ' from ' and strip surrounding whitespace from both parts
    rows.append([part.strip() for part in row.split(' from ')])

# then create the DataFrame
df = pd.DataFrame(data=rows, columns=['Name', 'Affiliation'])

# the names still carry the "Dr." title; to drop it you could use
# df['Name'] = df['Name'].str.replace(r'^Dr\.\s*', '', regex=True)
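
As a slightly more defensive variant of the same idea, here is a minimal sketch that returns plain tuples. The function name `parse_from_and` is made up for illustration, and it assumes neither names nor affiliations contain the standalone word "and" (an affiliation such as "Health and Human Services" would be split incorrectly).

```python
import re

def parse_from_and(s):
    """Split 'name from place and name from place' into (name, affiliation) pairs."""
    pairs = []
    # Split on "and" surrounded by whitespace, so a name like "Sandra" stays intact.
    for chunk in re.split(r"\s+and\s+", s):
        # partition() splits only on the first " from ", keeping the rest together.
        name, _, affiliation = chunk.partition(" from ")
        # Drop an optional leading "Dr." title.
        name = re.sub(r"^\s*Dr\.\s*", "", name).strip()
        pairs.append((name, affiliation.strip()))
    return pairs

s = "Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco"
pairs = parse_from_and(s)
print(pairs)
```

The resulting list can be passed straight to `pd.DataFrame(pairs, columns=['Name', 'Affiliation'])`.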

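In scraping, it all comes down to pattern matching. It can be very painful when the strings are not formatted consistently, and unfortunately that seems to be the case here. So I suggest handling it case by case.

One pattern I can see is that, with one exception, every name starts with "Dr.". You can use that to extract the names with a regular expression.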

import re

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"

regex = r'((Dr\.)( [A-Z][a-z]+)+)'  # each match yields three groups; the first is the full "Dr. First Last"

names = [match[0] for match in re.findall(regex, text)]  # keep the first group of each match, which is the name
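
The names extracted this way still include the "Dr." title, while the table in the question does not. One possible tweak, sketched here under the same assumption that each name is a run of capitalized words after "Dr.", is to use a single capturing group so that `re.findall` returns just the name strings:

```python
import re

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"

# Only the name is captured; the inner group is non-capturing,
# so findall returns plain strings instead of tuples.
pattern = r"Dr\.\s+([A-Z][a-z]+(?: [A-Z][a-z]+)*)"
names = re.findall(pattern, text)
print(names)  # ['Sharif Amlani', 'Joe Biden', 'Elton John']
```

Names with hyphens, apostrophes, or internal capitals (for example "McDonald") would need a broader character class.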

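You can apply this to the other strings, but as mentioned above, the limitation is that it only captures names that start with "Dr.". You can use a similar strategy for the affiliations. Note that ',' separates names from affiliations, so we can take advantage of that.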

import re

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"

affiliations = [term.strip() for term in text.split(',') if 'Dr.' not in term]  # split on commas, drop the pieces containing 'Dr.', and trim whitespace

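Again, you will have to tailor the solution to your specific text, but hopefully this helps you think through the problem. Finally, you can combine the results into a DataFrame with pandas: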

import pandas as pd

data = pd.DataFrame(list(zip(names, affiliations)), columns = ['Name', 'Affiliation'])

You can do a regex match and build the df from it. Here is an example approach for one string:

import re
import pandas as pd

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"
text = text.replace(', and', ',')

# [\w\s]+ cannot match the '.' in "Dr.", so each name starts right after the title
r = re.findall(r"([\w\s]+),([\w\s]+)", text)
df = pd.DataFrame(r)
df.columns = ("Name", "Affiliation")
print(df)

Output:

           Name                               Affiliation
0   Sharif Amlani                           UC Davis Health
1       Joe Biden                          UC San Francisco
2      Elton John   Public Health Director for Davis County

