
How to extract multiple names from the same string in Python

I am scraping data and parsing names out of strings. For example, I am working with strings similar to the following:

Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County

Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco

Is there code that can turn these texts into a dataset?

So that the data looks like this:

   Name           Affiliation
Sharif Amlani   UC Davis Health
Joe Biden       UC San Francisco
Elton John      Public Health Director for Davis County
Winston Bishop  UC San Francisco
Usain Bolt      UC San Francisco

Thanks

Here is sample code for this example text:

import pandas as pd

text = "\
Sharif Amlani   UC Davis Health\n\
Joe Biden       UC San Francisco\n\
Elton John      Public Health Director for Davis County\n\
Winston Bishop  UC San Francisco\n\
Usain Bolt      UC San Francisco"

lines = text.split('\n')
# the name occupies the first 16 characters of each line, the affiliation the rest
df = pd.DataFrame([[line[0:16].strip(), line[16:].strip()] for line in lines],
                  columns=['Name', 'Affiliation'])

If your string always has the format "name from place and name from place", you can do this:

import pandas as pd

# your consistently formatted string
s = "Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco"

rows = []  # collects [name, affiliation] pairs
for row in s.split(' and '):  # each row looks like "name from affiliation"
    # split on ' from ' and strip surrounding whitespace from both parts
    rows.append([part.strip() for part in row.split(' from ')])

# then create the DataFrame
df = pd.DataFrame(data=rows, columns=['Name', 'Affiliation'])

# the names keep any "Dr." prefix; remove it if you do not want it
df['Name'] = df['Name'].str.replace(r'^Dr\.\s*', '', regex=True)
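
As an alternative for the same "name from place and name from place" format, here is a minimal sketch that pulls out each pair with one regular expression instead of splitting. The pattern is my own assumption about the data: a name starts with a capital letter, an optional "Dr." may precede it, and an affiliation never contains the word " and ".

```python
import re

s = "Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco"

# (?:Dr\.\s+)?     optional title, not captured
# ([A-Z][\w ]*?)   the name, as short as possible, up to " from "
# (.+?)            the affiliation, up to the next " and " or the end of the string
pattern = r'(?:Dr\.\s+)?([A-Z][\w ]*?) from (.+?)(?= and |$)'

pairs = re.findall(pattern, s)  # list of (name, affiliation) tuples
print(pairs)
```

The resulting list of tuples can be passed to pd.DataFrame(pairs, columns=['Name', 'Affiliation']). An affiliation such as "Health and Human Services" would be cut at " and ", so check the output against your real data.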

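In scraping, it all comes down to pattern matching. If the strings are not formatted consistently, it can be very painful. Unfortunately, that seems to be the case here. So I suggest handling it case by case.

One pattern I can observe is that, with one exception, all names start with "Dr.". You can use that to extract the names with a regular expression.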

import re

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"

# each match yields three groups; the dot is escaped so it only matches a literal "."
regex = r'((Dr\.)( [A-Z][a-z]+)+)'

# the first group of each match is the full name, including the "Dr." prefix
names = [match[0] for match in re.findall(regex, text)]
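
If you want the names without the title, as in the table from the question, one possible variant uses a single capturing group so that re.findall returns plain strings. The pattern is an assumption on my part: it treats a name as one or more capitalised words after "Dr.", so names such as "McDonald" or "O'Brien" would not be captured correctly.

```python
import re

text = ("Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, "
        "and Dr. Elton John, Public Health Director for Davis County")

# "Dr." is matched but not captured; only the capitalised words after it are returned
pattern = r'Dr\.\s+([A-Z][a-z]+(?: [A-Z][a-z]+)*)'

names_without_title = re.findall(pattern, text)
print(names_without_title)
```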

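You can apply this to the other strings, but as mentioned above, the limitation is that it only captures names that start with "Dr.". You can use a similar strategy for the affiliations. Note that ',' separates names from affiliations, so we can use that.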

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"

# split the text on commas, drop the pieces that contain 'Dr.', and trim the leading spaces
affiliations = [term.strip() for term in text.split(',') if 'Dr.' not in term]

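Again, you will have to tailor the solution to your specific text, but hopefully this helps you think about the problem. Finally, you can use Pandas to combine the results into a DataFrame: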

import pandas as pd

data = pd.DataFrame(list(zip(names, affiliations)), columns = ['Name', 'Affiliation'])
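
Zipping two separately built lists only lines up if every name carries "Dr.". A rough sketch of a different pairing, assuming the comma-separated fields strictly alternate between name and affiliation, takes the fields two at a time instead:

```python
import re

text = ("Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, "
        "and Dr. Elton John, Public Health Director for Davis County")

# turn ", and " into a plain comma, then split into trimmed fields
fields = [f.strip() for f in re.sub(r',\s*and\s+', ', ', text).split(',')]

# assumption: even positions are names, odd positions are affiliations
pairs = [(re.sub(r'^Dr\.\s*', '', name), affiliation)
         for name, affiliation in zip(fields[0::2], fields[1::2])]
print(pairs)
```

This no longer depends on the "Dr." prefix, but it breaks if a name or affiliation itself contains a comma.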

You can do a regex match and create a df. Here is an example approach for one string:

import re
import pandas as pd

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"
text = text.replace(', and', ',')

# each match is a (name, affiliation) pair; "Dr." is skipped because [\w\s] does not match "."
r = re.findall(r"([\w\s]+),([\w\s]+)", text)
df = pd.DataFrame(r)
df.columns = ["Name", "Affiliation"]
print(df)

Output:

           Name                               Affiliation
0   Sharif Amlani                           UC Davis Health
1       Joe Biden                          UC San Francisco
2      Elton John   Public Health Director for Davis County
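
To get a single table out of both example strings from the question, one option is to rewrite the separators " from " and " and " as commas first, so both formats reduce to alternating name and affiliation fields. This is only a sketch under that assumption; the helper name parse_people is hypothetical, and an affiliation containing " and ", " from " or a comma would be split incorrectly.

```python
import re

def parse_people(s):
    """Return (name, affiliation) tuples from either example format."""
    s = re.sub(r',?\s+and\s+', ', ', s)   # ", and " or " and " becomes ", "
    s = re.sub(r'\s+from\s+', ', ', s)    # " from " becomes ", "
    fields = [f.strip() for f in s.split(',')]
    # assumption: fields alternate name, affiliation; "Dr." is the only title
    return [(re.sub(r'^Dr\.\s*', '', name), affiliation)
            for name, affiliation in zip(fields[0::2], fields[1::2])]

strings = [
    "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, "
    "and Dr. Elton John, Public Health Director for Davis County",
    "Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco",
]

rows = [pair for s in strings for pair in parse_people(s)]
for name, affiliation in rows:
    print(name, '|', affiliation)
```

Passing rows to pd.DataFrame(rows, columns=['Name', 'Affiliation']) gives the five-row table asked for in the question.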

