How do I write a relevant regex pattern to extract a sub-string of a larger text string in Python?
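I have a dataframe data (with lengthy, inconsistent free-text notes) and matching IDs. My goal is to use the list of substrings, list, to extract the relevant substrings of interest and create a new column holding the extracted substrings. I was told that regex is a good place to start, but I have not yet come up with a pattern that produces matches. I hope someone sees this and can point me in the right direction.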
list = ['sentara williamsburg regional medical',
'shady grove adventist hospital',
'sibley memorial hospital',
'southern maryland hospital center',
'st. mary`s hospital',
'suburban hospital healthcare system',
'the cancer center at lake manassas',
'ucla medical center',
'united medical center- greater southeast community',
'univ of md charles regional medical ctr',
'university of maryland medical center',
'university of north carolina hospital',
'university of virginia health system',
'unknown facility',
'va medical center',
'virginia hospital center-arlington',
'walter reed army medical center',
'washington adventist hospital',
'washington hospital center',
'wellstar health system, inc',
'winchester medical center']
data:
ID Notes
530.0 Cancer is best diag @Wwashington Adventist Hospital
651.0 nan
692.0 GMC-009 can be accessed at ST. Mary`s but not in UCLA Med. Center
993.0 I'm not sure of Sibley; however, Shady Grove Adventist Hosp. is great hospital
044.0 nan
055.0 2015-01-20 was the day she visited WR Army Medical Center in WDC
476.0 nan
Expected output (case really doesn't matter!):
data_out:
ID Notes
530.0 Washington Adventist Hospital
651.0 nan
692.0 ST. Mary`s Hospital, UCLA Medical Center
993.0 Sibley Memorial Hospital, Shady Grove Adventist Hospital
044.0 nan
055.0 Walter Reed Army Medical Center
476.0 nan
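I would do something like:
import re
# Escape each name so characters such as "." are matched literally, and ignore case
reg = re.compile('|'.join(map(re.escape, your_list)), re.IGNORECASE)
# findall scans the whole string; match would only test the very start of it
results = reg.findall(your_data)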
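To see what such a pattern does on the sample data, here is a minimal sketch that uses only the re module. The names hospitals, notes, and extract are made up for this illustration, and only a few entries from the question are copied in. Names written out in full are found regardless of case, while abbreviated ones such as "UCLA Med. Center" are not.

```python
import re

# A few entries copied from the question's list (hypothetical variable name)
hospitals = [
    "sibley memorial hospital",
    "st. mary`s hospital",
    "ucla medical center",
    "washington adventist hospital",
]

# Longest names first, so a longer name wins over a shorter one that is its prefix
pattern = re.compile(
    "|".join(re.escape(h) for h in sorted(hospitals, key=len, reverse=True)),
    re.IGNORECASE,
)

def extract(note):
    """Return the matched names joined by ', ', or None when nothing matches."""
    if not isinstance(note, str):  # missing notes (NaN) are floats, not strings
        return None
    found = pattern.findall(note)
    return ", ".join(found) if found else None

notes = [
    "Cancer is best diag @Wwashington Adventist Hospital",
    "GMC-009 can be accessed at ST. Mary`s but not in UCLA Med. Center",
    float("nan"),
]
results = [extract(n) for n in notes]
```

On a dataframe, the same function could presumably be applied with data["Notes"].apply(extract). The second note yields nothing here because neither name appears in full, which is the limitation the update that follows addresses.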
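Update: this code iterates over every entry in the list and compares it with the words in the "Notes" column. If a word occurs both in "list" and in "Notes", it is written to the new "output" column. You will still need regular expressions to get exactly the result you want. Note: the entries in "list" can look quite different from the words in the column while meaning the same thing (abbreviations, misspellings, differences in case), so it is hard to cover every case. Perhaps a "bag of words" approach would be useful for this problem?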
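One possible reading of the "bag of words" idea is sketched below: split each name and each note into lowercase tokens and score a name by the fraction of its tokens that appear in the note. The function names, the tokenizing regex, and the 0.6 threshold are all assumptions chosen for this illustration, not something taken from the question.

```python
import re

def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9`']+", text.lower()))

def overlap_score(name, note):
    """Fraction of the name's tokens that also occur in the note."""
    name_tokens = tokens(name)
    if not name_tokens:
        return 0.0
    return len(name_tokens & tokens(note)) / len(name_tokens)

def candidates(note, names, threshold=0.6):
    """Names whose token overlap with the note reaches the (arbitrary) threshold."""
    return [n for n in names if overlap_score(n, note) >= threshold]

names = ["st. mary`s hospital", "ucla medical center", "winchester medical center"]
note = "GMC-009 can be accessed at ST. Mary`s but not in UCLA Med. Center"
matched = candidates(note, names)
# matched == ["st. mary`s hospital", "ucla medical center"]
```

Generic tokens such as "hospital" or "center" inflate the score of unrelated names, so on the full list this is likely to produce false positives unless those tokens are ignored or the threshold is raised.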
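import re
import pandas as pd

# Split each note into words; missing notes (NaN) become an empty list
newlist = [note.split(" ") if isinstance(note, str) else [] for note in data["Notes"]]
# Create the new column "output" and default the values to be the same as in the column "Notes"
data["output"] = data["Notes"]
# Run through all names in the list and all words of every note
for name in list:
    for j, words in enumerate(newlist):
        for word in words:
            # Escape the word so punctuation is matched literally, and ignore case;
            # skip empty strings, which would match every name
            if word and re.search(re.escape(word), name, re.IGNORECASE):
                # Each hit overwrites the previous one, so only the last matching word is kept
                data.loc[data.index[j], "output"] = word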
If there is a more vectorized way to do this, I would appreciate it.