How to extract university/school/college name from string in python using regular expression?

SAMPLE CODE

import re
line = "should we use regex more often, University of Pennsylvania. let me know at  321dsasdsa@dasdsa.com.lol"
match = re.search(r'/([A-Z][^\s,.]+[.]?\s[(]?)*(Hospital|University|Institute|Law School|School of|Academy)[^,\d]*(?=,|\d)/', line)
print(match.group(0))

I'm trying to extract university/school/organization names from a given string using a regular expression in Python, but it gives an error message.

ERROR MESSAGE

Traceback (most recent call last):
  File "C:/Python/addOrganization.py", line 4, in <module>
    print(match.group(0))
AttributeError: 'NoneType' object has no attribute 'group'

Instead of re.search, try re.sub to print your expected output:

import re
i = "should we use regex more often, University of Pennsylvania. let me know at  321dsasdsa@dasdsa.com.lol"
line = re.sub(r"[\w\W]* ((Hospital|University|Centre|Law School|School|Academy|Department)[\w -]*)[\w\W]*$", r"\1", i)
print(line)
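For comparison, here is a minimal sketch that stays with re.search and guards against the None result that caused the AttributeError. It assumes the leading and trailing '/' in the original pattern are leftover JavaScript-style delimiters, which Python treats as literal characters and which would explain why nothing matched. The helper name extract_institution and the simplified pattern are illustrative, not part of the original post.

```python
import re

# Keyword alternatives taken from the pattern in the question
KEYWORDS = r"(?:Hospital|University|Institute|Law School|School of|Academy)"

# Optional run of capitalized words, then a keyword, then everything up to
# the next comma, period, or digit. No surrounding '/' delimiters.
PATTERN = re.compile(r"(?:[A-Z][^\s,.]+\s)*" + KEYWORDS + r"[^,.\d]*")

def extract_institution(text):
    """Hypothetical helper: return the first institution-like name, or None."""
    match = PATTERN.search(text)
    # re.search returns None when nothing matches, so check before .group()
    return match.group(0).strip() if match else None

line = "should we use regex more often, University of Pennsylvania. let me know at  321dsasdsa@dasdsa.com.lol"
print(extract_institution(line))
```

Because the name stops at the first period, an abbreviation inside a name (for example "St.") would cut the result short; that is a limitation of this sketch.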

The test string you've given is a made-up one: in it the university name is immediately followed by a period ('.'), while the other examples in your pastebin sample are not (they are followed by a comma).

line = should we use regex more often, University of Pennsylvania. let me know at 321dsasdsa@dasdsa.com.lol

I have managed to extract the names from the examples in your pastebin using a simple regex; you can see the details here: regex101.com

Logic

Since the institute name is separated by commas (except in the first case, where the string starts with the university name), the name will lie in either group 1 or group 2 of the match.

You can then check group 1 and group 2 against a predefined keyword list and return whichever one contains a keyword.

Code

I have used two examples to show that it works.

import re

line1 = 'The George Washington University, Washington, DC, USA.'
line2 = 'Department of Pathology, University of Oklahoma Health Sciences Center, Oklahoma City, USA. adekunle-adesina@ouhsc.edu'

# Keywords to look for in an institution name
matchlist = ['Hospital', 'University', 'Institute', 'School', 'Academy']
# Group 1: text before the first comma; group 2: text between the first and second commas
p = re.compile(r'^(.*?),\s+(.*?),(.*?)\.')

# Keep group 1 if it contains any keyword from matchlist; otherwise fall back to group 2
line1match = [m.group(1) if any(x in m.group(1) for x in matchlist) else m.group(2) for m in p.finditer(line1)]
line2match = [m.group(1) if any(x in m.group(1) for x in matchlist) else m.group(2) for m in p.finditer(line2)]

print(line1match)
# [Out]: ['The George Washington University']

print(line2match)
# [Out]: ['University of Oklahoma Health Sciences Center']
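The same keyword check can be applied to every segment of the string instead of only the first two groups. The following is a rough sketch under that assumption; the function name find_institution and the choice to split on a comma or period followed by whitespace are illustrative, not from the original answer.

```python
import re

# Same keyword idea as the matchlist above
KEYWORDS = ('Hospital', 'University', 'Institute', 'School', 'Academy')

def find_institution(line):
    """Hypothetical helper: return the first segment containing a keyword, or None."""
    # Split on a comma or period that is followed by whitespace, so the
    # dots inside an e-mail address do not split the text
    for segment in re.split(r'[,.]\s+', line):
        segment = segment.strip()
        if any(keyword in segment for keyword in KEYWORDS):
            return segment
    return None

print(find_institution('The George Washington University, Washington, DC, USA.'))
print(find_institution('should we use regex more often, University of Pennsylvania. let me know at  321dsasdsa@dasdsa.com.lol'))
```

This also handles the made-up test string from the question, where the name ends with a period rather than a comma. Names containing an abbreviation followed by a space (such as "St. Mary's Hospital") would be split in the wrong place.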
