How to extract university/school/college names from a string in Python using regular expressions?

Question

Sample code

import re
line = "should we use regex more often, University of Pennsylvania. let me know at  321dsasdsa@dasdsa.com.lol"
match = re.search(r'/([A-Z][^\s,.]+[.]?\s[(]?)*(Hospital|University|Institute|Law School|School of|Academy)[^,\d]*(?=,|\d)/', line)
print(match.group(0))

I'm trying to extract university/school/organization names from a given string using a regular expression in Python, but it gives an error message.

Error message

Traceback (most recent call last):
  File "C:/Python/addOrganization.py", line 4, in <module>
    print(match.group(0))
AttributeError: 'NoneType' object has no attribute 'group'

Answer 1

Instead of re.search, try re.sub to print your expected output:

import re
i = "should we use regex more often, University of Pennsylvania. let me know at  321dsasdsa@dasdsa.com.lol"
line = re.sub(r"[\w\W]* ((Hospital|University|Centre|Law School|School|Academy|Department)[\w -]*)[\w\W]*$", r"\1", i)
print(line)  # University of Pennsylvania
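
For comparison, here is a minimal sketch of how re.search could be kept, under two assumptions that are not from the original post. First, the pattern is written without the leading and trailing "/" characters, because in Python those are matched as literal slashes rather than treated as regex delimiters. Second, the result is checked for None before calling .group(). The names KEYWORDS, PATTERN and extract_name are hypothetical, and the pattern only captures the keyword plus the words that follow it, not any words before it.

```python
import re

# Hypothetical keyword list, borrowed from the patterns used in this thread.
KEYWORDS = r"Hospital|University|Institute|Law School|School|Academy"

# No leading or trailing "/" here: Python would match those characters literally.
PATTERN = re.compile(r"\b(?:" + KEYWORDS + r")\b[\w -]*")

def extract_name(text):
    """Return the keyword plus the words that follow it, or None if nothing matches."""
    match = PATTERN.search(text)
    if match is None:  # re.search returns None when there is no match
        return None
    return match.group(0).strip()

line = "should we use regex more often, University of Pennsylvania. let me know at  321dsasdsa@dasdsa.com.lol"
print(extract_name(line))                              # University of Pennsylvania
print(extract_name("no institution mentioned here"))   # None
```

Returning None instead of calling .group() on a failed match avoids the AttributeError shown in the question.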

Answer 2

The test string you've given is a made-up one, since the university name is immediately followed by a period ('.'), while the other examples in your pastebin sample are not (they are followed by a comma).

line = "should we use regex more often, University of Pennsylvania. let me know at 321dsasdsa@dasdsa.com.lol"

I have managed to extract the names from the examples in your pastebin using a simple regex; you can see the details here: regex101.com

Logic

Since the institute name is separated by a comma (except in the first case, where the string starts with the university name), the name you want will lie in either group 1 or group 2 of the match.

Then you can check group 1 and group 2 against a predefined list of keywords and return whichever one contains a keyword.

Code

I have used two examples to show that it works.

import re

line1 = 'The George Washington University, Washington, DC, USA.'
line2 = 'Department of Pathology, University of Oklahoma Health Sciences Center, Oklahoma City, USA. adekunle-adesina@ouhsc.edu'

matchlist = ['Hospital', 'University', 'Institute', 'School', 'Academy']  # all keywords you need to look up
p = re.compile(r'^(.*?),\s+(.*?),(.*?)\.')  # groups 1 and 2 capture the first two comma-separated fields

# For each match, take group 1 if it contains any keyword from matchlist; otherwise fall back to group 2
line1match = [m.group(1) if any(x in m.group(1) for x in matchlist) else m.group(2) for m in p.finditer(line1)]
line2match = [m.group(1) if any(x in m.group(1) for x in matchlist) else m.group(2) for m in p.finditer(line2)]

print(line1match)
# [Out]: ['The George Washington University']

print(line2match)
# [Out]: ['University of Oklahoma Health Sciences Center']
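
The same logic can also be sketched as a small reusable function. This is only an illustration, not the approach from the answer: it assumes fields are always separated by commas, it splits on them with re.split instead of using fixed capture groups, and it returns None when no field contains a keyword. The names KEYWORDS and find_institution are hypothetical.

```python
import re

# Hypothetical keyword tuple mirroring the matchlist above.
KEYWORDS = ('Hospital', 'University', 'Institute', 'School', 'Academy')

def find_institution(line, keywords=KEYWORDS):
    """Return the first comma-separated field that contains a keyword, or None."""
    for field in re.split(r',\s*', line):
        if any(keyword in field for keyword in keywords):
            return field.strip()
    return None  # no field mentions any keyword

line1 = 'The George Washington University, Washington, DC, USA.'
line2 = 'Department of Pathology, University of Oklahoma Health Sciences Center, Oklahoma City, USA. adekunle-adesina@ouhsc.edu'

print(find_institution(line1))  # The George Washington University
print(find_institution(line2))  # University of Oklahoma Health Sciences Center
```

Checking every field rather than only the first two means the institution name does not have to appear in a fixed position, and the explicit None makes the no-match case easy to handle.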

Answer 1
0 2018-12-06 07:16:13

Answer 2
0 2018-12-06 10:16:46
