
Regex between two characters where string contains a word

I'm looking to extract the parent university name from affiliations written in various formats. For example:

institute of organic chemistry, rwth aachen university, landoltweg 1, 52074 aachen, germany
school of medical sciences, university of new south wales, save sight institute, university of sydney
save sight institute, university of sydney
unit for laboratory animal medicine, university of michigan, ann arbor 48109
membrane dynamics, department of biology, technische universität darmstadt, schnittspahnstrasse 3, 64287 darmstadt, germany 
university of new south wales, sydney, australia

My thought is that the parent university is usually sandwiched between two commas and contains the word "university" (or "universität", or the equivalent in other languages). So my regex is as follows:

(?:,)((.*?university.*?)|(.*?universität.*?))(?:,|$)

However, I'm getting tripped up in the following two places:

  1. If the group containing "university" isn't the 2nd comma sandwich (e.g., line 5)
  2. If the group containing "university" is at the beginning of the full string (e.g., line 6)
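
Both cases can be reproduced with a short Python check. This is only a sketch for illustration: the pattern is the one above, and the variable names are made up here. In the first case `.` also matches commas, so the lazy `.*?` keeps growing across a segment boundary until it reaches the keyword. In the second case the pattern requires a leading comma, so a university at the very start of the string can never match.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r"(?:,)((.*?university.*?)|(.*?universität.*?))(?:,|$)")

line5 = ("membrane dynamics, department of biology, technische universität "
         "darmstadt, schnittspahnstrasse 3, 64287 darmstadt, germany")
line6 = "university of new south wales, sydney, australia"

m5 = PATTERN.search(line5)
m6 = PATTERN.search(line6)

# Case 1: the capture swallows the preceding segment as well.
print(repr(m5.group(1)))
# Case 2: no comma precedes the university, so there is no match at all.
print(m6)
```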

I'm also open to other ideas on how to extract this. I've thought about geocoding the address and then doing a reverse geocode on Google to find the place. However, I have millions of records.

This answer gets me close.

I have done something similar, and I have two notes:

  1. Google will be better at parsing the information than you will be, so you can defer that to the Google Geocoding API.
  2. This isn't really something that a regex is good at, to say the least.

I have taken your text above and done the following to show an example (sorry, it's in Python; I don't know R):

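import requests  # third-party HTTP library

BASE_URL = 'https://maps.googleapis.com/maps/api/geocode/json'
for line in s.splitlines():  # s holds the affiliation strings from the question, one per line
    params = {'key': 'APIKEY', 'address': line}  # requests URL-encodes the query string
    print(requests.get(BASE_URL, params=params).json()['results'][0]['address_components'])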

[{u'long_name': u'1', u'types': [u'street_number'], u'short_name': u'1'}, {u'long_name': u'Landoltweg', u'types': [u'route'], u'short_name': u'Landoltweg'}, {u'long_name': u'Aachen', u'types': [u'locality', u'political'], u'short_name': u'AC'}, {u'long_name': u'K\xf6ln', u'types': [u'administrative_area_level_2', u'political'], u'short_name': u'K'}, {u'long_name': u'Nordrhein-Westfalen', u'types': [u'administrative_area_level_1', u'political'], u'short_name': u'NRW'}, {u'long_name': u'Germany', u'types': [u'country', u'political'], u'short_name': u'DE'}, {u'long_name': u'52074', u'types': [u'postal_code'], u'short_name': u'52074'}]
[{u'long_name': u'8', u'types': [u'street_number'], u'short_name': u'8'}, {u'long_name': u'Macquarie Street', u'types': [u'route'], u'short_name': u'Macquarie St'}, {u'long_name': u'Sydney', u'types': [u'locality', u'political'], u'short_name': u'Sydney'}, {u'long_name': u'Council of the City of Sydney', u'types': [u'administrative_area_level_2', u'political'], u'short_name': u'Sydney'}, {u'long_name': u'New South Wales', u'types': [u'administrative_area_level_1', u'political'], u'short_name': u'NSW'}, {u'long_name': u'Australia', u'types': [u'country', u'political'], u'short_name': u'AU'}, {u'long_name': u'2000', u'types': [u'postal_code'], u'short_name': u'2000'}]
[{u'long_name': u'8', u'types': [u'street_number'], u'short_name': u'8'}, {u'long_name': u'Macquarie Street', u'types': [u'route'], u'short_name': u'Macquarie St'}, {u'long_name': u'Sydney', u'types': [u'locality', u'political'], u'short_name': u'Sydney'}, {u'long_name': u'Council of the City of Sydney', u'types': [u'administrative_area_level_2', u'political'], u'short_name': u'Sydney'}, {u'long_name': u'New South Wales', u'types': [u'administrative_area_level_1', u'political'], u'short_name': u'NSW'}, {u'long_name': u'Australia', u'types': [u'country', u'political'], u'short_name': u'AU'}, {u'long_name': u'2000', u'types': [u'postal_code'], u'short_name': u'2000'}]
[{u'long_name': u'2800', u'types': [u'street_number'], u'short_name': u'2800'}, {u'long_name': u'Plymouth Road', u'types': [u'route'], u'short_name': u'Plymouth Rd'}, {u'long_name': u'Northside', u'types': [u'neighborhood', u'political'], u'short_name': u'Northside'}, {u'long_name': u'Ann Arbor', u'types': [u'locality', u'political'], u'short_name': u'Ann Arbor'}, {u'long_name': u'Washtenaw County', u'types': [u'administrative_area_level_2', u'political'], u'short_name': u'Washtenaw County'}, {u'long_name': u'Michigan', u'types': [u'administrative_area_level_1', u'political'], u'short_name': u'MI'}, {u'long_name': u'United States', u'types': [u'country', u'political'], u'short_name': u'US'}, {u'long_name': u'48109', u'types': [u'postal_code'], u'short_name': u'48109'}, {u'long_name': u'2800', u'types': [u'postal_code_suffix'], u'short_name': u'2800'}]
[{u'long_name': u'3', u'types': [u'street_number'], u'short_name': u'3'}, {u'long_name': u'Schnittspahnstra\xdfe', u'types': [u'route'], u'short_name': u'Schnittspahnstra\xdfe'}, {u'long_name': u'Darmstadt-Ost', u'types': [u'political', u'sublocality', u'sublocality_level_1'], u'short_name': u'Darmstadt-Ost'}, {u'long_name': u'Darmstadt', u'types': [u'locality', u'political'], u'short_name': u'Darmstadt'}, {u'long_name': u'Darmstadt', u'types': [u'administrative_area_level_2', u'political'], u'short_name': u'DA'}, {u'long_name': u'Hessen', u'types': [u'administrative_area_level_1', u'political'], u'short_name': u'HE'}, {u'long_name': u'Germany', u'types': [u'country', u'political'], u'short_name': u'DE'}, {u'long_name': u'64287', u'types': [u'postal_code'], u'short_name': u'64287'}]
[{u'long_name': u'Sydney', u'types': [u'locality', u'political'], u'short_name': u'Sydney'}, {u'long_name': u'Randwick City Council', u'types': [u'administrative_area_level_2', u'political'], u'short_name': u'Randwick'}, {u'long_name': u'New South Wales', u'types': [u'administrative_area_level_1', u'political'], u'short_name': u'NSW'}, {u'long_name': u'Australia', u'types': [u'country', u'political'], u'short_name': u'AU'}, {u'long_name': u'2052', u'types': [u'postal_code'], u'short_name': u'2052'}]

Having said that, I cannot imagine you would have 2M unique records you'd want to pass, so you may want to group them (or maybe even use the zip code as a key, if it exists) before sending them out.
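
The grouping idea could look roughly like this. It is a sketch, not a tested pipeline: `geocode_unique` is a hypothetical helper, and `geocode` stands for whatever function wraps the API request. Each distinct affiliation (after normalizing case and whitespace) is sent once, and the cached result is reused for its duplicates.

```python
def geocode_unique(affiliations, geocode):
    """Call `geocode` once per distinct normalized affiliation.

    `geocode` is any callable taking an address string (hypothetical here);
    results are returned in the same order as the input records.
    """
    cache = {}
    results = []
    for raw in affiliations:
        key = " ".join(raw.lower().split())  # normalize case and whitespace
        if key not in cache:
            cache[key] = geocode(key)  # only the first occurrence hits the API
        results.append(cache[key])
    return results


if __name__ == "__main__":
    calls = []

    def fake_geocode(address):
        # Stand-in for the real request so the sketch runs offline.
        calls.append(address)
        return {"query": address}

    records = [
        "save sight institute, university of sydney",
        "Save Sight Institute,  University of Sydney",
        "unit for laboratory animal medicine, university of michigan, ann arbor 48109",
    ]
    geocode_unique(records, fake_geocode)
    print(len(records), "records,", len(calls), "requests")
```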

It may be easier to match the text surrounding the keyword while excluding the commas:

([^,]*(university|universität)[^,]*)
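
A minimal sketch of using this pattern from Python, assuming the affiliations are already lower-cased as in the question. The name `find_universities` is hypothetical, and the inner group is made non-capturing here so that `findall` returns the whole comma-delimited segment instead of just the keyword.

```python
import re

# Same idea as the pattern above, with a non-capturing inner group.
UNIVERSITY_SEGMENT = re.compile(r"[^,]*(?:university|universität)[^,]*")


def find_universities(affiliation):
    """Return every comma-delimited segment that mentions a university."""
    return [segment.strip() for segment in UNIVERSITY_SEGMENT.findall(affiliation)]


examples = [
    "institute of organic chemistry, rwth aachen university, landoltweg 1, 52074 aachen, germany",
    "school of medical sciences, university of new south wales, save sight institute, university of sydney",
    "membrane dynamics, department of biology, technische universität darmstadt, schnittspahnstrasse 3, 64287 darmstadt, germany",
    "university of new south wales, sydney, australia",
]

for text in examples:
    print(find_universities(text))
```

When an affiliation names more than one university, as in the second example, every matching segment is returned, so picking the parent institution still needs a separate rule.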
