
Regex between two characters where string contains a word

I'm looking to extract the parent university name from affiliations written in various formats. For example:

institute of organic chemistry, rwth aachen university, landoltweg 1, 52074 aachen, germany
school of medical sciences, university of new south wales, save sight institute, university of sydney
save sight institute, university of sydney
unit for laboratory animal medicine, university of michigan, ann arbor 48109
membrane dynamics, department of biology, technische universität darmstadt, schnittspahnstrasse 3, 64287 darmstadt, germany 
university of new south wales, sydney, australia

My thinking is that the parent university is usually sandwiched between two commas and contains the word "university" (or "universität", or the equivalent in other languages). So my regex is as follows:

(?:,)((.*?university.*?)|(.*?universität.*?))(?:,|$)

However, I'm getting tripped up in the following 2 places:

  1. If the group containing "university" isn't the 2nd comma sandwich (e.g., the 5th example line)
  2. If the group containing "university" is at the beginning of the full string (e.g., the 6th example line)

Also open to other ideas on how to extract this. I've thought about geocoding the address and then doing a reverse geocode on Google to find the place. However, I have millions of records.

This answer gets me close.

I have done something similar, and I have two notes:

  1. Google will be better at parsing the information than you will be, so you can defer that to the Google Geocoding API.
  2. This isn't really something that a regex would be good at, to say the least.

I have taken your text above and run the following as an example (sorry it's in Python, I don't know R):

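import requests

BASE_URL = 'https://maps.googleapis.com/maps/api/geocode/json'
for line in s.splitlines():  # s holds the affiliations above, one per line
    resp = requests.get(BASE_URL, params={'key': 'APIKEY', 'address': line})  # params URL-encodes the address
    print(resp.json()['results'][0]['address_components'])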

[{u'long_name': u'1', u'types': [u'street_number'], u'short_name': u'1'}, {u'long_name': u'Landoltweg', u'types': [u'route'], u'short_name': u'Landoltweg'}, {u'long_name': u'Aachen', u'types': [u'locality', u'political'], u'short_name': u'AC'}, {u'long_name': u'K\xf6ln', u'types': [u'administrative_area_level_2', u'political'], u'short_name': u'K'}, {u'long_name': u'Nordrhein-Westfalen', u'types': [u'administrative_area_level_1', u'political'], u'short_name': u'NRW'}, {u'long_name': u'Germany', u'types': [u'country', u'political'], u'short_name': u'DE'}, {u'long_name': u'52074', u'types': [u'postal_code'], u'short_name': u'52074'}]
[{u'long_name': u'8', u'types': [u'street_number'], u'short_name': u'8'}, {u'long_name': u'Macquarie Street', u'types': [u'route'], u'short_name': u'Macquarie St'}, {u'long_name': u'Sydney', u'types': [u'locality', u'political'], u'short_name': u'Sydney'}, {u'long_name': u'Council of the City of Sydney', u'types': [u'administrative_area_level_2', u'political'], u'short_name': u'Sydney'}, {u'long_name': u'New South Wales', u'types': [u'administrative_area_level_1', u'political'], u'short_name': u'NSW'}, {u'long_name': u'Australia', u'types': [u'country', u'political'], u'short_name': u'AU'}, {u'long_name': u'2000', u'types': [u'postal_code'], u'short_name': u'2000'}]
[{u'long_name': u'8', u'types': [u'street_number'], u'short_name': u'8'}, {u'long_name': u'Macquarie Street', u'types': [u'route'], u'short_name': u'Macquarie St'}, {u'long_name': u'Sydney', u'types': [u'locality', u'political'], u'short_name': u'Sydney'}, {u'long_name': u'Council of the City of Sydney', u'types': [u'administrative_area_level_2', u'political'], u'short_name': u'Sydney'}, {u'long_name': u'New South Wales', u'types': [u'administrative_area_level_1', u'political'], u'short_name': u'NSW'}, {u'long_name': u'Australia', u'types': [u'country', u'political'], u'short_name': u'AU'}, {u'long_name': u'2000', u'types': [u'postal_code'], u'short_name': u'2000'}]
[{u'long_name': u'2800', u'types': [u'street_number'], u'short_name': u'2800'}, {u'long_name': u'Plymouth Road', u'types': [u'route'], u'short_name': u'Plymouth Rd'}, {u'long_name': u'Northside', u'types': [u'neighborhood', u'political'], u'short_name': u'Northside'}, {u'long_name': u'Ann Arbor', u'types': [u'locality', u'political'], u'short_name': u'Ann Arbor'}, {u'long_name': u'Washtenaw County', u'types': [u'administrative_area_level_2', u'political'], u'short_name': u'Washtenaw County'}, {u'long_name': u'Michigan', u'types': [u'administrative_area_level_1', u'political'], u'short_name': u'MI'}, {u'long_name': u'United States', u'types': [u'country', u'political'], u'short_name': u'US'}, {u'long_name': u'48109', u'types': [u'postal_code'], u'short_name': u'48109'}, {u'long_name': u'2800', u'types': [u'postal_code_suffix'], u'short_name': u'2800'}]
[{u'long_name': u'3', u'types': [u'street_number'], u'short_name': u'3'}, {u'long_name': u'Schnittspahnstra\xdfe', u'types': [u'route'], u'short_name': u'Schnittspahnstra\xdfe'}, {u'long_name': u'Darmstadt-Ost', u'types': [u'political', u'sublocality', u'sublocality_level_1'], u'short_name': u'Darmstadt-Ost'}, {u'long_name': u'Darmstadt', u'types': [u'locality', u'political'], u'short_name': u'Darmstadt'}, {u'long_name': u'Darmstadt', u'types': [u'administrative_area_level_2', u'political'], u'short_name': u'DA'}, {u'long_name': u'Hessen', u'types': [u'administrative_area_level_1', u'political'], u'short_name': u'HE'}, {u'long_name': u'Germany', u'types': [u'country', u'political'], u'short_name': u'DE'}, {u'long_name': u'64287', u'types': [u'postal_code'], u'short_name': u'64287'}]
[{u'long_name': u'Sydney', u'types': [u'locality', u'political'], u'short_name': u'Sydney'}, {u'long_name': u'Randwick City Council', u'types': [u'administrative_area_level_2', u'political'], u'short_name': u'Randwick'}, {u'long_name': u'New South Wales', u'types': [u'administrative_area_level_1', u'political'], u'short_name': u'NSW'}, {u'long_name': u'Australia', u'types': [u'country', u'political'], u'short_name': u'AU'}, {u'long_name': u'2052', u'types': [u'postal_code'], u'short_name': u'2052'}]

Having said that, I can't imagine all 2M of your records are unique, so you may want to group them (or maybe even use the zip code as a key, where one exists) before sending them out.
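The grouping idea could be sketched roughly as follows. This is only an illustration: `normalize` and `geocode_unique` are hypothetical helper names, and `geocode` stands in for whatever function actually performs the API request (it is passed in as a parameter and not implemented here).

```python
def normalize(affiliation):
    """Collapse case and whitespace so trivially different strings share one key."""
    return " ".join(affiliation.lower().split())


def geocode_unique(affiliations, geocode):
    """Call `geocode` once per distinct normalized affiliation.

    `geocode` is any callable taking a string; it is a placeholder
    (an assumption) for the real API request.
    """
    cache = {}
    results = []
    for raw in affiliations:
        key = normalize(raw)
        if key not in cache:
            cache[key] = geocode(key)  # only the first occurrence triggers a request
        results.append(cache[key])
    return results
```

The results come back in the same order as the input, so they can be joined back onto the original records, while the number of requests is bounded by the number of distinct strings.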

It may be easier to match the text surrounding the keyword while excluding the commas themselves:

([^,]*(university|universität)[^,]*)
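As a rough check, here is how that pattern might be applied in Python to a few of the sample affiliations from the question. The function name `university_segments` is made up for this sketch, and the inner group is written as non-capturing so that the whole comma-free segment is what gets returned. Leading spaces left over from the ", " separators are stripped.

```python
import re

# A comma-free run of text that contains one of the keywords.
UNIVERSITY_RE = re.compile(r"[^,]*(?:university|universität)[^,]*", re.IGNORECASE)


def university_segments(affiliation):
    """Return every comma-delimited segment that mentions a university."""
    return [m.group(0).strip() for m in UNIVERSITY_RE.finditer(affiliation)]


affiliations = [
    "institute of organic chemistry, rwth aachen university, landoltweg 1, 52074 aachen, germany",
    "school of medical sciences, university of new south wales, save sight institute, university of sydney",
    "membrane dynamics, department of biology, technische universität darmstadt, schnittspahnstrasse 3, 64287 darmstadt, germany",
    "university of new south wales, sydney, australia",
]

for affiliation in affiliations:
    print(university_segments(affiliation))
```

Because the match is not tied to a particular comma position, it handles both the case where the university is the third segment and the case where it starts the string. When a record names two universities, as in the second sample, both segments are returned and choosing between them still needs a separate rule.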
