
Is there a Python regex to find a street name followed by one or more persons, each followed by a house number?

I have an image dataset that I am extracting text data from. I have the text as a string but now want to separate this text into a more structured form.

The data looks like this:

Camden Row,Camberwell, S.E—A. Massey, M.D.4.

Campden Hill, Kensington.
(Hornton House).

Campden Hill Road, Kensington.
James, M.D. 6.

Canning Town. E—R. J. Carey (Widdicombe-
co ee

Cannon Street. E.C.—R. Cresswell, 151.

Cannon Street Road. E.—R. W. Lammiman, 106.
—J. R. Morrison, 57.—B. R. Rygate, 126.—
J. J. Rygate, M.B. 126.

Canonbury N. (see foot note)—J. Cheetham, M.D.
(Springjield House),

Canonbury Lane, N.—H. Bateman,
Roberts, 10.—J. Rose, 3.

As you can see, each record consists of a street name, followed by a compass-direction abbreviation (N, S, E, W, NW, SE, etc.), and then one or more entries of a person's name and house number.

So far I have been using Python's NLTK. I am able to extract streets, names and numbers as individual entities using:

tagged = nltk.pos_tag(tokens)

What I would like to achieve is a list of:

[street name, person, house_number]

For example:

[[Cannon Street Road, R. W. Lammiman, 106], [Cannon Street Road, J. R. Morrison, 57]]

My plan was to use the street name as the anchor for the start of a record and the digits as the anchor for the end, but this is complicated by the fact that a single street can have several names and house numbers.

Can anyone suggest a method/regex that might work for this?

Thank you kindly if so. James.

If the format is consistent, you can split the string on the dash ("—") that separates the street from each person entry.

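text = "Cannon Street Road. E.—R. W. Lammiman, 106.—J. R. Morrison, 57.—B. R. Rygate, 126.—J. J. Rygate, M.B. 126."

# The first chunk is the street; every later chunk is one person entry.
parts = text.split("—")
streetname = parts[0]

# Pair the street with each person entry.
infos = []
for entry in parts[1:]:
    infos.append([streetname, entry])

print(infos)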

The result is: [['Cannon Street Road. E.', 'R. W. Lammiman, 106.'], ['Cannon Street Road. E.', 'J. R. Morrison, 57.'], ['Cannon Street Road. E.', 'B. R. Rygate, 126.'], ['Cannon Street Road. E.', 'J. J. Rygate, M.B. 126.']]
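To get closer to the [street name, person, house_number] shape asked for in the question, each chunk can be run through a small regex after the split. The following is only a sketch under some assumptions: the names ENTRY_RE, STREET_RE and parse_record are made up for illustration, the house number is assumed to be the last run of digits in an entry, and the district is assumed to be a trailing group of the letters N, S, E, W, C with optional dots. Entries that break those assumptions (a house name instead of a number, OCR noise) are kept as raw text with None for the number rather than guessed at.

```python
import re

# Hypothetical pattern: a name, an optional comma, then a trailing house number.
ENTRY_RE = re.compile(r"^(?P<person>.+?),?\s*(?P<number>\d+)\.?$")

# Hypothetical pattern: street text, then a district such as "E." or "E.C.".
STREET_RE = re.compile(r"^(?P<street>.+?)[.,]?\s+(?P<district>(?:[NSEWC]\.?)+)$")


def parse_record(record):
    """Split one street record into [street, person, house_number] rows."""
    # Undo line wrapping, then split on the dash that separates entries.
    flat = " ".join(record.split())
    parts = [p.strip() for p in flat.split("—") if p.strip()]

    # Drop the district from the street header when it is recognisable.
    header = STREET_RE.match(parts[0])
    street = header.group("street") if header else parts[0]

    rows = []
    for entry in parts[1:]:
        match = ENTRY_RE.match(entry)
        if match:
            rows.append([street, match.group("person"), match.group("number")])
        else:
            # No trailing number found: keep the raw entry for manual review.
            rows.append([street, entry, None])
    return rows


record = """Cannon Street Road. E.—R. W. Lammiman, 106.
—J. R. Morrison, 57.—B. R. Rygate, 126.—
J. J. Rygate, M.B. 126."""

for row in parse_record(record):
    print(row)
```

For the sample record this gives four rows, the first being ['Cannon Street Road', 'R. W. Lammiman', '106']. Qualifications such as "M.B." stay attached to the person's name, so a further pass would be needed if they should be separated out.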
