
How to locate string and substring in sentences

I am trying to locate items in sentences with a regular expression, where one item is a substring of another, but it always locates the substring. For example, there are two items, ["The Duke", "The Duke of A"], and some sentences:

The Duke

The Duke is a movie.

How is the movie The Duke?

The Duke of A

The Duke of A is a movie.

How is the movie The Duke of A?

What I want after finding the locations is:

The_Duke

The_Duke is a movie.

How is the movie The_Duke?

The_Duke_of_A

The_Duke_of_A is a movie.

How is the movie The_Duke_of_A?

The code I have tried is:

for sent in sentences:
    for item in ["The Duke", "The Duke of A"]:
        find = re.search(r'{0}'.format(item), sent)
        if find:
           sent = sent.replace(sent[find.start():find.end()], item.replace(" ", "_"))    

But I got:

The_Duke

The_Duke is a movie.

How is the movie The_Duke?

The_Duke of A

The_Duke of A is a movie.

How is the movie The_Duke of A?

Changing the position of the items in the list is not suitable in my case, as I have a large list (over 10,000 items).

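You can use re.sub with a function as the repl argument, so each match simply has its spaces replaced. Put the longer item first in the alternation, because the regex engine takes the first alternative that matches.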

import re

with open("filename.txt") as sentences:
    for line in sentences:
        print(re.sub(r"The Duke of A|The Duke",
                     lambda s: s[0].replace(' ', '_'),
                     line))

This results in:

The_Duke

The_Duke is a movie.

How is the movie The_Duke?

The_Duke_of_A

The_Duke_of_A is a movie.

How is the movie The_Duke_of_A?
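With over 10,000 items, writing the alternation by hand is not practical. Here is a minimal sketch of building it from the list, assuming that sorting a copy by length is acceptable even though the original list stays untouched. The function name underscore_terms is made up for illustration, and re.escape is added on the assumption that some items may contain regex metacharacters.

```python
import re

def underscore_terms(text, terms):
    """Replace spaces with underscores inside every occurrence of any term."""
    # Longest terms first: alternation takes the first alternative that matches,
    # so "The Duke of A" has to be tried before its prefix "The Duke".
    ordered = sorted(terms, key=len, reverse=True)
    # re.escape keeps terms containing regex metacharacters literal.
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    return pattern.sub(lambda m: m.group(0).replace(" ", "_"), text)

terms = ["The Duke", "The Duke of A"]  # original order, not modified
print(underscore_terms("How is the movie The Duke of A?", terms))
print(underscore_terms("The Duke is a movie.", terms))
```

The pattern is compiled once per call, so for many sentences it may be worth compiling it once outside the function and reusing it.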

Your loop looks for "The Duke" first. If re finds a match, you replace it with "The_Duke". The second pass of the loop then looks for "The Duke of A", but re cannot find a match because the text now reads "The_Duke of A".

Putting the longer item first should work:

for sent in sentences:
    # Try the longer item first so "The Duke" cannot match inside "The Duke of A".
    for item in ["The Duke of A", "The Duke"]:
        find = re.search(r'{0}'.format(item), sent)
        if find:
            sent = sent.replace(sent[find.start():find.end()], item.replace(" ", "_"))
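To make the effect of the ordering concrete, here is a small sketch that runs the same loop with both orders on one sentence. The wrapper name replace_in_order is hypothetical, and re.escape is an addition that assumes the items should be matched literally.

```python
import re

def replace_in_order(sent, items):
    # Same search-and-replace loop as above, wrapped in a function
    # (hypothetical name) so that both orders can be compared.
    for item in items:
        find = re.search(re.escape(item), sent)
        if find:
            sent = sent.replace(find.group(), item.replace(" ", "_"))
    return sent

sentence = "How is the movie The Duke of A?"
short_first = replace_in_order(sentence, ["The Duke", "The Duke of A"])
long_first = replace_in_order(sentence, ["The Duke of A", "The Duke"])
print(short_first)  # How is the movie The_Duke of A?
print(long_first)   # How is the movie The_Duke_of_A?
```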

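If you cannot change the position of the items in the list, you could try this version. In the first pass we collect all matches, and in the second pass we do the substitution. The recorded positions stay valid because replacing a space with an underscore does not change the length of the string: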

data = '''The Duke
The Duke is a movie.
How is the movie The Duke?
The Duke of A
The Duke of A is a movie.
How is the movie The Duke of A?'''

terms = ["The Duke", "The Duke of A"]

import re

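# First pass: record the span of every match of every term.
to_change = []
for t in terms:
    # re.escape keeps terms that contain regex metacharacters literal.
    for g in re.finditer(re.escape(t), data):
        to_change.append((g.start(), g.end()))

# Second pass: turn whitespace into underscores inside each recorded span.
for (start, end) in to_change:
    data = data[:start] + re.sub(r'\s', r'_', data[start:end]) + data[end:]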

print(data)

Prints:

The_Duke
The_Duke is a movie.
How is the movie The_Duke?
The_Duke_of_A
The_Duke_of_A is a movie.
How is the movie The_Duke_of_A?

Swap the positions of 'The Duke of A' and 'The Duke' in this line:

for item in ["The Duke", "The Duke of A"]:

so that it becomes:

for item in ["The Duke of A", "The Duke"]:


 