
Get the position of n-gram words in a sentence

In Python, I want to get the position of a word or phrase in a sentence, counted in words. The string to match may span several words.

sentence = "Bloomberg announced today that Gordian Capital will implement the solution to help its clients pursue new fund opportunities faster."

search_str = "Bloomberg" 

Expected output:

0

The string to match may consist of several words. In that case I want the position of its first word.

search_str = "Gordian Capital" 

Expected output:

4

A search_str could also combine special characters and numbers, such as $5.1 billion. I tried something like the following, but it splits the original sentence into single words, and I don't know how to handle the n-gram case.

result = [i+1 for i,w in enumerate(sentence.split()) if w == search_str]

Any solution would be appreciated. Thanks

  1. Split sentence using search_str

result = sentence.split(search_str)

  2. Take the first element of the result and split it by spaces

result = result[0].split(' ')

It may seem that this is enough, and that you only need to count the elements of the resulting list with

len(result)

but sometimes an empty element is present, for example when the text before the match ends with a space.

To avoid this, the list has to be filtered:

result = [elem for elem in result if elem != ""]

print(len(result))
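
To see where the empty element comes from, here is a small check. It uses a shortened version of the sentence from the question (the shortening is only for readability); the variable names prefix, raw_parts and words_before are made up for this illustration.

```python
sentence = "Bloomberg announced today that Gordian Capital will implement the solution."
search_str = "Gordian Capital"

# Text before the first occurrence of search_str; note that it ends with a space.
prefix = sentence.split(search_str)[0]

# Splitting on a single space keeps an empty string after that trailing space.
raw_parts = prefix.split(" ")
print(raw_parts)          # ['Bloomberg', 'announced', 'today', 'that', '']

# Dropping the empty strings leaves only the words that precede the match.
words_before = [part for part in raw_parts if part != ""]
print(len(words_before))  # 4
```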

All of this can be written in one line:

result = len([elem for elem in sentence.split(search_str)[0].split(" ") if elem != ""])
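
As a rough sketch, the same steps could be wrapped in a function. The name word_position_by_split and the -1 return value for a missing phrase are assumptions made here for illustration, not part of the approach above. The guard matters because, when search_str does not occur at all, the split returns the whole sentence and the count would silently equal the total number of words. The sketch assumes a non-empty search_str, and like the one-liner it matches raw substrings, so a phrase that starts in the middle of a word is also counted.

```python
def word_position_by_split(sentence, search_str):
    """Return the 0-based word index where search_str first starts, or -1."""
    # Without this guard, a missing phrase would yield the total word count.
    if search_str not in sentence:
        return -1
    # Keep only the text before the first occurrence.
    prefix = sentence.split(search_str, 1)[0]
    # str.split() with no argument already discards empty strings.
    return len(prefix.split())


sentence = ("Bloomberg announced today that Gordian Capital will implement "
            "the solution to help its clients pursue new fund opportunities faster.")

print(word_position_by_split(sentence, "Bloomberg"))        # 0
print(word_position_by_split(sentence, "Gordian Capital"))  # 4
print(word_position_by_split(sentence, "Acme Holdings"))    # -1
```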

Try enumeration.

Since you're only really looking for the position of the first word of the search string, we can split the search string too and just try to match its first word.

Here's a one-liner that solves the issue:

search_str = "Gordian Capital"

[k for k, v in enumerate(sentence.split()) if v.lower() == search_str.split()[0].lower()]

Result:

[4]

Here's a sentence with more than one Gordian Capital.

sentence = "the Bloomberg announced today that Gordian Capital will implement the solution to help Gordian Capital's clients pursue new fund opportunities faster, says Gordian Capital."

[k for k, v in enumerate(sentence.split()) if v.lower() == search_str.split()[0].lower()]

Result:

[5, 13, 22]

Note: Since string comparison in Python is case-sensitive, we put both terms in lowercase so that differences in capitalization don't prevent a match.

This part:

search_str.split()[0].lower()

splits the search string on whitespace (the default), then takes the first item and converts it to lowercase, giving the target that each word of the sentence is compared against.
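
Matching only the first word also reports positions where the rest of the phrase differs (any word equal to "Gordian" counts, whatever follows it). A possible variation, sketched here under the assumption that whole whitespace-separated words should match, compares every window of consecutive words with the full phrase. The function name ngram_positions is hypothetical. Punctuation attached to a word (such as "Capital's" or "Capital.") is not stripped, so those occurrences are not reported by this sketch.

```python
def ngram_positions(sentence, search_str):
    """Return every 0-based word index where the whole search phrase starts."""
    words = [w.lower() for w in sentence.split()]
    target = [w.lower() for w in search_str.split()]
    n = len(target)
    if n == 0:
        return []
    # Compare each window of n consecutive words with the target phrase.
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == target]


sentence = ("Bloomberg announced today that Gordian Capital will implement "
            "the solution to help its clients pursue new fund opportunities faster.")

print(ngram_positions(sentence, "Gordian Capital"))  # [4]
print(ngram_positions(sentence, "Bloomberg"))        # [0]
print(ngram_positions("The fund raised $5.1 billion last year.", "$5.1 billion"))  # [3]
```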


 