How to Extract Words Following a Key Word

I'm currently trying to extract the 4 words after "our", but I keep getting the words after "hour" and "your" as well.

For example, with "my family will send an email in 2 hours when we arrive at." (the text in the column):

What I want: nan (since there is no "our")

What I get: when we arrive at (because "hours" has "our" in it)

I tried the following code, with no luck.

our = 'our\W+(?P<after>(?:\w+\W+){,4})' 
Reviews_C['Review_for_Fam'] =Reviews_C.ReviewText2.str.extract(our, expand=True)

Can you please help?

Thank you!

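I'm surprised to see regex used for this, since it can add unneeded complexity. Could something like this work? It splits on whitespace, so "hours" and "your" never match "our", and it returns None (which pandas treats as missing) when "our" is absent.

def extract_next_words(sentence):
    # split the sentence into whitespace-separated words
    words = sentence.split()

    # if there is no standalone "our", there is nothing to extract
    if "our" not in words:
        return None

    # find the index of the first "our"
    index = words.index("our")

    # extract the next 4 words (fewer if the sentence ends first)
    next_words = words[index+1:index+5]

    # join the words into a string
    return " ".join(next_words)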

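You need to make sure "our" is delimited by whitespace (or the start of the string), like this:

our = r'(?:^|\s)our\W+(?P<after>(?:\w+\W+){,4})'

The (?:^|\s) prefix is the part to adjust: this example only handles whitespace and the start of the string, so you may need to extend it to allow quotes or other special characters before "our". The prefix is written as a non-capturing group because str.extract returns one column per capture group, and only the named "after" group is wanted here.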
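As an alternative to listing the allowed delimiters by hand, the word-boundary anchor \b can be used. The following is a minimal sketch using only the standard re module; the names OUR_PATTERN and words_after_our are made up for illustration, and the match is case-sensitive, so "Our" is not matched unless re.IGNORECASE is added.

```python
import re

# \b matches only at a word boundary, so the "our" inside "hours" or "your"
# is rejected. The named group captures one to four following words,
# without any trailing punctuation.
OUR_PATTERN = re.compile(r"\bour\b\W+(?P<after>\w+(?:\W+\w+){0,3})")

def words_after_our(text):
    """Return up to four words after a standalone "our", or None if absent."""
    match = OUR_PATTERN.search(text)
    if match is None:
        return None
    return match.group("after")

print(words_after_our("my family will send an email in 2 hours when we arrive at."))
print(words_after_our("We love our new house by the lake."))
```

The same pattern string should also work with Series.str.extract, where the named group becomes the column "after" and rows without a match come back as NaN.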

Here is generic code for finding the n words that follow a specific word x in a string. It handles multiple occurrences of x as well as strings where x does not occur. An occurrence followed by fewer than n words is skipped.

def find_n_word_after_x(in_str, x, n):
    in_str_wrds = in_str.strip().split()
    x = x.strip()
    if x in in_str_wrds:
        out_lst = []
        for i, i_val in enumerate(in_str_wrds):
            if i_val == x:
                if i+n < len(in_str_wrds):
                    out_str = in_str_wrds[i+1:i+1+n]
                    out_lst.append(" ".join(out_str))
        return out_lst
    else:
        return []

str1 = "our w1 w2 w3 w4 w5 w6"
str2 = "our w1 w2 our w3 w4 w5 w6"
str3 = "w1 w2 w3 w4 our w5 w6"
str4 = "w1"

print(find_n_word_after_x(str1, 'our', 4))
print(find_n_word_after_x(str2, 'our', 4))
print(find_n_word_after_x(str3, 'our', 4))
print(find_n_word_after_x(str4, 'our', 4))

Generated Output:

['w1 w2 w3 w4']
['w1 w2 our w3', 'w3 w4 w5 w6']
[]
[]
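The function above compares whitespace-split tokens exactly, so "our," (with trailing punctuation) or "Our" would not count as a match. A possible variant, sketched here under the assumption that words can be tokenised with the regex \w+ and compared case-insensitively, is shown below. The name words_after is hypothetical, and unlike the function above it also returns shorter results when fewer than n words follow.

```python
import re

def words_after(text, keyword, n):
    """Return up to n words after each standalone occurrence of keyword."""
    # \w+ drops punctuation; "hours" stays a single token, so it never equals "our".
    # Note: this also splits contractions such as "don't" into "don" and "t".
    tokens = re.findall(r"\w+", text)
    key = keyword.lower()
    return [
        " ".join(tokens[i + 1 : i + 1 + n])
        for i, tok in enumerate(tokens)
        if tok.lower() == key and i + 1 < len(tokens)  # skip a trailing keyword
    ]

print(words_after("Our host, our room and our", "our", 2))
print(words_after("an email in 2 hours when we arrive", "our", 4))
```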
