Extracting the 2 words before, the word itself, and the 2 words after a specific string in Python?

I have a Pandas series

       Explanation 

a      "how are you doing today where is she going" 
b      "do you like blueberry ice cream does not make sure " 
c      "this works but you know that the translation is on" 

I want to extract the 2 words before and the 2 words after the string "you", keeping "you" itself.

For example, I want something like:

       Explanation                                                   Explanation Extracted

a      "how are you doing today where is she going"                  "how are you doing today"
b      "do you like blueberry ice cream does not make sure "         "do you like blueberry"
c      "this works but you know that the translation is on"          "works but you know that"

This regular expression gives me the two words before and after "you", but doesn't include "you" itself:

(?P<before>(?:\w+\W+){,2})you\W+(?P<after>(?:\w+\W+){,2})

How do I change it so that "you" is included?

You can use

df['Explanation Extracted'] = df['Explanation'].str.extract(r'\b((?:\w+\W+){0,2}you\b(?:\W+\w+){0,2})', expand=False)

See the regex demo.

Details:

  • \b - a word boundary
  • (?:\w+\W+){0,2} - zero, one or two occurrences of one or more word chars and then one or more non-word chars
  • you - the literal string you
  • \b - a word boundary
  • (?:\W+\w+){0,2} - zero, one or two occurrences of one or more non-word chars and then one or more word chars.
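To check the pattern outside Pandas, here is a minimal sketch using only the standard re module on the three sample sentences from the question. The helper name extract_context is made up for illustration and is not part of the original answer.

```python
import re

# Same pattern as in the answer: up to two words before and after "you".
PATTERN = re.compile(r'\b((?:\w+\W+){0,2}you\b(?:\W+\w+){0,2})')

samples = [
    "how are you doing today where is she going",
    "do you like blueberry ice cream does not make sure ",
    "this works but you know that the translation is on",
]


def extract_context(text):
    """Return the matched window around "you", or None when "you" is absent.

    Hypothetical helper, named here only for illustration.
    """
    match = PATTERN.search(text)
    return match.group(1) if match else None


results = [extract_context(s) for s in samples]
for r in results:
    print(r)
```

Because both quantifiers are {0,2}, the match still succeeds when fewer than two words are available on either side, as in the second sentence where only "do" precedes "you".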

A Pandas test:

>>> import pandas as pd
>>> df = pd.DataFrame({'Explanation':["how are you doing today where is she going", "do you like blueberry ice cream does not make sure ", "this works but you know that the translation is on"]})
>>> df['Explanation Extracted'] = df['Explanation'].str.extract(r'\b((?:\w+\W+){0,2}you\b(?:\W+\w+){0,2})', expand=False)
>>> df
                                         Explanation    Explanation Extracted
0         how are you doing today where is she going  how are you doing today
1  do you like blueberry ice cream does not make ...    do you like blueberry
2  this works but you know that the translation i...  works but you know that
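The pattern above hard-codes both the word and the window size. If either needs to vary, one possible sketch is to build the pattern string from parameters. The function name build_window_pattern and its arguments are assumptions for illustration, and it only makes sense for targets made of word characters, since \b is placed right after the word.

```python
import re


def build_window_pattern(word, n_before=2, n_after=2):
    """Build a pattern string capturing `word` plus up to n words on each side.

    Hypothetical helper; assumes `word` consists of word characters only.
    """
    # re.escape guards against regex metacharacters in the target word.
    target = re.escape(word)
    return (
        r'\b((?:\w+\W+){0,%d}%s\b(?:\W+\w+){0,%d})'
        % (n_before, target, n_after)
    )


pattern = build_window_pattern("you")
narrow = build_window_pattern("is", n_before=1, n_after=1)
sentence = "this works but you know that the translation is on"
print(re.search(narrow, sentence).group(1))
```

The returned string keeps a single capturing group, so it should be usable with str.extract in the same way as the literal pattern in the test above.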

Here is a way to do it with no regex and no pandas, since I don't think either is needed for this case.

text1 = "how are you doing today where is she going"
text2 = "do you like blueberry ice cream does not make sure "
text3 = "this works but you know that the translation is on"


def show_trunc_sentence(text, word='you'):  # pass another word to override the default 'you'
    words = text.split()
    word_loc = words.index(word)  # raises ValueError if the word is absent
    start = max(word_loc - 2, 0)  # clamp at 0 so a negative index never wraps around
    before = words[start: word_loc + 1]
    after = words[word_loc + 1:word_loc + 3]
    print(" ".join(before + after))


for text in (text1, text2, text3):
    show_trunc_sentence(text)

Outputs:

text1 - how are you doing today
text2 - do you like blueberry
text3 - works but you know that
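If the result is needed as a value rather than printed, for instance to fill a new column, a variant along these lines might be more convenient. This is only a sketch: the name trunc_sentence and the window parameter are made up for illustration, and returning None for a missing word is an assumed design choice.

```python
def trunc_sentence(text, word='you', window=2):
    """Return `word` with up to `window` words on each side, or None if absent.

    Hypothetical variant of the split-based approach, for illustration only.
    """
    tokens = text.split()
    if word not in tokens:
        return None
    loc = tokens.index(word)          # first occurrence only
    start = max(loc - window, 0)      # clamp so the slice never starts below 0
    return " ".join(tokens[start:loc + window + 1])


print(trunc_sentence("this works but you know that the translation is on"))
```

Note that splitting on whitespace leaves punctuation attached to words, so a token such as "you," would not be found by this approach, whereas the regex with \b would still match it.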
