After tokenizing a column, get 2 words before and after a specific word

I just tokenized a column in a dataframe using nltk.word_tokenize. This column now looks like

df.tokenized
> 0     [apple, hi, dog, boy, why...]
> 1     [table, hey, girl, cat, dog, 2, 3...

For each row, I need to get 2 words before and 2 words after the word "dog". I want to put it all in another column in the same dataframe. The output I expect is something like:

df.tokenized_part2
> 0     [apple, hi, dog, boy, why]
> 1     [girl, cat, dog, 2, 3]

So I need to create this column tokenized_part2.

If you need this info: the dtype of the tokenized column is object.

Does someone know how to do that?

You can use apply() to run a function on every cell in the column. This function can find the position of "dog" in the list and return the slice words[pos-2:pos+3].

import pandas as pd

df = pd.DataFrame({
    "tokenized": [
        ['apple', 'hi', 'dog', 'boy', 'why', 'other'],
        ['table', 'hey', 'girl', 'cat', 'dog', '2', '3'],
        ['A', 'B', 'C'],
    ]
})

def process(words):
    if 'dog' in words:
        pos = words.index('dog')
        # clamp the start at 0 so a match near the beginning of the list
        # doesn't produce a negative index, which would wrap around
        return words[max(pos - 2, 0):pos + 3]
    return []

df["tokenized_2"] = df["tokenized"].apply(process)

print(df)

Result:

                            tokenized                 tokenized_2
0   [apple, hi, dog, boy, why, other]  [apple, hi, dog, boy, why]
1  [table, hey, girl, cat, dog, 2, 3]      [girl, cat, dog, 2, 3]
2                           [A, B, C]                          []

EDIT:

To make it more universal, the function could take "dog" (or any other word) as a parameter; you would then call it with a lambda (or functools.partial).

import pandas as pd

df = pd.DataFrame({
    "tokenized": [
        ['apple', 'hi', 'dog', 'boy', 'why', 'other'],
        ['table', 'hey', 'girl', 'cat', 'dog', '2', '3'],
        ['A', 'B', 'C'],
    ]
})

def process(words, search):
    if search in words:
        pos = words.index(search)
        # clamp the start at 0 so a match near the beginning of the list
        # doesn't produce a negative index, which would wrap around
        return words[max(pos - 2, 0):pos + 3]
    return []

df["tokenized_dog"] = df["tokenized"].apply(lambda words: process(words, 'dog'))
df["tokenized_cat"] = df["tokenized"].apply(lambda words: process(words, 'cat'))

print(df[["tokenized_dog", "tokenized_cat"]])

Result:

                tokenized_dog             tokenized_cat
0  [apple, hi, dog, boy, why]                        []
1      [girl, cat, dog, 2, 3]  [hey, girl, cat, dog, 2]
2                          []                        []
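As a sketch of the functools.partial variant mentioned above (same process function, same sample data), partial binds the search keyword so apply() only has to pass in each row's list:

```python
from functools import partial

import pandas as pd

df = pd.DataFrame({
    "tokenized": [
        ['apple', 'hi', 'dog', 'boy', 'why', 'other'],
        ['table', 'hey', 'girl', 'cat', 'dog', '2', '3'],
    ]
})

def process(words, search):
    # return the match plus up to 2 words on each side, or [] if absent
    if search in words:
        pos = words.index(search)
        return words[max(pos - 2, 0):pos + 3]
    return []

# partial() pre-binds search='dog'; apply() then calls process(row_value)
df["tokenized_dog"] = df["tokenized"].apply(partial(process, search='dog'))
print(df["tokenized_dog"].tolist())
```

This is equivalent to the lambda version but avoids defining a throwaway function inline when you reuse the same binding in several places.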

Another way to do it is using apply with a lambda, like:

lambda x: [i for ix, i in enumerate(x) if ix in range([idx for idx, it in enumerate(x) if it == 'dog'][0] - 2, [idx for idx, it in enumerate(x) if it == 'dog'][0] + 3)]

But it's computationally expensive, error prone, and probably needlessly complex.
