After tokenizing a column, get 2 words before and after a specific word

I just tokenized a column in a dataframe using nltk.word_tokenize. This column now looks like

df.tokenized
> 0     [apple, hi, dog, boy, why...]
> 1     [table, hey, girl, cat, dog, 2, 3...

For each row, I need to get 2 words before and 2 words after the word "dog". I want to put it all in another column in the same dataframe. The output I expect is something like:

df.tokenized_part2
> 0     [apple, hi, dog, boy, why]
> 1     [girl, cat, dog, 2, 3]

So I need to create this column tokenized_part2.

If you need this info: the dtype of the tokenized column is object.

Does someone know how to do that?

You can use apply() to run a function on every cell in the column. This function can find the position of "dog" in the list and return the slice words[pos-2:pos+3].

import pandas as pd

df = pd.DataFrame({
    "tokenized": [
        ['apple', 'hi', 'dog', 'boy', 'why', 'other'],
        ['table', 'hey', 'girl', 'cat', 'dog', '2', '3'],
        ['A', 'B', 'C'],
    ]
})

def process(words):
    if 'dog' in words:
        pos = words.index('dog')
        # clamp the start at 0 so a match near the beginning of the list
        # doesn't produce a negative index, which would wrap around
        return words[max(pos - 2, 0):pos + 3]
    return []

df["tokenized_2"] = df["tokenized"].apply(process)

print(df)

Result:

                            tokenized                 tokenized_2
0   [apple, hi, dog, boy, why, other]  [apple, hi, dog, boy, why]
1  [table, hey, girl, cat, dog, 2, 3]      [girl, cat, dog, 2, 3]
2                           [A, B, C]                          []

EDIT:

To make it more universal, the function could take "dog" (or any other word) as a parameter; you would then call it with a lambda (or functools.partial).

import pandas as pd

df = pd.DataFrame({
    "tokenized": [
        ['apple', 'hi', 'dog', 'boy', 'why', 'other'],
        ['table', 'hey', 'girl', 'cat', 'dog', '2', '3'],
        ['A', 'B', 'C'],
    ]
})

def process(words, search):
    if search in words:
        pos = words.index(search)
        # clamp the start at 0 so a match near the beginning of the list
        # doesn't produce a negative index, which would wrap around
        return words[max(pos - 2, 0):pos + 3]
    return []

df["tokenized_dog"] = df["tokenized"].apply(lambda words: process(words, 'dog'))
df["tokenized_cat"] = df["tokenized"].apply(lambda words: process(words, 'cat'))

print(df[["tokenized_dog", "tokenized_cat"]])

Result:

                tokenized_dog             tokenized_cat
0  [apple, hi, dog, boy, why]                        []
1      [girl, cat, dog, 2, 3]  [hey, girl, cat, dog, 2]
2                          []                        []
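As a sketch of the functools.partial variant mentioned above (same process function, same sample data), partial binds the search keyword so apply() only has to pass in each row's list:

```python
from functools import partial

import pandas as pd

df = pd.DataFrame({
    "tokenized": [
        ['apple', 'hi', 'dog', 'boy', 'why', 'other'],
        ['table', 'hey', 'girl', 'cat', 'dog', '2', '3'],
    ]
})

def process(words, search):
    # return the match plus up to 2 words on each side, or [] if absent
    if search in words:
        pos = words.index(search)
        return words[max(pos - 2, 0):pos + 3]
    return []

# partial() pre-binds search='dog'; apply() then calls process(row_value)
df["tokenized_dog"] = df["tokenized"].apply(partial(process, search='dog'))
print(df["tokenized_dog"].tolist())
```

This is equivalent to the lambda version but avoids defining a throwaway function inline when you reuse the same binding in several places.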

Another way to do it is using apply with a lambda, like:

lambda x: [i for ix, i in enumerate(x) if ix in range([idx for idx, it in enumerate(x) if it == 'dog'][0] - 2, [idx for idx, it in enumerate(x) if it == 'dog'][0] + 3)]

But it's computationally expensive, error prone, and probably needlessly complex.
