I just tokenized a column in a dataframe using nltk.word_tokenize. This column now looks like
df.tokenized
> 0 [apple, hi, dog, boy, why...]
> 1 [table, hey, girl, cat, dog, 2, 3...
For each row, I need to get 2 words before and 2 words after the word "dog". I want to put it all in another column in the same dataframe. The output I expect is something like:
df.tokenized_part2
> 0 [apple, hi, dog, boy, why]
> 1 [girl, cat, dog, 2, 3]
So I need to create this column tokenized_part2.
In case it matters: the dtype of the tokenized column is object.
Does someone know how to do that?
You can use apply() to run a function on every cell in the column. The function can find the position of dog in the list and return the two words on either side of it with a slice such as words[pos-2:pos+3].
import pandas as pd

df = pd.DataFrame({
    "tokenized": [
        ['apple', 'hi', 'dog', 'boy', 'why', 'other'],
        ['table', 'hey', 'girl', 'cat', 'dog', '2', '3'],
        ['A', 'B', 'C'],
    ]
})

def process(words):
    # Return 'dog' with up to 2 words before and after it
    if 'dog' in words:
        pos = words.index('dog')
        # max() keeps the start from going negative when 'dog' is at index 0 or 1
        return words[max(pos-2, 0):pos+3]
    else:
        # No 'dog' in this row (return words instead to keep the full list)
        return []

df["tokenized_2"] = df["tokenized"].apply(process)
print(df)
Result:
                            tokenized                 tokenized_2
0   [apple, hi, dog, boy, why, other]  [apple, hi, dog, boy, why]
1  [table, hey, girl, cat, dog, 2, 3]      [girl, cat, dog, 2, 3]
2                           [A, B, C]                          []
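One edge case is worth checking with this kind of slice: if the word sits at index 0 or 1, pos-2 is negative, and a negative slice start counts from the end of the list, so the window can come back empty. Below is a minimal sketch of clamping the start. The helper name window_around and its size parameter are made up for illustration and are not part of the original answer.

```python
def window_around(words, search, size=2):
    """Return up to `size` tokens on each side of the first `search` hit."""
    if search not in words:
        return []
    pos = words.index(search)
    start = max(pos - size, 0)  # clamp so the slice start never goes negative
    return words[start:pos + size + 1]

tokens = ['hi', 'dog', 'boy', 'why', 'other']
print(tokens[1 - 2:1 + 3])           # unclamped slice for pos=1: []
print(window_around(tokens, 'dog'))  # ['hi', 'dog', 'boy', 'why']
```

Slicing past the end of a list is harmless in Python, so only the start needs clamping.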
EDIT:
To make it more general, the function can take dog (or any other word) as a parameter. You would then run it with a lambda (or functools.partial).
import pandas as pd

df = pd.DataFrame({
    "tokenized": [
        ['apple', 'hi', 'dog', 'boy', 'why', 'other'],
        ['table', 'hey', 'girl', 'cat', 'dog', '2', '3'],
        ['A', 'B', 'C'],
    ]
})

def process(words, search):
    # Return the search word with up to 2 words before and after it
    if search in words:
        pos = words.index(search)
        # max() keeps the start from going negative when the word is at index 0 or 1
        return words[max(pos-2, 0):pos+3]
    else:
        # Word not found in this row (return words instead to keep the full list)
        return []

df["tokenized_dog"] = df["tokenized"].apply(lambda words: process(words, 'dog'))
df["tokenized_cat"] = df["tokenized"].apply(lambda words: process(words, 'cat'))
print(df[["tokenized_dog", "tokenized_cat"]])
Result:
                tokenized_dog             tokenized_cat
0  [apple, hi, dog, boy, why]                        []
1      [girl, cat, dog, 2, 3]  [hey, girl, cat, dog, 2]
2                          []                        []
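The partial variant mentioned above is not shown in the answer, so here is a rough sketch of how it might look. It assumes the same process(words, search) function and sample data. The name process_dog is only illustrative, and the max() clamp is a small safety addition for words near the start of a list.

```python
from functools import partial

import pandas as pd

def process(words, search):
    """Return `search` with up to 2 tokens before and after it."""
    if search not in words:
        return []
    pos = words.index(search)
    return words[max(pos - 2, 0):pos + 3]

df = pd.DataFrame({
    "tokenized": [
        ['apple', 'hi', 'dog', 'boy', 'why', 'other'],
        ['table', 'hey', 'girl', 'cat', 'dog', '2', '3'],
        ['A', 'B', 'C'],
    ]
})

# partial() pre-fills the search argument, leaving a one-argument function
process_dog = partial(process, search='dog')
df["tokenized_dog"] = df["tokenized"].apply(process_dog)

# Series.apply also forwards extra keyword arguments to the function
df["tokenized_cat"] = df["tokenized"].apply(process, search='cat')

print(df[["tokenized_dog", "tokenized_cat"]])
```

Both forms should give the same columns as the lambda version. Which one to use is mostly a matter of taste.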
Another way to do it is to use apply with a lambda like
lambda x: [i for ix, i in enumerate(x) if ix in range([idx for idx, it in enumerate(x) if it == 'dog'][0] - 2, [idx for idx, it in enumerate(x) if it == 'dog'][0] + 3)]
But it's computationally expensive, error prone (it raises an IndexError for rows that do not contain 'dog'), and probably needlessly complex.