After tokenizing a column, get 2 words before and after a specific word
I just tokenized a column in a dataframe using nltk.word_tokenize. The column now looks like this:
df.tokenized
> 0 [apple, hi, dog, boy, why...]
> 1 [table, hey, girl, cat, dog, 2, 3...
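For each row, I need the 2 words before the word "dog" and the 2 words after it. I want to put the result in another column of the same dataframe. My expected output looks something like this: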
df.tokenized_part2
> 0 [apple, hi, dog, boy, why]
> 1 [girl, cat, dog, 2, 3]
So I need to create this column, tokenized_part2.
In case you need this information: the dtype of tokenized is object.
Does anyone know how to do this?
You can use apply() to run a function on each cell in the column. The function can find the position of dog in the list and return the slice [pos-2:pos+3].
import pandas as pd

df = pd.DataFrame({
    "tokenized": [
        ['apple', 'hi', 'dog', 'boy', 'why', 'other'],
        ['table', 'hey', 'girl', 'cat', 'dog', '2', '3'],
        ['A', 'B', 'C'],
    ]
})

def process(words):
    # Return up to 2 tokens on each side of the first 'dog', or [] if it is absent.
    if 'dog' in words:
        pos = words.index('dog')
        # max() keeps a negative start from wrapping around to the end of the list
        return words[max(pos - 2, 0):pos + 3]
    else:
        return []

df["tokenized_2"] = df["tokenized"].apply(process)

print(df)
Result:
                            tokenized                 tokenized_2
0   [apple, hi, dog, boy, why, other]  [apple, hi, dog, boy, why]
1  [table, hey, girl, cat, dog, 2, 3]      [girl, cat, dog, 2, 3]
2                           [A, B, C]                          []
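One detail worth checking with a slice of the form [pos-2:pos+3]: when the search word sits at index 0 or 1, the start becomes negative, and Python counts a negative start from the end of the list. A minimal sketch of the effect, and of clamping the start with max(); the list `words` below is made-up illustration data, not from the question:

```python
# Hypothetical row in which 'dog' is the very first token.
words = ['dog', 'boy', 'why', 'other']
pos = words.index('dog')  # 0

# Unclamped: the start is -2, which Python counts from the end of the list.
unclamped = words[pos - 2:pos + 3]

# Clamped: a start below 0 is replaced by 0, so the window begins at the list start.
clamped = words[max(pos - 2, 0):pos + 3]

print(unclamped)  # ['why']
print(clamped)    # ['dog', 'boy', 'why']
```

The end of the slice needs no such guard, because Python silently truncates a slice end that runs past the list.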
EDIT:
To make it more general, the function can take dog (or any other word) as an argument, which you then supply through a lambda (or partial).
import pandas as pd

df = pd.DataFrame({
    "tokenized": [
        ['apple', 'hi', 'dog', 'boy', 'why', 'other'],
        ['table', 'hey', 'girl', 'cat', 'dog', '2', '3'],
        ['A', 'B', 'C'],
    ]
})

def process(words, search):
    # Return up to 2 tokens on each side of the first `search`, or [] if it is absent.
    if search in words:
        pos = words.index(search)
        # max() keeps a negative start from wrapping around to the end of the list
        return words[max(pos - 2, 0):pos + 3]
    else:
        return []

df["tokenized_dog"] = df["tokenized"].apply(lambda words: process(words, 'dog'))
df["tokenized_cat"] = df["tokenized"].apply(lambda words: process(words, 'cat'))

print(df[["tokenized_dog", "tokenized_cat"]])
Result:
                tokenized_dog             tokenized_cat
0  [apple, hi, dog, boy, why]                        []
1      [girl, cat, dog, 2, 3]  [hey, girl, cat, dog, 2]
2                          []                        []
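The answer mentions partial as an alternative to lambda but only shows the lambda form. A minimal sketch of the partial variant, assuming the same two-argument process(words, search) function and sample data as above (the start of the slice is clamped at 0 here as a precaution):

```python
from functools import partial

import pandas as pd

df = pd.DataFrame({
    "tokenized": [
        ['apple', 'hi', 'dog', 'boy', 'why', 'other'],
        ['table', 'hey', 'girl', 'cat', 'dog', '2', '3'],
        ['A', 'B', 'C'],
    ]
})

def process(words, search):
    # Return up to 2 tokens on each side of the first `search`, or [] if absent.
    if search in words:
        pos = words.index(search)
        return words[max(pos - 2, 0):pos + 3]
    return []

# partial() fixes the `search` argument, leaving a one-argument callable for apply().
df["tokenized_dog"] = df["tokenized"].apply(partial(process, search='dog'))

print(df["tokenized_dog"].tolist())
# [['apple', 'hi', 'dog', 'boy', 'why'], ['girl', 'cat', 'dog', '2', '3'], []]
```

Series.apply also forwards extra keyword arguments to the function, so df["tokenized"].apply(process, search='dog') should behave the same way without either wrapper.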
One approach is to use apply with a lambda such as
lambda x: [i for ix, i in enumerate(x) if ix in range([idx for idx, it in enumerate(x) if it == 'dog'][0] - 2, [idx for idx, it in enumerate(x) if it == 'dog'][0] + 3)]
But it is computationally expensive, error-prone, and probably unnecessarily complex.
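A lambda of this kind rescans the list for every element and has no fallback when the word is missing. As a rough sketch of a simpler alternative, a small named helper can do one lookup and one slice. The names `window` and `n` are introduced here for illustration only and do not come from the original answers:

```python
def window(words, search, n=2):
    """Return up to n tokens on each side of the first occurrence of `search`."""
    try:
        pos = words.index(search)  # one scan of the list
    except ValueError:
        return []  # `search` does not occur in this row
    # Clamp the start at 0 so a negative index cannot wrap around to the list end.
    return words[max(pos - n, 0):pos + n + 1]

print(window(['table', 'hey', 'girl', 'cat', 'dog', '2', '3'], 'dog'))
# ['girl', 'cat', 'dog', '2', '3']
print(window(['dog', 'boy'], 'dog'))
# ['dog', 'boy']
print(window(['A', 'B', 'C'], 'dog'))
# []
```

On a dataframe this could be used as df["tokenized"].apply(lambda words: window(words, 'dog')). Note that it only considers the first occurrence of the word in each row.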