Forming bigrams of words in a pandas DataFrame

Question

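I have been trying to convert a pandas DataFrame column of tokenized words into bigrams, without success. I have tried several snippets, but I keep getting either error messages or strange results. I only started using Python about 2 weeks ago and I am really struggling. Any help would be appreciated. Thanks.

This is what I have tried so far.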

from nltk.util import ngrams

generic_tweets['bigrams'] = generic_tweets['tweet'].apply(lambda row: list(map(lambda x:ngrams(x,2), row)))   
generic_tweets['bigrams'].head()

where

generic_tweets['tweet'].head() 

0         [awww, thats, bummer, shoulda, got, david, car...
1         [upset, that, he, cant, update, his, facebook,...
2         [dived, many, time, ball, managed, save, rest,...
3            [whole, body, feel, itchy, like, it, on, fire]
4         [no, it, not, behaving, at, all, im, mad, why,...
5                                        [not, whole, crew]
6                                               [need, hug]

What I want is

0         [(awww, thats), (thats, bummer), (bummer, shoulda)...
1         [(upset, that), (that, he), (he, cant), (cant, update)...
2         [(dived, many), (many, time), (time, ball), (ball, managed)...

But what I get is

0    [<generator object ngrams at 0x000002A38014B84...
1    [<generator object ngrams at 0x000002A30BA0AB1...
2    [<generator object ngrams at 0x000002A3A9182B8...
3    [<generator object ngrams at 0x000002A3A918713...
4    [<generator object ngrams at 0x000002A3A91874F...
Name: bigrams, dtype: object

Answer 1

The reason for this output lies in the body of the lambda function you are applying:

generic_tweets['bigrams'] = generic_tweets['tweet'].apply(lambda row: list(map(lambda x:ngrams(x,2), row)))

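I believe that instead of mapping ngrams(x, 2) over each element of the row, you should call list(ngrams(row, 2)) on the row itself. This gets rid of the generators you are seeing in the output and gives you n-grams at the word level rather than the letter level: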

generic_tweets['bigrams'] = generic_tweets['tweet'].apply(lambda row: list(ngrams(row, 2)))

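Note also that if you leave out list, the values stored in the DataFrame will be the raw generator objects returned by ngrams rather than the bigrams themselves.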
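To make the word-level versus letter-level difference concrete, here is a minimal sketch that avoids NLTK entirely. The helper name `pairs` is hypothetical; it is assumed to behave like `list(ngrams(seq, 2))` with no padding, which is simply each element paired with its successor.

```python
def pairs(seq):
    """Return adjacent pairs from a sequence, like list(ngrams(seq, 2))."""
    return list(zip(seq, seq[1:]))

row = ["need", "hug"]  # one tokenized tweet, as in the question

# Calling the helper on each word pairs up letters, not words.
per_word = [pairs(word) for word in row]
print(per_word)

# Calling it once on the whole row pairs up words.
per_row = pairs(row)
print(per_row)  # [('need', 'hug')]
```

The original attempt corresponds to the first call pattern (one call per word), while the fix corresponds to the second (one call per row).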

Answer 2

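If your pandas Series does not hold arrays (that is, each row is a plain string rather than a list of tokens), use the following to get the bigrams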

import nltk

# Split each string on single spaces, then pair up adjacent tokens
generic_tweets['bigrams'] = generic_tweets['tweet'].apply(lambda row: list(nltk.bigrams(row.split(' '))))

This is similar to

list(nltk.bigrams(['abc', 'def', 'ghi']))

The output will be

[('abc', 'def'), ('def', 'ghi')]
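One caveat when the rows are raw strings: `split(' ')` produces empty tokens wherever two spaces occur in a row, and those empty strings would then show up inside the bigrams. The sketch below illustrates this with plain Python; `word_bigrams` is a hypothetical helper, and it uses `zip` as a stand-in that is assumed to yield the same pairs as `nltk.bigrams`.

```python
def word_bigrams(text):
    """Split on any run of whitespace, then pair adjacent tokens."""
    tokens = text.split()  # no argument: collapses repeated spaces
    return list(zip(tokens, tokens[1:]))

tweet = "need  hug"  # note the two spaces

# Splitting on a single space keeps an empty token between the words.
naive_tokens = tweet.split(" ")
print(naive_tokens)  # ['need', '', 'hug']

# Splitting on whitespace runs gives the expected single bigram.
print(word_bigrams(tweet))  # [('need', 'hug')]
```

If the tweets are known to be separated by exactly one space, the two approaches should give the same result.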
