Iterate over rows in pandas and count unique hashtags

Question

I have a csv file containing thousands of tweets. Let's say the data is as follows:

Tweet_id   hashtags_in_the_tweet

Tweet_1    [trump, clinton]
Tweet_2    [trump, sanders]
Tweet_3    [politics, news]
Tweet_4    [news, trump]
Tweet_5    [flower, day]
Tweet_6    [trump, impeach]

As you can see, the data contains the tweet_id and the hashtags in each tweet. What I want to do is go through all the rows and end up with something like a value count:

Hashtag    count
trump      4
news       2
clinton    1
sanders    1
politics   1
flower     1
day        1
impeach    1

Considering that the csv file contains 1 million rows (1 million tweets), what is the best way to do this?

Answer 1

`Counter` + `chain`

Pandas methods aren't designed for series of lists. No vectorised approach exists. One way is to use collections.Counter from the standard library:

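import pandas as pd
from collections import Counter
from itertools import chain

# flatten the lists of hashtags and count every occurrence
c = Counter(chain.from_iterable(df['hashtags_in_the_tweet'].values.tolist()))

# most_common() returns (hashtag, count) pairs sorted by descending count
res = pd.DataFrame(c.most_common(), columns=['Hashtag', 'count'])

print(res)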

    Hashtag  count
0     trump      4
1      news      2
2   clinton      1
3   sanders      1
4  politics      1
5    flower      1
6       day      1
7   impeach      1

Setup

df = pd.DataFrame({'Tweet_id': [f'Tweet_{i}' for i in range(1, 7)],
                   'hashtags_in_the_tweet': [['trump', 'clinton'], ['trump', 'sanders'], ['politics', 'news'],
                                             ['news', 'trump'], ['flower', 'day'], ['trump', 'impeach']]})

print(df)

  Tweet_id hashtags_in_the_tweet
0  Tweet_1      [trump, clinton]
1  Tweet_2      [trump, sanders]
2  Tweet_3      [politics, news]
3  Tweet_4         [news, trump]
4  Tweet_5         [flower, day]
5  Tweet_6      [trump, impeach]
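
As a complement to the Counter approach: pandas 0.25 and later provide Series.explode, which puts every list element on its own row so that value_counts can be applied directly. The sketch below reuses the setup data from this answer; the names counts and res_explode are introduced here only for illustration, and whether this is faster than Counter on a million rows has not been measured here.

```python
import pandas as pd

df = pd.DataFrame({
    'Tweet_id': [f'Tweet_{i}' for i in range(1, 7)],
    'hashtags_in_the_tweet': [['trump', 'clinton'], ['trump', 'sanders'],
                              ['politics', 'news'], ['news', 'trump'],
                              ['flower', 'day'], ['trump', 'impeach']],
})

# explode() gives every list element its own row; value_counts() then tallies them
counts = df['hashtags_in_the_tweet'].explode().value_counts()

# turn the counted Series into a two-column frame
res_explode = counts.rename_axis('Hashtag').reset_index(name='count')

print(res_explode)
```

Rows whose list is empty become NaN after explode(), and value_counts() drops NaN by default, so they do not show up in the result.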

Answer 2

One alternative is to flatten the lists with np.hstack, convert the result to a pd.Series, then use value_counts.

import numpy as np
import pandas as pd

# store the result in res so the original df is not overwritten
res = pd.Series(np.hstack(df['hashtags_in_the_tweet'])).value_counts().to_frame('count')

res = res.rename_axis('Hashtag').reset_index()

print(res)

    Hashtag  count
0     trump      4
1      news      2
2   sanders      1
3   impeach      1
4   clinton      1
5    flower      1
6  politics      1
7       day      1

Answer 3

Using np.unique

v, c = np.unique(np.concatenate(df.hashtags_in_the_tweet.values), return_counts=True)

# pd.DataFrame({'Hashtag': v, 'Count': c})

Even though the problem looks different, it is still related to the unnesting problem (unnesting below is a user-defined helper, not a pandas function):

unnesting(df,['hashtags_in_the_tweet'])['hashtags_in_the_tweet'].value_counts()
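
The unnesting helper is not defined in this answer. The following is one possible sketch of such a helper, assuming it takes a DataFrame and a list of list-valued column names and returns one row per list element. The body is an assumption made for illustration and may differ from the function the answerer had in mind.

```python
import numpy as np
import pandas as pd

def unnesting(df, explode):
    """Hypothetical helper: one row per element of the list columns in `explode`."""
    # repeat each original index label once per element of its list
    idx = df.index.repeat(df[explode[0]].str.len())
    # flatten every listed column and place the flattened columns side by side
    flat = pd.concat(
        [pd.DataFrame({col: np.concatenate(df[col].values)}) for col in explode],
        axis=1,
    )
    flat.index = idx
    # re-attach the untouched columns by index
    return flat.join(df.drop(columns=explode), how='left')

df = pd.DataFrame({
    'Tweet_id': [f'Tweet_{i}' for i in range(1, 7)],
    'hashtags_in_the_tweet': [['trump', 'clinton'], ['trump', 'sanders'],
                              ['politics', 'news'], ['news', 'trump'],
                              ['flower', 'day'], ['trump', 'impeach']],
})

unnested_counts = unnesting(df, ['hashtags_in_the_tweet'])['hashtags_in_the_tweet'].value_counts()
print(unnested_counts)
```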

Answer 4

Sounds like you want something like collections.Counter, which you might use like this:

from collections import Counter
from functools import reduce 
import operator
import pandas as pd 

fold = lambda f, acc, xs: reduce(f, xs, acc)
df = pd.DataFrame({'Tweet_id': ['Tweet_%s'%i for i in range(1, 7)],
                   'hashtags':[['t', 'c'], ['t', 's'], 
                               ['p','n'], ['n', 't'], 
                               ['f', 'd'], ['t', 'i', 'c']]})
fold(operator.add, Counter(), [Counter(x) for x in df.hashtags.values])

which gives you:

Counter({'c': 2, 'd': 1, 'f': 1, 'i': 1, 'n': 2, 'p': 1, 's': 1, 't': 4})

Edit: I think jpp's answer will be quite a bit faster. If time really is a constraint, I would avoid reading the data into a DataFrame in the first place. I don't know what the raw csv file looks like, but reading it as a text file line by line, ignoring the first token, and feeding the rest into a Counter may end up being quite a bit faster.
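
The line-by-line idea can be sketched as follows. The file layout is an assumption, since the raw csv is not shown in the question: a header row, then a tweet id followed by a quoted, bracketed hashtag string. The function name count_hashtags and the in-memory sample are hypothetical, and no timing has been done here.

```python
import csv
import io
from collections import Counter

# Assumed layout (not confirmed by the question): header row, then
# an id column followed by a bracketed, comma-separated hashtag string.
sample_csv = io.StringIO(
    'Tweet_id,hashtags_in_the_tweet\n'
    'Tweet_1,"[trump, clinton]"\n'
    'Tweet_2,"[trump, sanders]"\n'
    'Tweet_3,"[politics, news]"\n'
    'Tweet_4,"[news, trump]"\n'
    'Tweet_5,"[flower, day]"\n'
    'Tweet_6,"[trump, impeach]"\n'
)

def count_hashtags(lines):
    """Count hashtags row by row without building a DataFrame."""
    counts = Counter()
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 2:
            continue  # no hashtag field on this row
        # drop the surrounding brackets, then split on commas
        tags = (t.strip() for t in row[1].strip().strip('[]').split(','))
        counts.update(t for t in tags if t)  # ignore empty entries such as '[]'
    return counts

stream_counts = count_hashtags(sample_csv)
print(stream_counts.most_common())
```

For a real file, the same function could be called on an open file handle, for example `with open('tweets.csv', newline='') as f: count_hashtags(f)`, where tweets.csv is a placeholder name.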

Answer 5

So all the answers above were helpful, but didn't actually work! The problems with my data are:

1) The value of the 'hashtags' field for some tweets is nan or [].
2) The value of the 'hashtags' field in the dataframe is one string! The answers above assumed that the values are lists of hashtags, e.g. ['trump', 'clinton'], while each one is actually only a str: '[trump, clinton]'.

So I added some lines to @jpp's answer:

import pandas as pd
from collections import Counter
from itertools import chain

# drop rows whose hashtags value is nan or the string '[]'
df = df.dropna(subset=['hashtags'])
df = df[df['hashtags'] != '[]'].copy()

# convert each hashtags value from a str such as '[trump, clinton]' to a list
df['hashtags'] = df['hashtags'].str.strip('[]').str.split(', ')

c = Counter(chain.from_iterable(df['hashtags'].values.tolist()))

res = pd.DataFrame(c.most_common(), columns=['Hashtag', 'count'])

print(res)
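
To make the cleaning steps above easier to check, here is a small self-contained run on made-up data. The frame raw and its four rows are hypothetical and only mimic the described situation: string values, an empty '[]', and a nan.

```python
import numpy as np
import pandas as pd
from collections import Counter
from itertools import chain

# hypothetical sample mimicking the described data: strings, '[]', and NaN
raw = pd.DataFrame({
    'Tweet_id': ['Tweet_1', 'Tweet_2', 'Tweet_3', 'Tweet_4'],
    'hashtags': ['[trump, clinton]', '[]', np.nan, '[news, trump]'],
})

# keep only rows that hold at least one hashtag
clean = raw.dropna(subset=['hashtags'])
clean = clean[clean['hashtags'] != '[]']

# '[trump, clinton]' -> ['trump', 'clinton']
tag_lists = clean['hashtags'].str.strip('[]').str.split(', ')

sample_counts = Counter(chain.from_iterable(tag_lists))
print(sample_counts.most_common())
```

Splitting on ', ' relies on exactly one space after each comma; if the real strings are less regular, splitting on ',' and stripping each piece would be a safer variant.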

5 answers

Answer 1
2 2018-11-29 02:11:05

Answer 2
2 2018-11-29 02:27:04

Answer 3
2 2018-11-29 02:28:58

Answer 4
1 2018-11-29 02:20:00

Answer 5
0 accepted 2018-11-29 20:06:25
