What is the most efficient way to add a computed value to a column of a pandas DataFrame?
I have one data frame df consisting of two columns (a word, and the meaning/definition of that word). I want to use a collections.Counter object for each word's definition and count the frequency of the words occurring in that definition, in the most Pythonic way possible.

The traditional approach would be to iterate over the data frame using the iterrows() method and do the computation.
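For reference, the iterrows() baseline described above might look like the following sketch (the column names and the 'word_freq' result column are assumptions based on the sample output):

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'word': ['Array'],
                   'definition': ['collection of homogeneous datatype']})

# Baseline: iterate row by row, counting the words in each definition.
freqs = []
for _, row in df.iterrows():
    freqs.append(Counter(row['definition'].split()))

# Attach the per-row counts as a new column.
df['word_freq'] = freqs
print(df['word_freq'].iloc[0])
```

This works, but row-wise iteration is the slowest way to process a DataFrame; the answers below avoid it.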
Sample output

Word     Meaning                               Word Freq
Array    collection of homogeneous datatype    {'collection': 1, 'of': 1, ...}
I would take advantage of pandas' str accessor methods and do this:
from collections import Counter
Counter(df.definition.str.cat(sep=' ').split())
Some test data:
df = pd.DataFrame({'word': ['some', 'words', 'yes'], 'definition': ['this is a definition', 'another definition', 'one final definition']})
print(df)
definition word
0 this is a definition some
1 another definition words
2 one final definition yes
And then concatenating, splitting on spaces, and using Counter:
Counter(df.definition.str.cat(sep=' ').split())
Counter({'a': 1,
'another': 1,
'definition': 3,
'final': 1,
'is': 1,
'one': 1,
'this': 1})
Assuming that df has two columns, 'word' and 'definition', you can simply use the .map method with Counter on the definition series after splitting on spaces, then sum the result.
from collections import Counter
def_counts = df.definition.map(lambda x: Counter(x.split()))
all_counts = def_counts.sum()
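If the goal is to store each row's counts as a new column (as in the sample output above) rather than only the aggregate, the per-row Counters can be assigned back directly; the column name 'word_freq' here is my assumption:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'word': ['some', 'words', 'yes'],
                   'definition': ['this is a definition',
                                  'another definition',
                                  'one final definition']})

# One Counter per row, stored alongside the original columns.
df['word_freq'] = df['definition'].map(lambda x: Counter(x.split()))

# Summing the series of Counters merges them into one overall count.
total = df['word_freq'].sum()
print(total['definition'])  # 3
```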
I intend for this answer to be useful but not the chosen answer. In fact, I'm only making an argument for Counter and @TedPetrou's answer.
Create a large example of random words:

import numpy as np
import pandas as pd
from string import ascii_lowercase

# 100,000 random five-letter "words", grouped into 10-word definitions
a = np.random.choice(list(ascii_lowercase), size=(100000, 5))
definitions = pd.Series(
    pd.DataFrame(a).sum(1).values.reshape(-1, 10).tolist()).str.join(' ')
definitions.head()
0 hmwnp okuat sexzr jsxhh bdoyc kdbas nkoov moek...
1 iiuot qnlgs xrmss jfwvw pmogp vkrvl bygit qqon...
2 ftcap ihuto ldxwo bvvch zuwpp bdagx okhtt lqmy...
3 uwmcs nhmxa qeomd ptlbg kggxr hpclc kwnix rlon...
4 npncx lnors gyomb dllsv hyayw xdynr ctwvh nsib...
dtype: object
Timing

Counter is on the order of 1000 times faster than the fastest alternative I could think of.
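The benchmark code itself isn't shown; a sketch of how such a comparison could be run with timeit (using an iterrows() loop as my assumed baseline, on a smaller sample for speed) might look like this:

```python
import timeit
from collections import Counter
from string import ascii_lowercase

import numpy as np
import pandas as pd

# Smaller version of the random-word data above.
a = np.random.choice(list(ascii_lowercase), size=(10000, 5))
definitions = pd.Series(
    pd.DataFrame(a).sum(1).values.reshape(-1, 10).tolist()).str.join(' ')
df = pd.DataFrame({'definition': definitions})

def with_counter():
    # Vectorized: one big string, one split, one Counter.
    return Counter(df.definition.str.cat(sep=' ').split())

def with_iterrows():
    # Row-by-row baseline.
    counts = Counter()
    for _, row in df.iterrows():
        counts.update(row['definition'].split())
    return counts

print('Counter:', timeit.timeit(with_counter, number=10))
print('iterrows:', timeit.timeit(with_iterrows, number=10))
```

Both functions produce identical counts, so the comparison isolates the iteration strategy.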