What is the most efficient way to add a computed value to a column of a pandas DataFrame?
I have one data frame df consisting of two columns (a word, and the meaning/definition of that word). I want to use a collections.Counter object for each word's definition and count the frequency of the words occurring in that definition, in the most Pythonic way possible.

The traditional approach would be to iterate over the data frame using the iterrows() method and do the computation.
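For reference, the iterrows() baseline described above might look like the following sketch (the column names and the 'word_freq' result column are assumptions based on the sample output):

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'word': ['Array'],
                   'definition': ['collection of homogeneous datatype']})

# Baseline: iterate row by row, counting the words in each definition.
freqs = []
for _, row in df.iterrows():
    freqs.append(Counter(row['definition'].split()))

# Attach the per-row counts as a new column.
df['word_freq'] = freqs
print(df['word_freq'].iloc[0])
```

This works, but row-wise iteration is the slowest way to process a DataFrame; the answers below avoid it.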
Sample output

Word     Meaning                               Word Freq
Array    collection of homogeneous datatype    {'collection': 1, 'of': 1, ...}
I would take advantage of pandas' str accessor methods and do this:
from collections import Counter
Counter(df.definition.str.cat(sep=' ').split())
Some test data:
df = pd.DataFrame({'word': ['some', 'words', 'yes'], 'definition': ['this is a definition', 'another definition', 'one final definition']})
print(df)
definition word
0 this is a definition some
1 another definition words
2 one final definition yes
And then concatenating, splitting on spaces, and using Counter:
Counter(df.definition.str.cat(sep=' ').split())
Counter({'a': 1,
'another': 1,
'definition': 3,
'final': 1,
'is': 1,
'one': 1,
'this': 1})
Assuming that df has two columns, 'word' and 'definition', you can simply use the .map method with Counter on the definition series after splitting on spaces, then sum the result.
from collections import Counter
def_counts = df.definition.map(lambda x: Counter(x.split()))
all_counts = def_counts.sum()
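If the goal is to store each row's counts as a new column (as in the sample output above) rather than only the aggregate, the per-row Counters can be assigned back directly; the column name 'word_freq' here is my assumption:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'word': ['some', 'words', 'yes'],
                   'definition': ['this is a definition',
                                  'another definition',
                                  'one final definition']})

# One Counter per row, stored alongside the original columns.
df['word_freq'] = df['definition'].map(lambda x: Counter(x.split()))

# Summing the series of Counters merges them into one overall count.
total = df['word_freq'].sum()
print(total['definition'])  # 3
```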
I intend for this answer to be useful but not the chosen answer. In fact, I'm only making an argument for Counter and @TedPetrou's answer.
Create a large example of random words:

import numpy as np
import pandas as pd
from string import ascii_lowercase

# 100,000 random five-letter "words", grouped into 10-word definitions
a = np.random.choice(list(ascii_lowercase), size=(100000, 5))
definitions = pd.Series(
    pd.DataFrame(a).sum(1).values.reshape(-1, 10).tolist()).str.join(' ')
definitions.head()
0 hmwnp okuat sexzr jsxhh bdoyc kdbas nkoov moek...
1 iiuot qnlgs xrmss jfwvw pmogp vkrvl bygit qqon...
2 ftcap ihuto ldxwo bvvch zuwpp bdagx okhtt lqmy...
3 uwmcs nhmxa qeomd ptlbg kggxr hpclc kwnix rlon...
4 npncx lnors gyomb dllsv hyayw xdynr ctwvh nsib...
dtype: object
Timing

Counter is on the order of 1000 times faster than the fastest alternative I could think of.
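The benchmark code itself isn't shown; a sketch of how such a comparison could be run with timeit (using an iterrows() loop as my assumed baseline, on a smaller sample for speed) might look like this:

```python
import timeit
from collections import Counter
from string import ascii_lowercase

import numpy as np
import pandas as pd

# Smaller version of the random-word data above.
a = np.random.choice(list(ascii_lowercase), size=(10000, 5))
definitions = pd.Series(
    pd.DataFrame(a).sum(1).values.reshape(-1, 10).tolist()).str.join(' ')
df = pd.DataFrame({'definition': definitions})

def with_counter():
    # Vectorized: one big string, one split, one Counter.
    return Counter(df.definition.str.cat(sep=' ').split())

def with_iterrows():
    # Row-by-row baseline.
    counts = Counter()
    for _, row in df.iterrows():
        counts.update(row['definition'].split())
    return counts

print('Counter:', timeit.timeit(with_counter, number=10))
print('iterrows:', timeit.timeit(with_iterrows, number=10))
```

Both functions produce identical counts, so the comparison isolates the iteration strategy.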