Counting language frequencies in a Pandas DataFrame using langdetect

Question

I want to find the frequency of each language in a tweet dataset. Ultimately I only want to use the tweets that are in English, but I'd also like to know how often the other languages occur.

I've detected the language of each tweet in my dataset using langdetect, and now I want to count the frequency of each language. This is my code for detecting the language:

from langdetect import detect
import pandas as pd

data_path = "./output_1.csv"
df = pd.read_csv(data_path, index_col=0)

for index, row in df.iterrows():
    lang = detect(row['text'])  # detect once per row and reuse the result
    print(lang)
    if lang == "en":
        print(row['text'])

I wanted to use the list method count to count the frequencies:

# using the list
i = ['en', 'fr', 'es', 'it', 'cs', 'pt']
d = {x: i.count(x) for x in i}
print(d)

How do I use count on the data that I got from langdetect?

Answer 1

To create a separate column containing the detected language, you could do:

df['language'] = df['text'].apply(detect)

Then, to count the frequencies, you could do:

pd.DataFrame(df.groupby('language').text.count().sort_values(ascending=False))
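
As an alternative sketch, value_counts gives the same per-language counts already sorted from most to least frequent, and a boolean mask keeps only the English tweets. The small DataFrame here is invented sample data with the language column already filled in, so the snippet runs without langdetect.

```python
import pandas as pd

# Made-up sample standing in for a DataFrame whose 'language' column is already filled in.
df = pd.DataFrame({
    "text": ["hello there", "bonjour", "good morning", "hola", "see you soon"],
    "language": ["en", "fr", "en", "es", "en"],
})

# Number of tweets per language, most frequent first.
counts = df["language"].value_counts()

# The same counts expressed as a share of all tweets.
shares = df["language"].value_counts(normalize=True)

# A plain dict, like the one built with list.count in the question.
freq = counts.to_dict()

# Keep only the English tweets for further work.
english = df[df["language"] == "en"]

print(counts)
```

If you need a DataFrame rather than a Series, counts.to_frame() converts the result.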

Answer 1: 2019-12-09 09:38:30