
Counting language frequencies in a pandas data frame using langdetect

I want to find the frequency of different languages in a tweet dataset. I eventually only want to use the tweets that are in English, but I want to find out the frequencies of the other languages as well.

I've detected the language of the tweets in my dataset using langdetect, and now I want to count the frequency of each language. This is my code for detecting the language:

from langdetect import detect
import pandas as pd

data_path = "./output_1.csv"
df = pd.read_csv(data_path, index_col=0)

for index, row in df.iterrows():
    lang = detect(row['text'])  # detect once per row and reuse the result
    print(lang)
    if lang == "en":
        print(row['text'])

I wanted to use the list method count to count the frequencies:

# using the list
i = ['en', 'fr', 'es', 'it', 'cs', 'pt']
d = {x: i.count(x) for x in i}
print(d)

How do I use count on the data that I got using langdetect?

To create a separate column containing the language, you could do:

df['language'] = df['text'].apply(detect)

Then to count the frequency, you could do:

pd.DataFrame(df.groupby('language').text.count().sort_values(ascending=False))
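As an alternative sketch, not taken from the answer above: value_counts on the language column should give the same counts when no text is missing, already sorted from most to least frequent, and a boolean mask keeps only the English tweets. The sample rows and language labels below are made up for illustration and stand in for the column produced by detect.

```python
import pandas as pd

# Hypothetical sample data standing in for the tweets and the detected 'language' column.
df = pd.DataFrame({
    "text": ["hello world", "bonjour le monde", "good morning", "hola mundo", "see you soon"],
    "language": ["en", "fr", "en", "es", "en"],
})

# Count how many tweets fall into each language, most frequent first.
counts = df["language"].value_counts()
print(counts)

# The same counts as a plain dict, similar to the dict built with list.count in the question.
print(counts.to_dict())

# Keep only the English tweets for further processing.
english = df[df["language"] == "en"]
print(len(english))  # 3
```

Passing normalize=True to value_counts returns relative frequencies instead of raw counts, which may be closer to what "frequency" means in the question.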


 