Dropping non-English text with langdetect

I am trying to use langdetect to drop all the languages which are not English in my text.

from langdetect import detect

def det(x):
    try:
        language = detect(x)
    except Exception:
        # detection can fail (e.g. on empty strings); mark those rows as 'Other'
        language = 'Other'
    return language

df['langue'] = df['Tweet'].apply(det)
filtered_for_english = df.loc[df['langue'] == 'en']

The above code is what I have tried. It detects the language used in each tweet but does not drop the non-English tweets from my data frame.

The resulting data frame:

0        es
1        es
2        es
3        en
4        en
         ..
14272    en
14273    en
14274    en
14275    it
14276    en
Name: langue, Length: 14277, dtype: object

How can I fix this code?

This solution worked well for me.

from langdetect import detect

def detect_english(text):
    try:
        # True only when langdetect identifies the text as English
        return detect(text) == 'en'
    except Exception:
        # detection failures (e.g. empty or numeric-only strings) count as non-English
        return False

Apply it to the pandas dataframe like the following to eliminate the non-English rows.

df = df[df['text'].apply(detect_english)]

I had 5000 samples, and the above implementation removed the non-English ones and returned 4721 rows of English text.
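As a quick, self-contained sanity check, here is a minimal sketch of the same filter on a tiny dataframe (the sample tweets and the 'text' column below are made up for illustration):

import pandas as pd
from langdetect import detect

def detect_english(text):
    try:
        return detect(text) == 'en'
    except Exception:
        return False

# hypothetical sample data, for illustration only
df = pd.DataFrame({'text': ['Hello, how are you doing today?',
                            'Hola, ¿cómo estás hoy?',
                            'Bonjour tout le monde, comment allez-vous ?']})

print(len(df))   # 3 rows before filtering
df = df[df['text'].apply(detect_english)]
print(len(df))   # only the rows detected as English remain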

Note: a direct import in Colab never worked for me; I had to run !pip install langdetect first.
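For reference, a minimal setup sketch for a notebook environment such as Colab (seeding DetectorFactory is optional, but it makes langdetect's results reproducible between runs):

!pip install langdetect

from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0  # optional: deterministic detection results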
