Dropping non-English text with langdetect

I am trying to use langdetect to drop all the languages which are not English in my text.

from langdetect import detect

def det(x):
    try:
        language = detect(x)
    except Exception:
        # detection can fail (e.g. on empty strings); mark those rows as 'Other'
        language = 'Other'
    return language

df['langue'] = df['Tweet'].apply(det)
filtered_for_english = df.loc[df['langue'] == 'en']

The above code is what I have tried. It detects the language used in each tweet but does not drop the non-English tweets from my data frame.

The resulting data frame:

0        es
1        es
2        es
3        en
4        en
         ..
14272    en
14273    en
14274    en
14275    it
14276    en
Name: langue, Length: 14277, dtype: object

How can I fix this code?

This solution worked well for me.

from langdetect import detect

def detect_english(text):
    try:
        # True only when langdetect identifies the text as English
        return detect(text) == 'en'
    except Exception:
        # detection failures (e.g. empty or numeric-only strings) count as non-English
        return False

Apply it to the pandas dataframe like the following to eliminate the non-English rows.

df = df[df['text'].apply(detect_english)]

I had 5000 samples, and the above implementation removed the non-English ones and returned 4721 rows of English text.
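As a quick, self-contained sanity check, here is a minimal sketch of the same filter on a tiny dataframe (the sample tweets and the 'text' column below are made up for illustration):

import pandas as pd
from langdetect import detect

def detect_english(text):
    try:
        return detect(text) == 'en'
    except Exception:
        return False

# hypothetical sample data, for illustration only
df = pd.DataFrame({'text': ['Hello, how are you doing today?',
                            'Hola, ¿cómo estás hoy?',
                            'Bonjour tout le monde, comment allez-vous ?']})

print(len(df))   # 3 rows before filtering
df = df[df['text'].apply(detect_english)]
print(len(df))   # only the rows detected as English remain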

Note: a direct import in Colab never worked for me; I had to run !pip install langdetect first.
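For reference, a minimal setup sketch for a notebook environment such as Colab (seeding DetectorFactory is optional, but it makes langdetect's results reproducible between runs):

!pip install langdetect

from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0  # optional: deterministic detection results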
