Dropping non-English text with langdetect
I am trying to use langdetect to drop all the non-English rows from my text.
from langdetect import detect

def det(x):
    try:
        language = detect(x)
    except:
        # detect raises on empty or undetectable text
        language = 'Other'
    return language

df['langue'] = df['Tweet'].apply(det)
filtered_for_english = df.loc[df['langue'] == 'en']
The above code is what I have tried. It detects the language used in each tweet but does not drop the non-English tweets from my data frame.
The resulting data frame:
0 es
1 es
2 es
3 en
4 en
..
14272 en
14273 en
14274 en
14275 it
14276 en
Name: langue, Length: 14277, dtype: object
How can I fix this code?
This solution worked well for me.
from langdetect import detect

def detect_english(text):
    try:
        # True only when langdetect identifies the text as English
        return detect(text) == 'en'
    except:
        # empty or undetectable text counts as non-English
        return False
Apply the function to the relevant column of the pandas DataFrame as follows to drop the non-English rows:
df = df[df['text'].apply(detect_english)]
I had 5000 samples, and the above implementation removed the non-English ones, leaving 4721 rows of English text.
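A minimal, self-contained sketch of the same boolean-mask pattern. The detect_english predicate here is a hypothetical stand-in (a simple prefix check) so the snippet runs without langdetect installed; in the real answer it calls detect(text) == 'en':

```python
import pandas as pd

def detect_english(text):
    # Stand-in predicate for illustration only: pretend rows
    # prefixed with "en:" are English. The real version would
    # call langdetect's detect() and compare against 'en'.
    return text.startswith("en:")

df = pd.DataFrame({"text": ["en: hello world", "es: hola mundo", "en: good morning"]})

# df['text'].apply(detect_english) yields a boolean Series;
# indexing with it keeps only the rows where it is True.
df = df[df["text"].apply(detect_english)]
print(len(df))  # 2 rows survive
```

One caveat with the real detect(): langdetect is non-deterministic by default, so short or ambiguous texts can get different labels across runs; setting langdetect.DetectorFactory.seed to a fixed value before detecting makes the results reproducible.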
Note: a direct import in Colab never worked for me. I had to run
!pip install langdetect
first.