I am trying to remove non-English tweets from a large dataset as efficiently as possible. I tried building a list of the rows that are not English and then removing them, but removing each tweet takes a long time (the langid.classify() function is not the problem).
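import langid

def removeLanguage(df):
    rowsToDelete = []
    # Collect the index label of every row that is not classified as English.
    for i in df.index:
        text = df['tweet'][i]
        try:
            if langid.classify(text)[0] != 'en':
                rowsToDelete.append(i)
                continue
        except ValueError:
            # Rows that cannot be classified are removed as well.
            rowsToDelete.append(i)
            continue
    # This is the slow part: one df.drop call per row.
    for i in rowsToDelete:
        df.drop(i, inplace=True)
    return df

newDf = removeLanguage(inputDf).reset_index(drop=True)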
Is there a more efficient way to remove a set of rows from a DataFrame than df.drop()?
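df.drop is very efficient when it is called once with all the labels; calling it once per row is what makes it slow. I would also use something like a boolean mask that keeps only the English rows:
df = df[df['tweet'].apply(lambda text: langid.classify(text)[0] == 'en')]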
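As a rough illustration of both approaches, here is a minimal sketch. To keep it runnable without langid installed, it uses `classify_stub`, a hypothetical stand-in that only mimics the `(language, score)` tuple returned by `langid.classify`. Its word list and its `ValueError` on empty input are assumptions made for the example, not langid's documented behavior. The helper `is_english` and the sample tweets are likewise made up.

```python
import pandas as pd


# Hypothetical stand-in for langid.classify so the sketch runs on its own.
# It returns a (language, score) tuple, the same shape langid.classify returns.
def classify_stub(text):
    if not isinstance(text, str) or not text.strip():
        # Assumption: raise ValueError here only to exercise the except branch.
        raise ValueError("cannot classify empty or non-string input")
    english_words = {"the", "is", "a", "this", "good", "morning"}
    words = set(text.lower().split())
    return ("en", 1.0) if words & english_words else ("xx", 1.0)


def is_english(text, classify=classify_stub):
    """Return True only when the classifier labels the text as English."""
    try:
        return classify(text)[0] == "en"
    except ValueError:
        # As in the question, rows that cannot be classified are removed.
        return False


df = pd.DataFrame({"tweet": ["this is a tweet", "bonjour tout le monde",
                             "good morning", "", "hola amigos"]})

# Option 1: build a boolean mask once and keep the matching rows.
mask = df["tweet"].apply(is_english)
english_df = df[mask].reset_index(drop=True)

# Option 2: collect the labels to remove and drop them in a single call.
rows_to_delete = df.index[~mask.to_numpy()]
dropped_df = df.drop(index=rows_to_delete).reset_index(drop=True)
```

With the real library, passing `classify=langid.classify` to `is_english` should be the only change needed. Both options build the result in one pass instead of rebuilding the DataFrame once per removed row, so the choice between them is mostly a matter of taste.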