Removing a lot of rows from a dataframe in Python

Question

I am trying to remove non-English tweets from a large dataset as efficiently as possible. I tried building a list of the rows that are not English and then removing them, but removing each tweet takes a long time (the langid.classify() function is not the problem).

import langid

def removeLanguage(df):
  rowsToDelete = []
  # Collect the index labels of tweets that are not classified as English
  for i in df.index:
    text = df['tweet'][i]
    try:
      if langid.classify(text)[0] != 'en':
        rowsToDelete.append(i)
    except ValueError:
      rowsToDelete.append(i)

  # This is the slow part: one df.drop() call per row
  for i in rowsToDelete:
    df.drop(i, inplace=True)
  return df

newDf = removeLanguage(inputDf).reset_index(drop=True)

Is there a more efficient way to remove a set of rows from a DataFrame than df.drop() ?

Answer 1

df.drop is very efficient, but I would also use something like this:

df = df[df['tweet'].apply(lambda t: langid.classify(t)[0] == 'en')]
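
To make the idea concrete, here is a minimal sketch of filtering with a boolean mask instead of dropping rows one at a time. It assumes the text lives in a 'tweet' column, as in the question. The names keep_english and toy_classify are made up for illustration; toy_classify is only a stand-in with the same (language, score) return shape as langid.classify, so the example runs without langid installed. In real use you would pass langid.classify instead.

```python
import pandas as pd

def keep_english(df, classify, column="tweet"):
    """Return a new DataFrame containing only rows classified as English."""
    def is_english(text):
        try:
            return classify(text)[0] == "en"
        except ValueError:
            # Same choice as in the question: unclassifiable rows are removed
            return False

    mask = df[column].apply(is_english)      # one True/False per row
    return df[mask].reset_index(drop=True)   # filter once, then renumber


def toy_classify(text):
    # Hypothetical stand-in for langid.classify, NOT a real language detector
    return ("en", 1.0) if text.isascii() else ("xx", 1.0)


tweets = pd.DataFrame({"tweet": ["hello world", "こんにちは", "good morning"]})
english = keep_english(tweets, toy_classify)
```

The point is that the DataFrame is rebuilt once rather than once per removed row. If you already have the list of index labels, passing the whole list to a single call, for example df.drop(index=rowsToDelete), should avoid the per-row cost in the same way.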

solution1
0 2021-11-14 06:15:49
