Searching for words in a dataframe with str.contains and a regular expression is slow; is there a better way?

Question

I have a dataset of more than 2 million rows. I am trying to use a regular expression to find the rows that contain two words, for example:

df1 = df[df['my_column'].str.contains(r'(?=.*first_word)(?=.*second_word)')]

However, when I run this in a Jupyter notebook, it either takes more than a minute to return the rows or crashes the kernel, and I have to try again.

Is there a more efficient way to return the rows of a dataframe that contain both words?

Answer 1

Use:

df['my_column'].apply(lambda x: all(l in x for l in ['first_word', 'second_word']) )

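This checks that every word in the list appears in the value of my_column, without an awkward regular expression. The expression returns a boolean Series, so index df with it to get the matching rows. Note that it assumes every value in my_column is a string; a missing value (NaN) would raise a TypeError.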
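As a minimal sketch of how the mask can be used to select rows, here are two ways to build it. The sample data and the names `words`, `mask_apply`, and `mask_contains` are assumptions for illustration, not part of the original question. The second option swaps in a different technique: one literal substring test per word using `str.contains(..., regex=False)`, combined with `&`. Both options match substrings rather than whole words, as the original regular expression does.

```python
import pandas as pd

# Hypothetical sample data standing in for the real 2-million-row column.
df = pd.DataFrame({
    'my_column': [
        'first_word and second_word',
        'only first_word here',
        'second_word comes before first_word',
        None,  # missing value, to show the guards below
    ]
})

words = ['first_word', 'second_word']

# Option 1: the apply-based mask from the answer, guarded against
# non-string values such as None/NaN.
mask_apply = df['my_column'].apply(
    lambda x: isinstance(x, str) and all(w in x for w in words)
)

# Option 2: one literal (non-regex) substring test per word, AND-ed together.
# na=False treats missing values as "no match".
mask_contains = pd.Series(True, index=df.index)
for w in words:
    mask_contains &= df['my_column'].str.contains(w, regex=False, na=False)

# Index the dataframe with either mask to get the matching rows.
df1 = df[mask_apply]
print(df1)
```

Which option is faster depends on the data and the pandas version, so it is worth timing both (for example with `%timeit` in the notebook) on a sample of the real column.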