Fastest way to replace words in a Pandas Series if a string contains a word from a list

Question

I have a large dataset, all_transcripts, with nearly 3 million rows. One of its columns, msgText, contains written messages.

>>> all_transcripts['msgText']

['this is my first message']
['second message is here']
['this is my third message']

In addition, I have a list of more than 200 words called gemeentes.

>>> gemeentes
['first','second','third' ... ]

If a word from this list occurs in msgText, I want to replace it with another word. To do that, I wrote this function:

def replaceCity(text):
    newText = text.replace(plaatsnaam, 'woonplaats')
    return str(newText)

So the output I want looks like:

['this is my woonplaats message']
['woonplaats message is here']
['this is my woonplaats message']

Currently, I loop over the list and apply the replaceCity function for each item in it.

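# plaatsnaam is a module-level variable, so replaceCity can read it without a global statement
all_transcripts['filtered_text'] = all_transcripts['msgText']
for plaatsnaam in gemeentes:
    all_transcripts['filtered_text'] = all_transcripts['filtered_text'].apply(replaceCity)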

However, this takes a very long time, so it does not seem efficient. Is there a faster way to perform this task?

This post (Algorithm to find multiple string matches) is similar, but my question is different because:

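There, only a single short piece of text is involved, whereas I have a dataset with many different rows.
I want to replace the words, not just find them.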

Answer 1

Assuming all_transcripts is a pandas DataFrame:

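# regex=True is needed on pandas 2.0+, where str.replace treats the pattern literally by default
all_transcripts['msgText'].str.replace('|'.join(gemeentes), 'woonplaats', regex=True)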
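
To see what this one-liner does, it can help to look at the pattern that '|'.join(gemeentes) produces. The following is a minimal sketch using only the standard re module on a single string; the three-word list is the sample from the question, and the names pattern and replaced are illustrative, not part of the original answer.

```python
import re

gemeentes = ['first', 'second', 'third']

# Joining with '|' builds one regex alternation that matches any listed word
pattern = '|'.join(gemeentes)
print(pattern)  # first|second|third

# A single pass over the text replaces every match, instead of one pass per word
replaced = re.sub(pattern, 'woonplaats', 'this is my first message')
print(replaced)  # this is my woonplaats message
```

The speed-up over the loop in the question presumably comes from scanning each message once with a combined pattern rather than once per word in the list.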

Example:

import pandas as pd

all_transcripts = pd.DataFrame([['this is my first message'],
                                ['second message is here'],
                                ['this is my third message']],
                               columns=['msgText'])
gemeentes = ['first','second','third']

all_transcripts['msgText'].str.replace('|'.join(gemeentes), 'woonplaats', regex=True)

Output:

0    this is my woonplaats message
1       woonplaats message is here
2    this is my woonplaats message
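
One caveat with a plain '|'.join pattern is that it also matches inside longer words, and any list entry containing regex metacharacters would be interpreted as regex syntax. The sketch below is one possible variation, not part of the original answer: it escapes each word with re.escape and wraps the alternation in \b word boundaries. The third sample row ("thirdly, ...") and the filtered_text column name are assumptions added for illustration.

```python
import re
import pandas as pd

all_transcripts = pd.DataFrame({'msgText': ['this is my first message',
                                            'second message is here',
                                            'thirdly, nothing to replace']})
gemeentes = ['first', 'second', 'third']

# Escape each word so metacharacters are matched literally, then require
# word boundaries so 'third' does not match inside 'thirdly'
pattern = r'\b(?:' + '|'.join(map(re.escape, gemeentes)) + r')\b'

all_transcripts['filtered_text'] = all_transcripts['msgText'].str.replace(
    pattern, 'woonplaats', regex=True)
print(all_transcripts['filtered_text'].tolist())
```

Whether word boundaries are wanted depends on the data; if place names should also be replaced inside longer tokens, the unanchored pattern from the answer is the better fit.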

Solution 1: 2, accepted, 2019-05-01 10:08:40
