简体   繁体   中英

Fastest way to replace part of a string in Pandas series if it contains a word in a list

I have a large dataset all_transcripts with almost 3 million rows. One of the columns msgText contains written messages.

>>> all_transcripts['msgText']

['this is my first message']
['second message is here']
['this is my third message']

Furthermore, I have a list with 200+ words, called gemeentes .

>>> gemeentes
['first','second','third' ... ]

If a word in this list is contained in msgText , I want to replace it by another word. To do so, I created the function:

def replaceCity(text):
    newText = text.replace(plaatsnaam, 'woonplaats')
    return str(newText)

So, my desired output would look like:

['this is my woonplaats message']
['woonplaats message is here']
['this is my woonplaats message']

Currently, I am looping through the list and for every item in my list, apply the replaceCity function.

for plaatsnaam in gemeentes:
    global(plaatsnaam)
    all_transcripts['filtered_text'] = test.msgText.apply(replaceCity)

However, this takes very long, so does not seem to be efficient. Is there a faster way to perform this task?


This post ( Algorithm to find multiple string matches ) is similar, however my problem is different because:

  • here there is only one big piece small of text, while I have a dataset with many different rows

  • I want to replace words, rather than just finding the words.

Assuming all_transcripts is a pandas DataFrame :

all_transcripts['msgText'].str.replace('|'.join(gemeentes),'woonplaats')

Example:

all_transcripts = pd.DataFrame([['this is my first message'],
                                ['second message is here'],
                                ['this is my third message']],
                               columns=['msgText'])
gemeentes = ['first','second','third']

all_transcripts['msgText'].str.replace('|'.join(gemeentes),'woonplaats')

outputs

0    this is my woonplaats message
1       woonplaats message is here
2    this is my woonplaats message

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM