I have a large dataset all_transcripts
with almost 3 million rows. One of the columns msgText
contains written messages.
>>> all_transcripts['msgText']
['this is my first message']
['second message is here']
['this is my third message']
Furthermore, I have a list with 200+ words, called gemeentes
.
>>> gemeentes
['first','second','third' ... ]
If a word in this list is contained in msgText
, I want to replace it by another word. To do so, I created the function:
def replaceCity(text):
newText = text.replace(plaatsnaam, 'woonplaats')
return str(newText)
So, my desired output would look like:
['this is my woonplaats message']
['woonplaats message is here']
['this is my woonplaats message']
Currently, I am looping through the list and for every item in my list, apply the replaceCity
function.
for plaatsnaam in gemeentes:
global(plaatsnaam)
all_transcripts['filtered_text'] = test.msgText.apply(replaceCity)
However, this takes very long, so does not seem to be efficient. Is there a faster way to perform this task?
This post ( Algorithm to find multiple string matches ) is similar, however my problem is different because:
here there is only one big piece small of text, while I have a dataset with many different rows
I want to replace words, rather than just finding the words.
Assuming all_transcripts
is a pandas DataFrame
:
all_transcripts['msgText'].str.replace('|'.join(gemeentes),'woonplaats')
Example:
all_transcripts = pd.DataFrame([['this is my first message'],
['second message is here'],
['this is my third message']],
columns=['msgText'])
gemeentes = ['first','second','third']
all_transcripts['msgText'].str.replace('|'.join(gemeentes),'woonplaats')
outputs
0 this is my woonplaats message
1 woonplaats message is here
2 this is my woonplaats message
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.