How to remove text between two specific words in a dataframe column in Python

Question

I am cleaning text in a column. I need to remove all words between the words "Original" and "Subject" wherever they appear in the column, which is only in some of the rows.

I am currently trying

   a = df['textcol']
   import re
   df['textcol'] =re.sub('Original.*?Subject','',str(a), flags=re.DOTALL)

This makes the string in every row exactly the same as the first row, instead of looking at each row individually and altering it.

Answer 1

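The attempt fails because str(a) turns the whole Series into one string, so re.sub returns a single value that is then assigned to every row. Use Series.str.replace directly, which applies the pattern to each row: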

df['textcol'] = df['textcol'].str.replace(r'(?s)Original.*?Subject', '', regex=True)

Here, (?s) is the inline modifier form of re.DOTALL / re.S, so there is no need to import re. It lets . match line breaks as well. The .*? part matches zero or more characters, as few as possible.
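As a rough illustration of the per-row behaviour, here is a minimal sketch on made-up data. Only the column name textcol comes from the question; the sample strings are assumptions chosen to cover a multi-line match, a row without the markers, and a row with two matches.

```python
import pandas as pd

# Hypothetical sample rows; only the column name 'textcol' is from the question.
df = pd.DataFrame({
    "textcol": [
        "Hello Original Message\nFrom: someone\nSubject: meeting notes",  # match spans lines
        "No markers in this row",                                         # left untouched
        "A Original x Subject B Original y Subject C",                    # two separate matches
    ]
})

# (?s) lets '.' cross line breaks; '.*?' stops at the nearest 'Subject'.
cleaned = df["textcol"].str.replace(r"(?s)Original.*?Subject", "", regex=True)

for before, after in zip(df["textcol"], cleaned):
    print(repr(before), "->", repr(after))
```

Each row is processed on its own, rows without both words are returned unchanged, and the non-greedy .*? removes each Original...Subject span separately rather than everything from the first Original to the last Subject.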

If Original and Subject are passed as variables containing literal text, wrap them in re.escape so that any regex metacharacters in them are matched literally:

import re
# ... etc. ...
start = "Original"
end = "Subject"
df['textcol'] = df['textcol'].str.replace(fr'(?s){re.escape(start)}.*?{re.escape(end)}', '', regex=True)
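To show why re.escape matters, here is a small sketch where the markers contain regex metacharacters. The marker strings "[Original]" and "(Subject)" and the sample row are hypothetical, picked only because brackets and parentheses have special meaning in a pattern.

```python
import re
import pandas as pd

# Hypothetical markers containing regex metacharacters.
start = "[Original]"
end = "(Subject)"

# Hypothetical sample row; only the column name 'textcol' is from the question.
df = pd.DataFrame({"textcol": ["keep [Original] drop this (Subject) keep too"]})

# re.escape turns '[', ']', '(' and ')' into literal characters in the pattern.
pattern = fr"(?s){re.escape(start)}.*?{re.escape(end)}"
df["textcol"] = df["textcol"].str.replace(pattern, "", regex=True)

print(repr(df["textcol"].iloc[0]))
```

Without the escaping, "[Original]" would be read as a character class and "(Subject)" as a capture group, so the pattern would no longer match the literal markers.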

solution1
4 ACCEPTED 2021-10-06 21:12:42
