
How to remove text between two specific words in a DataFrame column in Python

I am cleaning text in a column and need to remove all words between the words "Original" and "Subject" wherever they appear. Only some of the rows contain them.

I am currently trying

   a = df['textcol']
   import re
   df['textcol'] =re.sub('Original.*?Subject','',str(a), flags=re.DOTALL)

This makes the string in every row exactly the same as the first row, instead of looking at each row individually and altering it.

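str(a) converts the whole Series into a single string, so re.sub runs once and that one result is assigned to every row. You need to use Series.str.replace directly, which applies the pattern to each row: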

df['textcol'] = df['textcol'].str.replace(r'(?s)Original.*?Subject', '', regex=True)

Here, (?s) is the inline modifier form of re.DOTALL / re.S, so there is no need to import re. It makes . match line breaks as well. The .*? matches zero or more characters, as few as possible, so each match stops at the nearest following Subject.
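As a rough illustration of the per-row behaviour, here is a minimal sketch on a made-up DataFrame. The sample strings are assumptions for demonstration only and are not from the question's data.

```python
import pandas as pd

# Hypothetical sample data: one multi-line match, one row without
# the markers, and one row with two separate matches.
df = pd.DataFrame({
    "textcol": [
        "Hello Original Message\nFrom: a@example.com\nSubject: Meeting notes",
        "No markers in this row",
        "Start Original foo Subject middle Original bar Subject end",
    ]
})

# (?s) lets . cross line breaks; .*? stops at the nearest "Subject".
cleaned = df["textcol"].str.replace(r"(?s)Original.*?Subject", "", regex=True)

print(cleaned.tolist())
```

Rows without both markers are left unchanged, and the marker words themselves are removed together with the text between them.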

If Original and Subject need to be passed as variables containing literal text, wrap them in re.escape so that any regex metacharacters in them are matched literally:

import re
# ... etc. ...
start = "Original"
end = "Subject"
df['textcol'] = df['textcol'].str.replace(fr'(?s){re.escape(start)}.*?{re.escape(end)}', '', regex=True)
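To show why re.escape matters, here is a small sketch where the delimiters contain regex metacharacters. The helper name remove_between and the bracketed markers are hypothetical, chosen only for this example.

```python
import re
import pandas as pd

def remove_between(series, start, end):
    """Remove start, end, and everything between them in each row.

    Hypothetical helper: start and end are treated as literal text.
    """
    pattern = fr"(?s){re.escape(start)}.*?{re.escape(end)}"
    return series.str.replace(pattern, "", regex=True)

# Made-up data: "[" and "(" would be regex syntax if left unescaped.
s = pd.Series(["keep [Original] drop this (Subject) keep too", "untouched"])
result = remove_between(s, "[Original]", "(Subject)")

print(result.tolist())
```

Without re.escape, "[Original]" would be read as a character class and "(Subject)" as a capture group, so the pattern would match different text than intended.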

