
How to remove characters with special strings using regular expression in Python?

I am trying to clean up a log, and I want to remove some special strings from each line.

Example:

%/h >  %/h Current value over threshold value
Pg/S >  Pg/S Current value over threshold value
Pg/S >  Pg/S  No. of pages paged in exceeds threshold
MB <  MB   min. avg. value over threshold value

I have tried a few patterns, but none of them seem to work. For example:

re.sub(r'\w\w\/\s>\s\w','',text)

Is there a good way to remove this pattern?

I want to remove the leading .../... > .../... part of each line.

I expect my output to contain only the message text:

   Current value over threshold value
   No. of pages paged in exceeds threshold
   min. avg. value over threshold value

Thank you for any idea!

Based on the pattern you are trying to match, it seems you always know where the unwanted prefix sits in the string. You can do this without a regex: use split and slicing to get the section of interest, then use join to turn it back into a string for your final result.

The code below does the following:

s.split() - splits on whitespace, creating a list in which each word is an entry

[3:] - slices the list, taking everything from the fourth element onward (indexing starts at 0)

' '.join() - converts the list back to a string, placing a space between each element

Demo:

s = "%/h >  %/h Current value over threshold value"
# Drop the first three tokens (unit, sign, unit) and rejoin the rest.
res = ' '.join(s.split()[3:])
print(res)

Output:

Current value over threshold value
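
To clean a whole log rather than a single line, the same split-and-slice idea can be applied line by line. The following is only a sketch: the names log, strip_prefix and cleaned are made up for illustration, and it assumes every line starts with exactly three whitespace-separated tokens (unit, comparison sign, unit) before the message.

```python
# Sample lines taken from the question.
log = """%/h >  %/h Current value over threshold value
Pg/S >  Pg/S Current value over threshold value
Pg/S >  Pg/S  No. of pages paged in exceeds threshold
MB <  MB   min. avg. value over threshold value"""


def strip_prefix(line):
    # Hypothetical helper: drop the first three tokens and rejoin the rest.
    return ' '.join(line.split()[3:])


cleaned = [strip_prefix(line) for line in log.splitlines()]
for line in cleaned:
    print(line)
```

Note that split() followed by ' '.join() also collapses any run of spaces inside the message into a single space, which may or may not be what you want.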

Assuming the structure of the file is:

[special-string] [< or >] [special-string] [message]

then this should work:

>>> import re
>>> rgx = re.compile(r'^[^<>]+[<>] +\S+ +', re.M)
>>>
>>> s = """
... %/h >  %/h Current value over threshold value
... Pg/S >  Pg/S Current value over threshold value
... Pg/S >  Pg/S  No. of pages paged in exceeds threshold
... MB <  MB   min. avg. value over threshold value
... """
>>>
>>> print(rgx.sub('', s))
Current value over threshold value
Current value over threshold value
No. of pages paged in exceeds threshold
min. avg. value over threshold value
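
One thing to be aware of with this pattern: [^<>]+ also matches newlines, so if the log contains a line without a < or > prefix, the match can run on into the next line and swallow it. The sketch below illustrates this with an invented line ("Service restarted") and a possible stricter variant; the names loose and strict and the stricter pattern itself are assumptions for illustration, not part of the answer above.

```python
import re

# Pattern from the answer above: [^<>]+ can also consume newlines.
loose = re.compile(r'^[^<>]+[<>] +\S+ +', re.M)

# Hypothetical stricter variant: every piece must stay on one line.
strict = re.compile(r'^\S+[ \t]*[<>][ \t]*\S+[ \t]+', re.M)

# "Service restarted" is an invented line with no comparison prefix.
log = "Service restarted\nMB <  MB   min. avg. value over threshold value"

loose_result = loose.sub('', log)    # the first line is removed as well
strict_result = strict.sub('', log)  # the first line is left untouched

print(loose_result)
print(strict_result)
```

If every line in your log is guaranteed to have the prefix, the original pattern is fine as it is.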

This is a relatively long regex, but it gets the job done.

[%\w][\/\w]\/?[\/\s\w]\s?\<?\>?\s\s[\w%]\/?[a-zA-Z%]\/?[\w]?\s\s?\s?

Demo: https://regex101.com/r/ayh19b/4
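
For reference, this is roughly how that pattern could be used from Python. It is a sketch only: the names prefix, lines and cleaned are invented here, and the pattern is tailored to the four sample lines from the question, so other unit names may not match.

```python
import re

# The long pattern from above, unchanged.
prefix = re.compile(
    r'[%\w][\/\w]\/?[\/\s\w]\s?\<?\>?\s\s[\w%]\/?[a-zA-Z%]\/?[\w]?\s\s?\s?'
)

lines = [
    "%/h >  %/h Current value over threshold value",
    "Pg/S >  Pg/S Current value over threshold value",
    "Pg/S >  Pg/S  No. of pages paged in exceeds threshold",
    "MB <  MB   min. avg. value over threshold value",
]

# count=1 removes only the first (leftmost) match on each line.
cleaned = [prefix.sub('', line, count=1) for line in lines]
for line in cleaned:
    print(line)
```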

Or you can do something like:

^[\s\S]*?(?=\w\w(?:\w|\.))

Demo: https://regex101.com/r/ayh19b/6
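
In Python this pattern needs the re.M flag so that ^ matches at the start of every line. A minimal sketch, with log, pattern and cleaned as illustrative names:

```python
import re

log = """%/h >  %/h Current value over threshold value
Pg/S >  Pg/S Current value over threshold value
Pg/S >  Pg/S  No. of pages paged in exceeds threshold
MB <  MB   min. avg. value over threshold value"""

# Lazily consume from the start of each line up to the first run of
# three word characters, or two word characters followed by a dot.
pattern = re.compile(r'^[\s\S]*?(?=\w\w(?:\w|\.))', re.M)

cleaned = pattern.sub('', log)
print(cleaned)
```

This relies on the prefix never containing three consecutive word characters. A longer unit name (a hypothetical "Mbps", say) would stop the match too early, so the prefix would not be removed cleanly.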
