
re.sub greedy characters

I would like to remove the parts of my strings that start with a backslash ("\\"), such as:

 \xf, \africa\87, \ckat\x70, ...

Is there a way of doing this with greedy matching in re.sub?

eg:

line = re.sub("[\.*]", "", line)

Thanks!

EDIT: input example:

" lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"

output:

" lorem ipsum lorem ipsum"

If I understand your question correctly, you want to remove every word that begins with a non-ASCII character from your sentences.

You can do this without regex, in a single pass, using a generator expression that filters each word on the ordinal value of its first character:

>>> data = " lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"
>>> ' '.join(e for e in data.split() if 31 < ord(e[0]) < 127)
'lorem ipsum lorem ipsum'
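
A minimal sketch of the same filter wrapped in a function, in case it is easier to reuse. The name drop_non_ascii_words is made up for illustration and is not part of any library. Note that split() followed by ' '.join() also normalises whitespace, so the leading space shown in the question's expected output is lost.

```python
def drop_non_ascii_words(text):
    """Keep only the words whose first character is printable ASCII.

    Hypothetical helper for illustration only.
    """
    # split() with no argument never yields empty strings, so w[0] is safe.
    return " ".join(w for w in text.split() if 31 < ord(w[0]) < 127)


data = " lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"
result = drop_non_ascii_words(data)
print(repr(result))
```

If the original spacing matters, this approach is probably not the right fit and a regex substitution may be preferable.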

The expression to match is:

[\b\\][\w]+,?

Applied to your lorem ipsum input text, this expression matches only the inner tokens you want to remove :) Note that inside a character class, \b means the backspace character rather than a word boundary.

example rubular

I have extended the character class so that it also matches "." and "," inside the token, and used * (zero or more) after the \\:

[\b\\][\w.,]*

another example
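
The answer only gives the pattern, so here is a rough sketch of how it might be used with re.sub in Python. It assumes the text contains literal backslash characters (hence the raw string). The collapsing step is my own addition: the pattern does not consume whitespace, so removing a token leaves both neighbouring spaces behind.

```python
import re

# Assumption: the backslashes are literal characters in the text.
line = r" lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"

# The pattern from the answer, unchanged. Inside [...] the \b is a
# backspace character, so in practice only the backslash matters here.
pattern = re.compile(r"[\b\\][\w.,]*")

# Each of \xe2, \x80 and \x9csianhill7 is matched and removed in turn.
cleaned = pattern.sub("", line)

# Collapse the run of spaces left where the token used to be.
collapsed = re.sub(r" {2,}", " ", cleaned)
print(repr(collapsed))
```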

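import re

# Match a literal backslash, the run of non-whitespace after it,
# and any trailing whitespace.
regex = re.compile(r"""
                    \\\S+\s*
                    """, re.VERBOSE)
line = r" lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"
replaced = regex.sub("", line)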

Note that you need to tell Python to treat the '\\' as a regular character rather than as an escape character. This is done by putting an r in front of the string literal, which makes it a raw string.

I am also assuming that you want to remove everything from the '\\' up to and including the whitespace that follows it.
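
To make the raw-string point concrete, here is a small sketch (Python 3 assumed) comparing the same literal with and without the r prefix. The variable names are only for illustration. Without the prefix, Python turns each \xNN into a single character before the regex ever sees it, so there is no backslash left to match.

```python
import re

regex = re.compile(r"\\\S+\s*")

# Raw literal: the backslashes are real characters in the string.
raw_line = r" lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"
# Ordinary literal: each \xNN escape becomes one character.
escaped_line = " lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"

raw_result = regex.sub("", raw_line)
escaped_result = regex.sub("", escaped_line)

print(len(r"\xe2"), len("\xe2"))      # length of each form of the literal
print(repr(raw_result))               # the token is removed
print(escaped_result == escaped_line) # nothing matched, string unchanged
```

So whether a backslash-based regex helps depends on whether the data really contains backslashes or only looks that way when printed.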
