I would like to remove text from my strings that starts with "\\", such as:
\xf, \africa\87, \ckat\x70, ...
Is there a way of doing this with a greedy pattern in re.sub? For example:
line = re.sub("[\.*]", "", line)
Thanks!
EDIT: input example:
" lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"
output:
" lorem ipsum lorem ipsum"
If I understand your question correctly, you want to remove every word that begins with a non-ASCII character from your sentences.
You can do this in a single pass with a comprehension that filters on the ordinal of each word's first character, without using a regex:
>>> data = " lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"
>>> ' '.join(e for e in data.split() if 31 < ord(e[0]) < 127)
'lorem ipsum lorem ipsum'
The expression to match is:
[\b\\][\w]+,?
With your lorem ipsum input text, the expression above matches only the inner words you want to remove :)
I have extended the character class so that it also matches , and . inside the string, and used * for zero or more characters after the \\:
[\b\\][\w.,]*
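A minimal sketch of how the second expression could be applied with re.sub, assuming the input holds literal backslashes (hence the raw string). The helper name strip_backslash_tokens is made up for illustration. Note that inside a character class \b means a backspace character, not a word boundary, so the class matches either a backspace or a backslash. Because the match stops at the next backslash, a chain like \xe2\x80\x9c is removed as three consecutive matches, and the surrounding spaces are left behind, so the sketch collapses whitespace afterwards.

```python
import re

# Pattern from the answer above; inside [...] the \b is a backspace, not a word boundary.
pattern = re.compile(r"[\b\\][\w.,]*")

def strip_backslash_tokens(text):
    """Remove every backslash-led run, then collapse the leftover whitespace."""
    without_tokens = pattern.sub("", text)
    return " ".join(without_tokens.split())

line = r" lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"
print(strip_backslash_tokens(line))  # lorem ipsum lorem ipsum
```

Collapsing whitespace also drops the leading space of the original string; skip that step if the exact spacing matters.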
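import re

# Match a backslash, the run of non-whitespace after it, and any trailing whitespace.
regex = re.compile(r"""
\\\S+\s*
""", re.VERBOSE)
line = r" lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"
replaced = regex.sub("", line)  # " lorem ipsum lorem ipsum"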
Note that you need to tell Python to treat the '\\' as a regular character rather than as an escape character. This is done by adding the r prefix in front of the string literal.
I also assume that you want to remove all the text beginning with '\\' up to and including the whitespace that follows it.
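To illustrate why the r prefix matters, here is a small sketch comparing the same literal written with and without it. The variable names raw_line and cooked_line are only for illustration. Without the prefix, Python interprets \xe2, \x80 and \x9c as single characters, so the string contains no backslash at all and a backslash-based regex has nothing to match.

```python
import re

regex = re.compile(r"\\\S+\s*")

raw_line = r" lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"    # backslashes are literal text
cooked_line = " lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"  # escapes become single characters

print("\\" in raw_line)                           # True
print("\\" in cooked_line)                        # False
print(regex.sub("", raw_line))                    # " lorem ipsum lorem ipsum"
print(regex.sub("", cooked_line) == cooked_line)  # True: nothing was removed
```

If your data looks like cooked_line, filtering on the character ordinals is the approach that applies, not the backslash regex.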