re.sub greedy characters

Question

I would like to remove the parts of my strings that start with "\\", such as:

 \xf, \africa\87, \ckat\x70, ...

Is there a way of doing this using greedy characters in re.sub ?

eg:

line = re.sub("[\.*]", "", line)

Thanks!

EDIT: input example:

" lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"

output:

" lorem ipsum lorem ipsum"

Answer 1

If I understand your question correctly, you want to remove every word that begins with a non-ASCII character from your sentences.

You can easily do this in a single pass, without regex, by splitting the string into words and keeping only those whose first character has a printable-ASCII ordinal:

>>> data = " lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"
>>> ' '.join(e for e in data.split() if 31 < ord(e[0]) < 127)
'lorem ipsum lorem ipsum'
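A few caveats with the one-liner above: it only looks at the first character of each word, and split/join drops leading whitespace and collapses runs of spaces. Here is a minimal sketch of the same idea as a function, assuming Python 3 `str` input. The names `drop_words` and `check_every_char` are made up for illustration and are not part of the original answer.

```python
def drop_words(text, check_every_char=False):
    """Drop whitespace-separated words that start with (or, optionally,
    contain) a character outside the printable ASCII range."""
    def printable(ch):
        return 31 < ord(ch) < 127

    kept = []
    for word in text.split():
        # Inspect either the whole word or just its first character.
        chars = word if check_every_char else word[0]
        if all(printable(ch) for ch in chars):
            kept.append(word)
    return ' '.join(kept)


data = " lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"
print(drop_words(data))  # lorem ipsum lorem ipsum
```

With `check_every_char=True`, a word such as "caf\xe9" is dropped as well, whereas the first-character test keeps it.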

Answer 2

The expression to match is:

[\b\\][\w]+,?

With your lorem ipsum input, this expression matches only the inner words you want to remove :)

example rubular

I have also extended the character class so that it matches . and , inside the string, and used * (zero or more) after the \\:

[\b\\][\w.,]*

another example
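For reference, here is a rough sketch of how the first expression could be applied from Python with re.sub, assuming the input is a raw string that really contains backslash characters. One thing to be aware of: inside a character class, Python's re treats \b as a backspace character rather than a word boundary, so `[\b\\]` matches a backslash or a literal backspace.

```python
import re

# First expression from this answer, written as a raw string.
pattern = re.compile(r"[\b\\][\w]+,?")

# Assumption: the text contains literal backslashes (raw string literal).
line = r" lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"

result = pattern.sub("", line)
print(repr(result))  # ' lorem ipsum  lorem ipsum'

# The space before and after the removed run both survive, so collapse
# the whitespace afterwards if that matters.
collapsed = " ".join(result.split())
print(repr(collapsed))  # 'lorem ipsum lorem ipsum'
```

Because the pattern does not consume the surrounding whitespace, a double space is left behind where the text was removed.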

Answer 3

import re

regex = re.compile(r"""
                    \\\S+\s*   # a backslash, the non-whitespace run after it, then any trailing whitespace
                    """, re.VERBOSE)
line = r" lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"
replaced = regex.sub("", line)

Note that you need to tell Python to treat the '\\' in the input as a regular character, not as the start of an escape sequence. This is done by adding the r in front of the string literal, which makes it a raw string.

I also assume that you want to remove all the text beginning with '\\' up to and including the whitespace characters that follow it.
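To make the raw-string point concrete, here is a small sketch comparing the same literal with and without the r prefix. The variable names `raw_line` and `plain_line` are only for illustration, and Python 3 is assumed.

```python
import re

pattern = re.compile(r"\\\S+\s*")

# Raw literal: the backslashes are real characters in the string.
raw_line = r" lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"
# Plain literal: \xe2, \x80 and \x9c are each turned into one character,
# so no backslash is left in the string at all.
plain_line = " lorem ipsum \xe2\x80\x9csianhill7 lorem ipsum"

print(repr(pattern.sub("", raw_line)))            # ' lorem ipsum lorem ipsum'
print("\\" in plain_line)                         # False
print(pattern.sub("", plain_line) == plain_line)  # True
```

So if the data arrives as a plain literal (or as already-decoded text), this regex has nothing to match, and an approach that tests character ordinals is the better fit.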

3 answers

solution1
3 ACCEPTED 2013-01-21 17:01:14

solution2
1 2013-01-21 16:39:03

solution3
1 2013-01-21 16:58:19
