Remove all numbers except for the ones combined to string using python regex

Question

I am trying to use a regex to remove a word, whitespace, special characters and numbers, but not digits that are part of a word/string. For example:

ORIGIN
    1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
    61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn

//

My pattern removes all numbers, including the 1 in malwmrllp1:

import re

text_file = open('mytext.txt').read()
new_txt = re.sub('[\\b\\d+\\b\s*$+\sORIGIN$\W+]', '', text_file)

print(new_txt, len(new_txt))

My output is:

malwmrllplallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 109

The desired output should be: malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110

Answer 1

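Your current pattern is wrapped in a character class ([...]), so it matches any single one of the listed characters rather than the sequences you intended. Depending on whether underscores should appear in the result, try re.findall with raw-string notation and a pattern such as: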

\b(?!(?:ORIGIN|[_\d]+)\b)\w+

\b - Word-boundary;
(?!(?:ORIGIN|[_\d]+)\b) - Negative lookahead with nested non-capture group to match either ORIGIN or 1+ underscore/digit combinations before a trailing word-boundary;
\w+ - 1+ word-characters.

import re
  
text_file = """ORIGIN
    1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
    61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn

//"""

new_txt = ''.join(re.findall(r'\b(?!(?:ORIGIN|[_\d]+)\b)\w+', text_file))
print(new_txt, len(new_txt))

Prints:

malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
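To illustrate why the character class in the question misbehaves, here is a minimal sketch. The class used below is a simplified stand-in for the one in the question (an assumption made for readability), and the sample string is made up for the demonstration. Inside [...], every item is a single-character alternative, so the letters of ORIGIN are removed wherever they occur, not just as a whole word.

```python
import re

# Made-up sample; "RING" reuses letters that also occur in "ORIGIN".
sample = "ORIGIN 1 abc1 RING"

# Simplified version of the question's class (an assumption for this sketch):
# any digit, any whitespace, any non-word character, or one of O, R, I, G, N.
as_class = re.sub(r"[\d\sORIGIN\W]", "", sample)

# The pattern from this answer: skip whole tokens that are ORIGIN or
# consist only of digits/underscores, keep every other run of word characters.
as_tokens = "".join(re.findall(r"\b(?!(?:ORIGIN|[_\d]+)\b)\w+", sample))

print(as_class)   # abc
print(as_tokens)  # abc1RING
```

The class version loses both the digit in abc1 and the unrelated word RING, while the lookahead version only drops the standalone tokens.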

Answer 2

Using a regular expression for this is an interesting academic exercise, but extending it is fraught with danger unless you are very familiar with the technique.

This answer may look long-winded but you should be able to see how easy it would be to extend it so that other tokens/patterns can be excluded or included. It's also readily maintainable because anyone else having to modify the code isn't going to get a migraine while trying to figure out how the RE works.

FILENAME = 'mytext.txt'

def keep(t):
    # Drop purely numeric tokens and the ORIGIN and // markers; keep the rest.
    return not (t.isdigit() or t in ('ORIGIN', '//'))

with open(FILENAME) as f:
    new_txt = ''.join(filter(keep, f.read().split()))
    print(new_txt, len(new_txt))

Output:

malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
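As a rough sketch of the kind of extension mentioned above, the excluded tokens could be moved into a set so that adding another marker is a one-line change. The names EXCLUDED and clean, and the extra LOCUS token, are hypothetical and only serve the illustration; the sample text is the one from the question.

```python
# Hypothetical names for this sketch: EXCLUDED, keep, clean.
EXCLUDED = {"ORIGIN", "//"}

def keep(token, excluded=EXCLUDED):
    # Keep a token unless it is purely numeric or explicitly excluded.
    return not (token.isdigit() or token in excluded)

def clean(text, excluded=EXCLUDED):
    # Split on any whitespace, filter the tokens, and glue the rest together.
    return "".join(t for t in text.split() if keep(t, excluded))

sample = """ORIGIN
    1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
    61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn

//"""

result = clean(sample)
print(result, len(result))

# Excluding one more (hypothetical) marker only needs a larger set.
print(clean("LOCUS 12 abc", EXCLUDED | {"LOCUS"}))  # abc
```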

Answer 3

Another idea:

new_txt = re.sub(r'[\W_]+|\b(?:\d+|ORIGIN)\b', '', text_file)

This strips out every run of non-word characters and underscores, or a run of digits / the word "ORIGIN" when it sits between word boundaries.

See this demo at tio.run (the regex is very basic; an explanation is at regex101).
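One difference from a whitespace-split filter may be worth keeping in mind: because [\W_]+ is applied everywhere, this pattern also removes underscores and punctuation inside a token. The short sketch below uses a made-up sample string to show that; it is an illustration, not part of the original answer.

```python
import re

PATTERN = r"[\W_]+|\b(?:\d+|ORIGIN)\b"

# Made-up sample containing an underscore and a hyphen inside tokens.
sample = "ab_c1 12 x-y"

# Regex substitution: separators, underscores, punctuation and the
# standalone number are all removed.
by_regex = re.sub(PATTERN, "", sample)

# Whitespace-split filter: only whole tokens are dropped, so characters
# inside the remaining tokens survive.
by_tokens = "".join(
    t for t in sample.split() if not t.isdigit() and t != "ORIGIN"
)

print(by_regex)   # abc1xy
print(by_tokens)  # ab_c1x-y
```

For the sequence data in the question both approaches give the same result, since the tokens there contain only letters and digits.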

3 answers

solution1
1 ACCEPTED 2022-06-02 07:10:10

solution2
1 2022-06-02 07:30:04

solution3
1 2022-06-02 07:47:44