How to delete the words between two delimiters?

Question

I have noisy data, something like:

<@ """@$ FSDF >something something <more noise>

Now I just want to extract "something something". Is there a way to delete the text between the two delimiters "<" and ">"?

Answer 1

Use regular expressions :

>>> import re
>>> s = '<@ """@$ FSDF >something something <more noise>'
>>> re.sub('<[^>]+>', '', s)
'something something '

[Update]

If you tried a pattern like <.+>, where the dot means any character and the plus sign means one or more, you know it does not work.

>>> re.sub(r'<.+>', '', s)
''

Why? It happens because regular expression quantifiers are "greedy" by default. The .+ matches as much as it can, so the match runs from the first < to the last > in the string, swallowing every > in between, and this is not what we want. We want to match < and stop at the next >, so we use the [^x] pattern, which means "any character but x" (x being >).

Placing ? after a quantifier makes the match "non-greedy", so this has the same effect:

>>> re.sub(r'<.+?>', '', s)
'something something '

The former is more explicit, the latter is less typing. Be aware that on its own x? means zero or one occurrence of x; only after a quantifier such as + or * does ? mean non-greedy.
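To see what each of the three patterns above actually matches, here is a small sketch using re.findall on the string from the question. The variable names (greedy, negated, lazy, multi) are only for illustration. The last two lines show one case where the two working patterns differ, assuming no flags are passed: the dot does not match a newline by default, while [^>] does.

```python
import re

s = '<@ """@$ FSDF >something something <more noise>'

# Greedy: one match running from the first "<" to the last ">".
greedy = re.findall(r'<.+>', s)

# Negated character class: stops at the first ">" after each "<".
negated = re.findall(r'<[^>]+>', s)

# Non-greedy quantifier: same matches as the negated class here.
lazy = re.findall(r'<.+?>', s)

print(greedy)
print(negated)
print(lazy)

# A delimiter pair that spans a line break.
multi = '<a\nb>kept'
print(repr(re.sub(r'<.+?>', '', multi)))    # "." stops at the newline, nothing removed
print(repr(re.sub(r'<[^>]+>', '', multi)))  # [^>] crosses the newline
```

If the noise can span several lines, the negated class is probably the safer choice; passing re.DOTALL to the non-greedy version would be another option.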

Answer 2

Of course, you can use regular expressions.

import re
s = '<@ """@$ FSDF >something something <more noise>'  # your string here
t = re.sub('<.*?>', '', s)

The above code should do it.

Answer 3

First, thank you Paulo Scardine; I used your regular expression to do something great. The idea was to have a tag-free LibreOffice po file for printing purposes, so I wrote the following script, which strips the tags from the help file to give a smaller, easier-to-read one.

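import re

# Read the whole input file into memory.
with open('a.csv') as f:
    text = f.read()

# Replace every <...> tag with a single space.
clean = re.sub('<[^>]+>', ' ', text)

# Write the cleaned text to a new file.
with open('b.csv', 'w') as f:
    f.write(clean)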
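Replacing each tag with a space tends to leave runs of spaces behind. As a possible follow-up, here is a minimal sketch that also collapses the leftover whitespace. The names TAG and clean_line are hypothetical, not part of the script above, and it assumes that squeezing all whitespace down to single spaces is acceptable for your file.

```python
import re

# Compiled once so it can be reused for every line.
TAG = re.compile(r'<[^>]+>')

def clean_line(line):
    """Replace tags with spaces, then collapse runs of whitespace."""
    no_tags = TAG.sub(' ', line)
    # split() with no arguments drops leading/trailing whitespace
    # and splits on any run of whitespace.
    return ' '.join(no_tags.split())

print(clean_line('<b>Save</b> the <i>file</i>'))
```

Note that this also strips the trailing newline, so you would add it back when writing line by line.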

Answer 4

>>> import re
>>> my_str = '<@ """@$ FSDF >something something <more noise>'
>>> re.sub('<.*?>', '', my_str)
'something something '

The re.sub function takes a regular expression and replaces every match in the string with the second parameter. In this case, we search for all characters between < and > ( '<.*?>' ) and replace them with nothing ( '' ).

The ? after the * quantifier makes the search non-greedy.

See the re module documentation for more.

If the "noise" is actually HTML tags, I suggest looking into BeautifulSoup.
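For the HTML case, here is a rough sketch of the same idea using only the standard library's html.parser instead of BeautifulSoup (a swapped-in technique, so the example runs without installing anything). The names TextExtractor and strip_tags are made up for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text found between tags.
        self.parts.append(data)

def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(strip_tags('<p>something <b>something</b></p>'))
```

A real parser copes with things a regular expression does not, such as a > inside an attribute value, but it assumes the input really is HTML, which the string in the question is not.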

Answer 5

Just for interest, you could write some code such as:

with open('blah.txt','w') as f:
    f.write("""<sdgsa>one<as<>asfd<asdf>
<asdf>two<asjkdgai><iasj>three<fasdlojk>""")

def filter_line(line):
    count=0
    ignore=False
    result=[]
    for c in line:
        if c==">" and count==1:
            count=0
            ignore=False
        if not ignore:
            result.append(c)
        if c=="<" and count==0:
            ignore=True
            count=1
    return "".join(result)

with open('blah.txt') as f:
    print("".join(map(filter_line, f.readlines())))

>>> 
<>one<>asfd<>
<>two<><>three<>
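As the output shows, the loop above keeps the < and > characters themselves. If you want them gone too, as in the question, a small variation might look like the sketch below. The function name strip_delimited and its start/end parameters are hypothetical, and it assumes delimiters are not meant to nest: a second < inside a delimited run is simply treated as noise.

```python
def strip_delimited(line, start="<", end=">"):
    """Drop everything from `start` through the next `end`, delimiters included."""
    result = []
    inside = False
    for c in line:
        if inside:
            # Skip characters until the closing delimiter is seen.
            if c == end:
                inside = False
        elif c == start:
            inside = True
        else:
            result.append(c)
    return "".join(result)

print(strip_delimited("<sdgsa>one<as<>asfd<asdf>"))
print(strip_delimited('<@ """@$ FSDF >something something <more noise>'))
```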

5 answers

solution1
52 ACCEPTED 2012-01-09 05:55:25

solution2
14 2012-01-09 05:56:05

solution3
5 2013-01-19 19:03:24

solution4
3 2012-01-09 05:57:33

solution5
1 2012-01-09 06:07:17
