
How to delete the words between two delimiters?

I have some noisy data, something like:

<@ """@$ FSDF >something something <more noise>

Now I just want to extract "something something". Is there a way to delete the text between the two delimiters "<" and ">"?

Use regular expressions:

>>> import re
>>> s = '<@ """@$ FSDF >something something <more noise>'
>>> re.sub('<[^>]+>', '', s)
'something something '
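Note that the result above still ends with the space that preceded the second noise block. If that leftover whitespace is unwanted too, one option is to strip it afterwards. This is only a minimal sketch; the helper name remove_delimited is made up for illustration:

```python
import re

def remove_delimited(text):
    """Drop every <...> span, then trim whitespace left at the edges."""
    # [^>]+ matches one or more characters that are not '>'
    return re.sub(r'<[^>]+>', '', text).strip()

s = '<@ """@$ FSDF >something something <more noise>'
print(remove_delimited(s))  # something something
```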

[Update]

If you tried a pattern like <.+>, where the dot means any character and the plus sign means one or more, you know it does not work.

>>> re.sub(r'<.+>', '', s)
''

Why? Regular expression quantifiers are "greedy" by default: .+ matches as many characters as it can, so the match runs from the first < all the way to the last > in the string, swallowing every > in between. That is not what we want. We want to match < and stop at the next >, so we use the [^x] pattern, which means "any character but x" (x being >).

Placing ? after a quantifier makes the match "non-greedy", so this has the same effect:

>>> re.sub(r'<.+?>', '', s)
'something something '

The previous pattern is more explicit; this one is less typing. Be aware that ? only has this meaning directly after a quantifier such as + or *. After a plain x, x? means zero or one occurrence of x.
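To see the three patterns side by side, here is a small sketch. The sample strings are invented for illustration. It also shows one difference between the two working patterns: the dot does not match a newline unless re.DOTALL is given, while [^>] does.

```python
import re

s = '<a>keep<b>'        # made-up sample with text between two noise blocks
multi = '<a\nb>keep'    # made-up sample with a newline inside the delimiters

# Greedy: .+ runs to the LAST '>', so 'keep' is swallowed as well.
greedy = re.sub(r'<.+>', '', s)

# Negated class: stops at the first '>' after each '<'.
negated = re.sub(r'<[^>]+>', '', s)

# Lazy quantifier: +? takes as few characters as possible.
lazy = re.sub(r'<.+?>', '', s)

# '.' skips newlines by default, so the lazy pattern finds no match here...
lazy_multi = re.sub(r'<.+?>', '', multi)
# ...unless re.DOTALL is passed; the negated class needs no flag.
lazy_dotall = re.sub(r'<.+?>', '', multi, flags=re.DOTALL)
negated_multi = re.sub(r'<[^>]+>', '', multi)

print(repr(greedy), repr(negated), repr(lazy))
print(repr(lazy_multi), repr(lazy_dotall), repr(negated_multi))
```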

Of course, you can use regular expressions.

import re
s = '<@ """@$ FSDF >something something <more noise>'  # your string here
t = re.sub('<.*?>', '', s)

The above code should do it.

First, thank you Paulo Scardine; I put your regex to good use. The idea was to get a tag-free LibreOffice po file for printing purposes. I wrote the following script, which strips the tags from the help file to produce a smaller, easier-to-read one.

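import re

# Read the whole input file into memory
with open('a.csv') as f:
    text = f.read()

# Replace each <...> tag with a space so adjacent words do not run together
clean = re.sub('<[^>]+>', ' ', text)

# Write the cleaned text to a new file
with open('b.csv', 'w') as f:
    f.write(clean)

>>> import re
>>> my_str = '<@ """@$ FSDF >something something <more noise>'
>>> re.sub('<.*?>', '', my_str)
'something something '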

The re.sub function takes a regular expression and replaces every match in the string with the second parameter. In this case, we are searching for all characters between < and > ( '<.*?>' ) and replacing them with nothing ( '' ).

In re, a ? placed after a quantifier such as * makes the search non-greedy.
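As a small illustration of how re.sub can be tuned, the sketch below precompiles the pattern and uses the optional count argument to limit how many matches are replaced. The variable names are only illustrative.

```python
import re

my_str = '<@ """@$ FSDF >something something <more noise>'

# Compiling once is convenient when the same pattern is applied many times.
noise = re.compile(r'<.*?>')

# Default: every match is replaced.
all_removed = noise.sub('', my_str)

# count=1: only the first match is replaced.
first_only = noise.sub('', my_str, count=1)

print(repr(all_removed))
print(repr(first_only))
```

Unlike <[^>]+>, the pattern <.*?> also matches an empty pair such as <>, because * allows zero characters.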

More about the re module.


If the "noise" is actually HTML tags, I suggest looking into BeautifulSoup.
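BeautifulSoup is a third-party package, where text extraction is usually done with its get_text() method. To keep the example free of dependencies, the sketch below swaps in the standard library's html.parser to show the same idea: let a parser find the tags instead of a regex. The class and function names are made up for illustration.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect only the text found between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of tags
        self.parts.append(data)

def strip_tags(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

print(strip_tags('<p>Hello <b>world</b>!</p>'))  # Hello world!
```

This only makes sense for real HTML. The sample string from the question is not valid markup, so a parser may treat it differently from the regex.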

Just for interest, you could write some code such as:

with open('blah.txt','w') as f:
    f.write("""<sdgsa>one<as<>asfd<asdf>
<asdf>two<asjkdgai><iasj>three<fasdlojk>""")

def filter_line(line):
    count=0
    ignore=False
    result=[]
    for c in line:
        if c==">" and count==1:
            count=0
            ignore=False
        if not ignore:
            result.append(c)
        if c=="<" and count==0:
            ignore=True
            count=1
    return "".join(result)

with open('blah.txt') as f:
    print("".join(map(filter_line, f.readlines())))

>>> 
<>one<>asfd<>
<>two<><>three<>
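As the output shows, this filter drops the noise but keeps the < and > characters themselves. If the delimiters should go as well, a variant along these lines might do. It is only a sketch: the function name and the open_ch / close_ch parameters are assumptions, not part of the code above.

```python
def strip_delimited(line, open_ch='<', close_ch='>'):
    """Remove everything from open_ch up to the next close_ch, delimiters included."""
    result = []
    inside = False
    for c in line:
        if c == open_ch:
            # Start (or continue) skipping; the delimiter itself is dropped
            inside = True
        elif c == close_ch and inside:
            # End of the skipped span; the delimiter itself is dropped
            inside = False
        elif not inside:
            # Ordinary character outside any delimited span
            result.append(c)
    return ''.join(result)

print(strip_delimited('<sdgsa>one<as<>asfd<asdf>'))                # oneasfd
print(strip_delimited('<asdf>two<asjkdgai><iasj>three<fasdlojk>')) # twothree
```

A closing delimiter that appears without an opening one is kept as ordinary text in this version.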
