Python regex to find and regex to remove from list

I built this little RSS reader for myself a while ago, and I felt inspired to update it to exclude junk from the description tags. I'm testing it now, trying to remove &lt;(all content)&gt; from the description tags, and I'm having trouble getting this right.

So far my code looks something like this:

from re import findall
from Tkinter import *
from urllib import urlopen

disc = []
URL = 'http://feeds.sciencedaily.com/sciencedaily/matter_energy/engineering?format=xml'
O_W = urlopen(URL).read()

disc_ex = findall('<description>(.*)</description>',O_W)
for i in disc_ex:
    new_disc = i.replace(findall('&lt;(.*)&gt;',i),'')
    disc.extend([new_disc])

Before I added the new_disc line in an attempt to remove the rubbish text, my text would normally come through looking like this:

"Tailored DNA structures could find targeted cells and release their molecular payload selectively into the cells.&lt;img src="http://feeds.feedburner.com/~r/sciencedaily/matter_energy/engineering/~4/J1bTggGxFOY" height="1" width="1" alt=""/&gt;"

What I want is just the text without the rubbish, so essentially just:

"Tailored DNA structures could find targeted cells and release their molecular payload selectively into the cells."

Any suggestions for me?

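There are several solutions; BeautifulSoup is one example. To follow your idea and drop everything between the escaped '&lt;' and '&gt;' brackets, note that findall() returns a list, which str.replace() cannot accept as its first argument. Substituting directly with sub() avoids that, so just change the loop:

...
from re import findall, sub
...
for i in disc_ex:
    # sub() deletes each escaped tag; the non-greedy .*? stops at the first &gt;
    new_disc = sub(r'&lt;.*?&gt;', '', i)
    disc.append(new_disc)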
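As a self-contained illustration of the regex approach, here is a minimal sketch that runs on the sample description from the question instead of the live feed. The helper name strip_escaped_tags and the ESCAPED_TAG pattern are hypothetical names chosen for this example, and the sketch assumes the markup always arrives escaped as &lt;...&gt; rather than as literal angle brackets.

```python
import re

# Sample description copied from the question (escaped markup left as-is).
raw = ('Tailored DNA structures could find targeted cells and release their '
       'molecular payload selectively into the cells.'
       '&lt;img src="http://feeds.feedburner.com/~r/sciencedaily/'
       'matter_energy/engineering/~4/J1bTggGxFOY" height="1" width="1" '
       'alt=""/&gt;')

# Non-greedy match, so each escaped tag is removed on its own instead of
# swallowing everything between the first &lt; and the last &gt;.
ESCAPED_TAG = re.compile(r'&lt;.*?&gt;', re.DOTALL)


def strip_escaped_tags(text):
    """Remove every '&lt;...&gt;' run and trim surrounding whitespace."""
    return ESCAPED_TAG.sub('', text).strip()


clean = strip_escaped_tags(raw)
print(clean)
```

The non-greedy quantifier matters once a description contains more than one escaped tag: a greedy .* would also delete the ordinary text sitting between two tags.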
