The code I am working on retrieves a list from an HTML page with two fields, URL and title.
Every URL starts with /URL...
I need to prepend "http://website.com" to every URL value returned by re.findall.
The code so far is this:
soup=bs(html)
tag=soup.find('div',{'class':'item'})
reg=re.compile('<a href="(.+?)" rel=".+?" title="(.+?)"')
links=re.findall(reg,str(tag))
# (prepend "http://website.com" to the href "(.+?)" field here)
return links
Try:
for link in tag.find_all('a'):
    link['href'] = 'http://website.com' + link['href']
Then use one of these output methods:
return str(soup)
gets you the document after the changes are applied.
return tag.find_all('a')
gets you all the link elements.
return [str(i) for i in tag.find_all('a')]
gets you all the link elements converted to strings.
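Since the question asked for URL and title pairs, here is a minimal sketch that builds them directly from the parsed tags instead of from a regex. The sample markup, the `extract_links` helper, and the `BASE_URL` constant are assumptions made for illustration; only the `div` with class `item` and the `href`/`title` attributes come from the question.

```python
from bs4 import BeautifulSoup

BASE_URL = "http://website.com"

# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<div class="item">
  <a href="/URL/first" rel="bookmark" title="First title">First</a>
  <a href="/URL/second" rel="bookmark" title="Second title">Second</a>
</div>
"""


def extract_links(html, base_url=BASE_URL):
    """Return (absolute_url, title) tuples for links in the first div.item."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("div", {"class": "item"})
    if tag is None:
        # No matching container on the page.
        return []
    # href=True and title=True skip anchors that lack either attribute.
    return [
        (base_url + a["href"], a["title"])
        for a in tag.find_all("a", href=True, title=True)
    ]


links = extract_links(SAMPLE_HTML)
print(links)
```

This returns the same shape of data as the original re.findall call (a list of 2-tuples), so the calling code should not need to change.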
Also, don't try to parse HTML with a regex when you already have a working HTML parser.
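Plain string concatenation works as long as every href is root-relative, as the question states. If some links might already be absolute, the standard library's `urljoin` is a safer way to build the full URL. The `absolutize` helper and the example hrefs below are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "http://website.com"


def absolutize(href, base_url=BASE_URL):
    """Resolve href against base_url; absolute hrefs are returned unchanged."""
    return urljoin(base_url, href)


print(absolutize("/URL/page"))                  # root-relative link
print(absolutize("http://other.example/page"))  # already absolute
```

In the loop above, this would replace the concatenation with `link['href'] = absolutize(link['href'])`.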