Recently I've been working on a scraping project. I'm fairly new to it and have managed to do almost everything, but I'm stuck on one small issue. I captured every line of a news article like this:
lines=bs.find('div',{'class':'Text'}).find_all('div')
But for some reason, some lines contain an h2 tag and a br tag, like this one:
<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text
So if I run .text on that snippet I get "Header2Paragraph text". I already have the "Header2" text stored in another line, so I want to delete this second occurrence.
I managed to isolate those lines doing this:
for n, t in enumerate(lines):
    if t.find('h2') is not None and t.find('br') is not None:
        print('\n', n, ':', t)
But I don't know how to remove the text associated with the h2 tag, so that those lines become "Paragraph text" instead of "Header2Paragraph text". What can I do? Thanks
Use .get_text(separator=' ') instead of .text and you get the text joined with a space: "Header2 Paragraph text".
You can also use a different character, e.g. "|": .get_text(separator='|') gives "Header2|Paragraph text".
You can then call split("|") on that string to get the list ["Header2", "Paragraph text"] and keep the last element.
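A minimal sketch of the separator approach, assuming the snippet from the question is closed with the missing </div> tags and parsed with the built-in html.parser. Note that the keyword argument in BeautifulSoup is called separator; the names html, outer, parts and paragraph are only illustrative.

```python
from bs4 import BeautifulSoup

# Assumed markup: the question's snippet with its closing tags added.
html = "<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text</div></div>"
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div")

# Join the text nodes with "|" so the header and paragraph can be told apart.
joined = outer.get_text(separator="|", strip=True)

# Split on the same character and keep only the last piece.
parts = joined.split("|")
paragraph = parts[-1]
print(paragraph)
```

This only works if the separator character never appears in the article text itself, so it may be safer to pick something unlikely to occur there.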
Alternatively, find the h2 and call clear() (which empties the tag) or extract() (which removes the tag from the tree) on it. If you then get the text from the whole div, it no longer contains "Header2".
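As a rough illustration of removing the tag before reading the text, here is a sketch that again assumes the question's snippet with its closing tags added. The variable names are hypothetical, and extract() modifies the parsed tree in place, so the h2 is gone from soup afterwards.

```python
from bs4 import BeautifulSoup

# Assumed markup: the question's snippet with its closing tags added.
html = "<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text</div></div>"
soup = BeautifulSoup(html, "html.parser")
line = soup.find("div")

# Remove the h2 (and its text) from the tree, if the line has one.
header = line.find("h2")
if header is not None:
    header.extract()

# The remaining text no longer includes the header.
text = line.get_text()
print(text)
```

In the loop from the question, the same two steps could be applied to each t that contains both an h2 and a br.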
Documentation: get_text() , clear() , extract()