Extract tag text from line BeautifulSoup

Question

Recently I've been working on a scraping project. I'm fairly new to it and have managed almost everything, but I'm stuck on one small issue. I captured every line of a news article like this:

lines=bs.find('div',{'class':'Text'}).find_all('div')

But for some reason, some lines contain an h2 tag and a br tag, like this one:

 <div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text

So if I run .text on that snippet I get "Header2Paragraph text". I already have the "Header2" text stored in another line, so I want to delete this second occurrence.

I managed to isolate those lines doing this:

for n,t in enumerate(lines):
    if t.find('h2') is not None and t.find('br') is not None:
        print('\n',n,':',t)

But I don't know how to remove the text associated with the h2 tag, so that those lines become "Paragraph text" instead of "Header2Paragraph text". What can I do? Thanks.

Answer 1

Use .get_text(separator=' ') instead of .text and you get the text joined with a space: "Header2 Paragraph text".

You can also use a different character, e.g. "|": .get_text(separator='|') gives "Header2|Paragraph text".

Then you can call .split('|') on that string to get the list ["Header2", "Paragraph text"] and keep the last element.
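As a minimal sketch of the separator approach, assuming the snippet from the question with its closing tags filled in (the original is cut off after "Paragraph text") and assuming "|" never appears in the article text itself:

```python
from bs4 import BeautifulSoup

# Assumption: the question's snippet, with the closing tags added for illustration.
html = "<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text</div></div>"
line = BeautifulSoup(html, "html.parser").div

# Join every text fragment inside the div with "|" instead of nothing.
joined = line.get_text(separator="|")

# Split on the same character and keep only the last fragment.
parts = joined.split("|")
paragraph = parts[-1]

print(joined)
print(paragraph)
```

This only works cleanly when the paragraph is a single text fragment. If the paragraph contains nested tags such as links or bold text, it will be split into several pieces, and removing the h2 tag is the safer route.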

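You can also find the h2 and call clear() or extract() on it (clear() empties the tag's contents, extract() removes the tag from the tree and returns it). After that, getting the text of the whole div gives you the result without "Header2".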
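Removing the h2 before reading the text might look like the sketch below. It again assumes the question's snippet with the closing tags filled in for illustration. Note that extract() modifies the parsed tree in place.

```python
from bs4 import BeautifulSoup

# Assumption: the question's snippet, with the closing tags added for illustration.
html = "<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text</div></div>"
line = BeautifulSoup(html, "html.parser").div

# Look for the header inside this line; find() returns None when there is none.
header = line.find("h2")
if header is not None:
    # Detach the h2 (and its text) from the tree.
    header.extract()

# The remaining text no longer includes the header.
text = line.get_text()
print(text)
```

In the loop from the question, the same two steps could be applied to each t that contains both an h2 and a br before reading its text.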

Documentation: get_text(), clear(), extract()

solution1
0 2019-04-16 04:00:23
