
Extract tag text from line BeautifulSoup

I've recently been working on a scraping project. I'm fairly new to it and have managed almost everything, but I'm stuck on one small issue. I captured every line of a news article like this:

lines=bs.find('div',{'class':'Text'}).find_all('div')

But for some reason, some lines contain an h2 tag and a br tag, like this one:

 <div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text

So if I run .text on that snippet I get "Header2Paragraph text". I already have the "Header2" text stored in another line, so I want to delete this second occurrence.

I managed to isolate those lines like this:

for n,t in enumerate(lines):
    if t.find('h2') is not None and t.find('br') is not None:
        print('\n',n,':',t)

But I don't know how to remove the text associated with the h2 tag, so that those lines become "Paragraph text" instead of "Header2Paragraph text". What can I do? Thanks.

Use .get_text(separator=' ') instead of .text and you get the text joined with a space: "Header2 Paragraph text".

You can also use a different character, e.g. "|": .get_text(separator='|') gives "Header2|Paragraph text".

Then you can use split("|") to get the list ["Header2", "Paragraph text"] and keep the last element.
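The separator approach can be sketched roughly as follows. This is only an illustration: the helper name text_without_header is hypothetical, the closing </div> tags are added here so the snippet is complete, and it assumes the paragraph is a single text node that never contains the "|" character itself.

```python
from bs4 import BeautifulSoup

# Sample line modelled on the snippet in the question; the closing
# </div> tags are an assumption added to make the markup complete.
html = "<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text</div></div>"
line = BeautifulSoup(html, "html.parser").div


def text_without_header(tag, sep="|"):
    """Join the tag's text nodes with `sep`, then keep only the last piece.

    Hypothetical helper: assumes `sep` never occurs in the real text and
    that the paragraph is the last text node inside `tag`.
    """
    parts = tag.get_text(separator=sep).split(sep)
    return parts[-1]


print(text_without_header(line))
```

If the paragraph can contain inline tags such as <b> or <a>, it will be split into several pieces, so keeping only the last element may drop text; in that case removing the h2 from the tree is probably the safer option.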


You can also find the h2 tag and call clear() or extract() on it; afterwards, the text of the whole div no longer contains "Header2".


Documentation: get_text() , clear() , extract()
