
Extract tag text from line BeautifulSoup

Recently I've been working on a scraping project. I'm fairly new to it and have managed almost everything, but I'm stuck on one small issue. I captured every line of a news article like this:

lines = bs.find('div', {'class': 'Text'}).find_all('div')

But for some reason, some lines contain an h2 tag and a br tag, like this one:

 <div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text

So if I run .text on that snippet I get "Header2Paragraph text". I already have the "Header2" text stored in another line, so I want to delete this second occurrence.

I managed to isolate those lines like this:

for n,t in enumerate(lines):
    if t.find('h2') is not None and t.find('br') is not None:
        print('\n',n,':',t)

But I don't know how to erase the text associated with the h2 tag, so that those lines become "Paragraph text" instead of "Header2Paragraph text". What can I do? Thanks

Use .get_text(separator=' ') instead of .text and you get the text with a space between the pieces: "Header2 Paragraph text".

You can also use a different character, e.g. "|": .get_text(separator='|') gives "Header2|Paragraph text".

Then you can use split("|") to get the list ["Header2", "Paragraph text"] and keep the last element.
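A minimal sketch of the separator approach, assuming the snippet from the question with its closing div tags filled in and the built-in "html.parser" backend. The variable names (html, line, joined, paragraph) are only for illustration:

```python
from bs4 import BeautifulSoup

# Assumed sample: the question's snippet with the closing tags added.
html = "<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text</div></div>"
line = BeautifulSoup(html, "html.parser").div

# Join every text fragment with "|" instead of gluing them together.
joined = line.get_text(separator="|")

# Keep only the last fragment, i.e. the paragraph after the header.
paragraph = joined.split("|")[-1]

print(joined)
print(paragraph)
```

This only works cleanly if the paragraph itself has no nested tags and does not contain the separator character, otherwise split() returns more pieces than expected.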


You can also find the h2 and call clear() or extract() on that tag; afterwards, getting the text from the whole div gives you the result without "Header2".
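A rough sketch of the extract() variant, again assuming the question's snippet with closing tags added. The helper name text_without_h2 is hypothetical, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

def text_without_h2(line):
    """Return the text of `line` after removing its first h2, if any.

    Hypothetical helper for illustration; it modifies `line` in place.
    """
    h2 = line.find("h2")
    if h2 is not None:
        h2.extract()  # detach the h2 (and its text) from the tree
    return line.get_text()

# Assumed sample: the question's snippet with the closing tags added.
html = "<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text</div></div>"
line = BeautifulSoup(html, "html.parser").div

result = text_without_h2(line)
print(result)
```

Note that extract() changes the parsed tree, so if the header is still needed elsewhere, read it before calling this. Using h2.clear() instead would leave an empty h2 tag in place, which gives the same text here.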


Documentation: get_text(), clear(), extract()
