Extract tag text from line BeautifulSoup
Recently I've been working on a scraping project. I'm kinda new to it, but I've managed to do almost everything; I'm just having trouble with one little issue. I captured every line of a news article doing this:
lines = bs.find('div', {'class': 'Text'}).find_all('div')
But for some reason, some lines contain an h2 tag and a br tag, like this one:
<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text
So if I run .text on that snippet I get "Header2Paragraph text". I've already got the "Header2" text stored in another line, so I want to delete this second appearance.
I managed to isolate those lines doing this:
for n, t in enumerate(lines):
    if t.find('h2') is not None and t.find('br') is not None:
        print('\n', n, ':', t)
But I don't know how to erase the text associated with the h2 tag, so those lines become "Paragraph text" instead of "Header2Paragraph text". What can I do? Thanks
Use .get_text(separator=' ') instead of .text and you get the text with a space: "Header2 Paragraph text".
You can also use a different character, e.g. "|" - .get_text(separator='|') - and you get "Header2|Paragraph text".
And then you can use split("|") to get the list ["Header2", "Paragraph text"] and keep the last element.
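Putting the two steps together, a minimal sketch using the HTML snippet from the question (with a closing tag added so it parses cleanly):

```python
from bs4 import BeautifulSoup

html = '<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text</div></div>'
line = BeautifulSoup(html, 'html.parser')

# separator='|' inserts "|" between the text of each tag,
# so the header and the paragraph no longer run together.
text = line.get_text(separator='|')
print(text)                 # Header2|Paragraph text

# Split on the separator and keep only the last element.
print(text.split('|')[-1])  # Paragraph text
```

Pick a separator character that cannot appear inside the article text itself, otherwise the split will cut the paragraph in the wrong place.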
You can also find the h2 and clear() or extract() this tag, and later you can get the text from the whole div - then you get it without "Header2".
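This second approach can be sketched like so, again on the snippet from the question:

```python
from bs4 import BeautifulSoup

html = '<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text</div></div>'
line = BeautifulSoup(html, 'html.parser')

# extract() removes the <h2> tag (and its text) from the tree entirely;
# clear() would instead keep the empty <h2></h2> but drop its contents.
h2 = line.find('h2')
if h2 is not None:
    h2.extract()

print(line.get_text())  # Paragraph text
```

Note that both extract() and clear() mutate the parsed tree in place, so run them only after you have saved the "Header2" text you want to keep.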
Documentation: get_text(), clear(), extract()