Including text with <strong> and <em> tags when scraping html using lxml & requests?
I'm scraping text from a webpage using lxml and requests. All of the text that I want is under <p> tags. When I use contents = tree.xpath('//*[@id="storytext"]/p/text()'), contents only includes text that is not in <em> or <strong> tags. But when I use contents = tree.xpath('//*[@id="storytext"]/p/text() | //*[@id="storytext"]/p/strong/text() | //*[@id="storytext"]/p/em/text()'), the text in <em> and <strong> tags is separated from the rest of the text in that <p> tag.
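To make the behaviour described above concrete, here is a minimal sketch on a shortened paragraph (the `snippet` string and the variable names `direct` and `union` are assumptions for illustration, not part of the original page). It shows that `text()` only selects text nodes that are direct children of `<p>`, and that the union expression returns each text node as a separate string instead of one string per paragraph.

```python
from lxml import html

# Shortened sample markup, assumed for illustration only.
snippet = (
    '<div id="storytext"><p>"Go <em>away!</em>" '
    "She didn't even <em>hear</em> him.</p></div>"
)
tree = html.fromstring(snippet)

# text() matches only the text nodes sitting directly inside <p>,
# so anything wrapped in <em> or <strong> is skipped.
direct = tree.xpath('//*[@id="storytext"]/p/text()')

# The union also picks up the text inside <em>, but every text node
# comes back as its own list item, and the tags themselves are gone.
union = tree.xpath(
    '//*[@id="storytext"]/p/text() | //*[@id="storytext"]/p/em/text()'
)

print(len(direct), direct)
print(len(union), union)
```

Neither result keeps a paragraph together or preserves the inline tags, which is why selecting the `<p>` elements themselves and serialising them is probably the more promising route.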
I would like to:
scrape each <p> as a unit, including all its text (whether plain or <em> or <strong>), and
keep the <em> and <strong> tags so that I can use them later to format the text I've scraped.
Sample html:
<div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.</p></div>
Desired output:
"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.
If you only need what is between the <p> tags, you could use bs4 and replace to remove the opening and closing p tags:
from bs4 import BeautifulSoup as bs
html = '''
<div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.</p></div>
'''
soup = bs(html, 'lxml')
for item in soup.select('p'):
    # str(item) serialises the whole <p> element; strip its own open/close tags
    print(str(item).replace('<p>', '').replace('</p>', ''))
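One limitation of the string replace is that it only matches a bare `<p>`, so a paragraph with attributes (for example a `class`) would keep its opening tag. A possible alternative, sketched here under assumptions, is Beautiful Soup's `decode_contents()`, which renders only what is inside the tag. The shortened `html` string, the `class="intro"` attribute and the `#storytext p` selector are made up for illustration, and `'html.parser'` is swapped in so the sketch has no extra dependency; `'lxml'` should work the same way.

```python
from bs4 import BeautifulSoup

# Shortened sample; the class attribute is added only to show that
# attributes on <p> do not matter with this approach.
html = (
    '<div id="storytext"><p class="intro">"Go <em>away!</em>" '
    "She didn't even <em>hear</em> him.</p></div>"
)
soup = BeautifulSoup(html, 'html.parser')

# decode_contents() returns the markup inside each <p>,
# keeping <em>/<strong> but not the <p> tag itself.
paragraphs = [p.decode_contents() for p in soup.select('#storytext p')]

for text in paragraphs:
    print(text)
```

Each list item is one paragraph as a single string, with the inline tags left in place for later formatting.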
Using requests to source the html:
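import requests
from bs4 import BeautifulSoup as bs
# 'url' is a placeholder for the address of the page being scraped
r = requests.get('url')
soup = bs(r.content, 'lxml')
for item in soup.select('p'):
    # strip the <p> open/close tags, keeping the inline markup inside
    print(str(item).replace('<p>', '').replace('</p>', ''))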