Including text with <strong> and <em> tags when scraping html using lxml & requests?

I'm scraping text from a webpage using lxml and requests. All of the text that I want is under <p> tags. When I use contents = tree.xpath('//*[@id="storytext"]/p/text()'), contents only includes text that is not in <em> or <strong> tags. But when I use contents = tree.xpath('//*[@id="storytext"]/p/text() | //*[@id="storytext"]/p/strong/text() | //*[@id="storytext"]/p/em/text()'), the text in <em> and <strong> tags is separated from the rest of the text in that <p> tag.
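The behaviour described above can be reproduced with a short sketch. It assumes lxml is installed and uses a shortened, made-up version of the sample paragraph; the variable names are illustrative only.

```python
from lxml import html

# Shortened stand-in for the real page (assumption: same structure as the sample)
sample = '<div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out.</p></div>'
tree = html.fromstring(sample)

# text() selects only the text nodes that are direct children of <p>,
# so the text inside <em> is missing
direct_text = tree.xpath('//*[@id="storytext"]/p/text()')

# Adding the <em>/<strong> text nodes brings the text back,
# but each text node is returned as its own list item
all_text = tree.xpath(
    '//*[@id="storytext"]/p/text()'
    ' | //*[@id="storytext"]/p/strong/text()'
    ' | //*[@id="storytext"]/p/em/text()'
)

print(direct_text)
print(all_text)
```

In both cases the result is a flat list of strings, so neither the paragraph boundaries nor the tags survive.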

I would like to:

  1. scrape each <p> as a unit, including all its text (whether plain or <em> or <strong>), and

  2. keep the <em> and <strong> tags so that I can use them later to format the text I've scraped.

Sample html: <div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.</p></div>

Desired output: "Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.

If you only want what is between the <p> tags, you could use bs4 and str.replace to remove the opening and closing p tags:

from bs4 import BeautifulSoup as bs

html = '''
<div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.</p></div>
'''

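# Parse with the lxml parser
soup = bs(html, 'lxml')

for item in soup.select('p'):
    # str(item) serialises the whole <p> element, inline tags included;
    # the replace calls then strip the bare <p> and </p> wrappers
    print(str(item).replace('<p>', '').replace('</p>', ''))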

Using requests to fetch the html:

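import requests
from bs4 import BeautifulSoup as bs

# 'url' is a placeholder; replace it with the address of the page to scrape
r = requests.get('url')
soup = bs(r.content, 'lxml')
for item in soup.select('p'):
    # Serialise each <p> and strip its bare opening and closing tags
    print(str(item).replace('<p>', '').replace('</p>', ''))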
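One caveat with the replace approach is that it only matches a bare <p>, so a paragraph with attributes (for example a class) would keep its opening tag. A possible alternative, sketched here with a made-up sample string, is bs4's decode_contents(), which returns the markup inside a tag without the tag itself. The '#storytext > p' selector and the class attribute are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample: the class attribute would defeat replace('<p>', '')
sample = '<div id="storytext"><p class="intro">"Go <em>away!</em>" He scowled.</p></div>'

# The built-in html.parser is used here; 'lxml' works too if it is installed
soup = BeautifulSoup(sample, 'html.parser')

# decode_contents() renders the children of each <p> as a string,
# keeping inline tags such as <em> and <strong>
paragraphs = [p.decode_contents() for p in soup.select('#storytext > p')]
print(paragraphs[0])
```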
