Including text with <strong> and <em> tags when scraping HTML using lxml and requests?

Question

I'm scraping text from a webpage using lxml and requests. All of the text that I want is under <p> tags. When I use contents = tree.xpath('//*[@id="storytext"]/p/text()'), contents only includes text that is not in <strong> or <em> tags. But when I use contents = tree.xpath('//*[@id="storytext"]/p/text() | //*[@id="storytext"]/p/strong/text() | //*[@id="storytext"]/p/em/text()'), the text in <strong> and <em> tags is separated from the rest of the text in that <p> tag.

I would like to:

scrape each <p> as a unit, including all its text (whether plain or <strong> or <em>), and
keep the <strong> and <em> tags so that I can use them later to format the text I've scraped.

Sample html: <div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.</p></div>

Desired output: "Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.

Answer 1

If you only need what is between the <p> tags, you could use bs4 and str.replace to remove the opening and closing p tags:

from bs4 import BeautifulSoup as bs

html = '''
<div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.</p></div>
'''

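# Parse the markup with the lxml parser
soup = bs(html, 'lxml')

# Print each paragraph's markup without its enclosing <p>...</p> tags;
# this assumes the <p> tags carry no attributes
for item in soup.select('p'):
    print(str(item).replace('<p>', '').replace('</p>', ''))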
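
One caveat with the replace approach: it only matches a bare <p>, so a paragraph written as <p class="first"> would keep its opening tag. Below is a minimal sketch of an alternative that asks BeautifulSoup for each paragraph's inner markup via decode_contents(). The sample markup and its class attribute are made up for illustration, and html.parser is used here only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the class attribute is added to show a case
# where replacing the literal string '<p>' would not match
sample = (
    '<div id="storytext">'
    '<p class="first">"Go <em>away!</em>" His voice was drowned out by the mixer.</p>'
    '<p>She didn\'t even <strong>hear</strong> him.</p>'
    '</div>'
)

soup = BeautifulSoup(sample, "html.parser")

# decode_contents() returns the markup inside a tag, excluding the tag itself
paragraphs = [p.decode_contents() for p in soup.select("#storytext p")]

for paragraph in paragraphs:
    print(paragraph)
```

Each entry in paragraphs is one <p> as a single string, with its <em> and <strong> tags still in place.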

Using requests to fetch the html:

import requests
from bs4 import BeautifulSoup as bs

# 'url' is a placeholder for the address of the page being scraped
r = requests.get('url')
soup = bs(r.content, 'lxml')
for item in soup.select('p'):
    print(str(item).replace('<p>', '').replace('</p>', ''))

