Scraping text together with emoji in Beautiful Soup

Question

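I am trying to scrape a page with Python and Beautiful Soup (bs4).

I want to keep the text of the page's <p> elements together with the emoji that appear inside that text.

My first attempt was: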

import urllib.request
from bs4 import BeautifulSoup

urlobject = urllib.request.urlopen("https://example.com")

soup = BeautifulSoup(urlobject, "lxml")

result= list(map(lambda e: e.getText(), soup.find_all("p", {"class": "text"})))

But the result does not include the emoji. I then tried removing .getText() and keeping the elements themselves:

result= list(map(lambda e: e, soup.find_all("p", {"class": "text"})))

This made me realize that the emoji on this site are stored in the alt attribute of img tags:

<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>

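So what I want to do is:

call getText() on each p with class text
but also get the alt of every img with class=emoji

and keep the text and the emoji together as one sentence.

Is there a way to do this?

Any help would be appreciated.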

Answer 1

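Something like the following, which returns a tuple of the target data for each p? I just used your example p element as the input for this test: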

from bs4 import BeautifulSoup

s = """
<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
"""

soup = BeautifulSoup(s, 'lxml')

elements = soup.find_all('p', {'class': 'text'})
print(list(map(lambda e: (e.getText(), e.find('img', {'class': 'emoji'})['alt']), elements)))

Result:

[('I love the night!', '🌟'), ('I love the night!', '🌟')]
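Note that find() returns None for a paragraph that contains no emoji image, so the ['alt'] subscript above would raise a TypeError on such a paragraph, and only the first emoji of each paragraph is picked up. A minimal sketch of a more tolerant variant, assuming you want every emoji of a paragraph collected into a list (the helper name text_and_emojis and the second sample paragraph are made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the day<span>!</span></p>
"""


def text_and_emojis(p):
    """Return (text, [alt of every img.emoji inside p]). Hypothetical helper."""
    # img.get("alt", "") avoids a KeyError if an image has no alt attribute
    emojis = [img.get("alt", "") for img in p.find_all("img", class_="emoji")]
    return p.get_text(), emojis


# "html.parser" is used only so the sketch needs no extra dependency;
# "lxml", as in the answer above, should work the same way for this input.
soup = BeautifulSoup(html, "html.parser")
pairs = [text_and_emojis(p) for p in soup.find_all("p", class_="text")]
print(pairs)
```

A paragraph without emoji then yields an empty list instead of an exception.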

Answer 2

If img.emoji is optional, you can try the following, which keeps each emoji in its original position:

from bs4 import BeautifulSoup

urlobject = '''<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the day<span>!</span></p>
<p class="text">I love the music<img alt="🌟" class="emoji" src="etc"/> <img alt="🔊" class="emoji" src="etc"/><span>!</span></p>
'''

# Parse the sample markup (soup has to be created before it is searched)
soup = BeautifulSoup(urlobject, 'lxml')

result = []
for p in soup.find_all('p', {'class': 'text'}):
    # Swap every emoji image for its alt text, in place, so that
    # getText() sees the emoji at the position of the image
    for em in p.select('img.emoji'):
        em.replace_with(em['alt'])
    result.append(p.getText())

print(result)

Result:

['I love the night🌟!', 'I love the day!', 'I love the music🌟 🔊!']
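The loop above rewrites the parsed tree, so the img tags are gone afterwards. If you would rather leave the tree untouched, one possible sketch is to walk each paragraph's descendants in document order and collect text nodes and emoji alt texts as you go. The helper name text_with_emoji and the third sample paragraph (with the emoji nested inside a span) are assumptions made for illustration, not part of the original page:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = """
<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the day<span>!</span></p>
<p class="text">I love the music<span><img alt="🔊" class="emoji" src="etc"/></span><span>!</span></p>
"""


def text_with_emoji(p):
    """Join text nodes and img.emoji alt texts in document order. Hypothetical helper."""
    parts = []
    for node in p.descendants:
        if isinstance(node, NavigableString):
            # Plain text anywhere inside the paragraph
            parts.append(str(node))
        elif (isinstance(node, Tag) and node.name == "img"
              and "emoji" in node.get("class", [])):
            # Emoji image: use its alt text in place of the tag
            parts.append(node.get("alt", ""))
    return "".join(parts)


# "html.parser" keeps the sketch dependency-free; "lxml" should behave the same here
soup = BeautifulSoup(html, "html.parser")
sentences = [text_with_emoji(p) for p in soup.find_all("p", class_="text")]
print(sentences)
```

Because nothing is replaced, the same soup can still be queried for the img tags afterwards.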

Answer 1: 1, 2018-12-26 20:13:00

Answer 2: 1, accepted, 2018-12-27 00:24:36
