Scraping text together with emojis in Beautiful Soup

Question

I am trying to scrape a page using Python and Beautiful Soup (bs4).

I want to keep the text in the <p> elements of the page along with the emojis in that text.

My first attempt was:

import urllib
import urllib.request
from bs4 import BeautifulSoup

urlobject = urllib.request.urlopen("https://example.com")

soup = BeautifulSoup(urlobject, "lxml")

result= list(map(lambda e: e.getText(), soup.find_all("p", {"class": "text"})))

But this doesn't include the emojis. I then tried removing .getText() and keeping just:

result= list(map(lambda e: e, soup.find_all("p", {"class": "text"})))

That made me realize the emojis on this website are in the alt attribute of img tags:

<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>

So what I want to do is:

Call getText() on each p with class text
But also get the alt of each img with class=emoji

And keep the text and the emojis together as one sentence.

Is there any way to do this?

Any help would be appreciated.

Answer 1

How about the following, which returns a tuple of the targeted data for each p? I just used your example p element twice as the input for this test:

from bs4 import BeautifulSoup

s = """
<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
"""

soup = BeautifulSoup(s, 'lxml')

elements = soup.find_all('p', {'class': 'text'})
print(list(map(lambda e: (e.getText(), e.find('img', {'class': 'emoji'})['alt']), elements)))

Result:

[('I love the night!', '🌟'), ('I love the night!', '🌟')]
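
One caveat with the snippet above: find() returns None when a paragraph has no emoji image, so indexing ['alt'] would raise a TypeError on such a paragraph. The following is a minimal sketch of a more tolerant variant, assuming the same markup as in the question. It uses find_all() so that paragraphs with zero or several emojis both work. The second sample paragraph and the choice of the built-in html.parser (instead of lxml) are assumptions made only to keep the example self-contained.

```python
from bs4 import BeautifulSoup

# Sample markup: the second paragraph is an assumed case with no emoji image
html = """
<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the day<span>!</span></p>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for p in soup.find_all("p", class_="text"):
    # find_all() returns an empty list when the paragraph has no emoji image
    emojis = [img.get("alt", "") for img in p.find_all("img", class_="emoji")]
    pairs.append((p.get_text(), emojis))

print(pairs)
```

This still returns the text and the emojis separately, so the position of each emoji within the sentence is not preserved.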

Answer 2

If the img.emoji elements are optional, you can try the code below, which also preserves the emoji position:

from bs4 import BeautifulSoup

# Sample markup standing in for the fetched page
urlobject = '''<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the day<span>!</span></p>
<p class="text">I love the music<img alt="🌟" class="emoji" src="etc"/> <img alt="🔊" class="emoji" src="etc"/><span>!</span></p>
'''

soup = BeautifulSoup(urlobject, 'lxml')

result = []
for p in soup.find_all('p', {'class': 'text'}):
    # Replace each emoji image in place with its alt text so its position is kept
    for em in p.select('img.emoji'):
        em.replace_with(em['alt'])
    result.append(p.getText())

print(result)

Results:

['I love the night🌟!', 'I love the day!', 'I love the music🌟 🔊!']
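
The approach above modifies the parsed tree, which may be unwanted if the soup is reused later. As a rough alternative sketch, the same sentence can be assembled without mutating anything by walking the descendants of each p and collecting text nodes and emoji alt values in document order. The helper name text_with_emojis is hypothetical, and html.parser is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

html = '''<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the day<span>!</span></p>
<p class="text">I love the music<img alt="🌟" class="emoji" src="etc"/> <img alt="🔊" class="emoji" src="etc"/><span>!</span></p>
'''

def text_with_emojis(p):
    """Hypothetical helper: join text nodes and emoji alt text in document order."""
    parts = []
    for node in p.descendants:
        if isinstance(node, NavigableString) and not isinstance(node, Comment):
            # Plain text node (HTML comments are skipped)
            parts.append(str(node))
        elif isinstance(node, Tag) and node.name == "img" and "emoji" in node.get("class", []):
            # Emoji image: use its alt text in place of the tag
            parts.append(node.get("alt", ""))
    return "".join(parts)

soup = BeautifulSoup(html, "html.parser")
result = [text_with_emojis(p) for p in soup.find_all("p", class_="text")]
print(result)
```

Because nothing is replaced, the img tags remain in the soup afterwards and the function can be called repeatedly on the same elements.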

2 solutions

Solution 1
1 2018-12-26 20:13:00

Solution 2
1 Accepted 2018-12-27 00:24:36
