Scrape emojis with text in Beautiful Soup
I am trying to scrape a page using Python and Beautiful Soup (bs4).
I want to keep the text in the <p> elements of the page, along with the emojis in that text.
My first attempt was:
import urllib.request
from bs4 import BeautifulSoup

urlobject = urllib.request.urlopen("https://example.com")
soup = BeautifulSoup(urlobject, "lxml")
result = list(map(lambda e: e.getText(), soup.find_all("p", {"class": "text"})))
But this doesn't include the emojis. I then tried removing .getText() and keeping just the elements:
result = list(map(lambda e: e, soup.find_all("p", {"class": "text"})))
That made me realize the emojis on this website are stored in the alt attribute of img tags:
<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
So what I want to do is: find every p with class text, and for each p call getText() but also get the alt of each img with class=emoji.
And keep the text and the emojis together as one sentence.
Is there any way to do this?
Any help would be appreciated.
How about the following, which returns a tuple of the targeted data for each p? I just used your example p element twice as the input for this test:
from bs4 import BeautifulSoup
s = """
<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
"""
soup = BeautifulSoup(s, 'lxml')
elements = soup.find_all('p', {'class': 'text'})
print(list(map(lambda e: (e.getText(), e.find('img', {'class': 'emoji'})['alt']), elements)))
Result:
[('I love the night!', '🌟'), ('I love the night!', '🌟')]
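One caveat with the snippet above: find() returns None when a p has no img.emoji, so indexing ['alt'] on it would raise a TypeError, and only the first emoji in each p is picked up. A minimal sketch that tolerates zero or several emojis per paragraph might look like the following. The helper name text_and_emojis and the sample HTML are illustrative assumptions, and html.parser is swapped in for lxml only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Sample input (assumed for illustration): one <p> with an emoji, one without.
html = """
<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the day<span>!</span></p>
"""

# "html.parser" is the built-in parser; the original answer uses "lxml".
soup = BeautifulSoup(html, "html.parser")


def text_and_emojis(p):
    """Hypothetical helper: return (text, [alt, ...]) for one <p> element.

    The list is empty when the paragraph contains no img.emoji.
    """
    alts = [img.get("alt", "") for img in p.select("img.emoji")]
    return p.get_text(), alts


pairs = [text_and_emojis(p) for p in soup.find_all("p", class_="text")]
print(pairs)
```

This still returns the text and the emojis separately, so the position of each emoji within the sentence is lost.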
If the img.emoji elements are optional, you can try the code below, which also preserves the position of each emoji:
from bs4 import BeautifulSoup

urlobject = '''<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the day<span>!</span></p>
<p class="text">I love the music<img alt="🌟" class="emoji" src="etc"/> <img alt="🔊" class="emoji" src="etc"/><span>!</span></p>
'''
# Parse the sample markup (the original snippet used soup without building it).
soup = BeautifulSoup(urlobject, 'lxml')
result = []
for p in soup.find_all('p', {'class': 'text'}):
    # Swap each emoji <img> for its alt text in place, so its position is kept.
    for em in p.select('img.emoji'):
        em.replace_with(em['alt'])
    # Append once per <p>, whether or not it contained any emojis.
    result.append(p.getText())
print(result)
Results:
['I love the night🌟!', 'I love the day!', 'I love the music🌟 🔊!']
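The approach above modifies the parsed tree, since each img is replaced by a string. If the soup needs to stay intact for later queries, one alternative is to walk each paragraph's descendants in document order and collect text nodes and emoji alt values as they appear. This is only a sketch under stated assumptions: the helper name text_with_emojis is hypothetical, and html.parser is used instead of lxml to avoid an extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

html = '''<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the day<span>!</span></p>
<p class="text">I love the music<img alt="🌟" class="emoji" src="etc"/> <img alt="🔊" class="emoji" src="etc"/><span>!</span></p>
'''

soup = BeautifulSoup(html, "html.parser")


def text_with_emojis(p):
    """Hypothetical helper: join text nodes and emoji alt text in document order.

    The tree is only read, never modified.
    """
    parts = []
    for node in p.descendants:
        if isinstance(node, Comment):
            # Skip HTML comments; they are not visible text.
            continue
        if isinstance(node, NavigableString):
            parts.append(str(node))
        elif isinstance(node, Tag) and node.name == "img" and "emoji" in node.get("class", []):
            parts.append(node.get("alt", ""))
    return "".join(parts)


results = [text_with_emojis(p) for p in soup.find_all("p", class_="text")]
print(results)
# After the call, the <img> tags are still present in the soup.
```

The trade-off is a few more lines of code in exchange for leaving the document untouched, which matters if the same soup is scraped for other data afterwards.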