Scrape emojis with text in Beautiful Soup

I am trying to scrape a page using Python and Beautiful Soup (bs4).

I want to keep the text in the <p> elements of the page, along with the emojis in that text.

My first attempt was:

import urllib.request
from bs4 import BeautifulSoup

# Fetch the page and parse it with the lxml parser
urlobject = urllib.request.urlopen("https://example.com")
soup = BeautifulSoup(urlobject, "lxml")

# Collect the text of every <p class="text"> element
result = [e.getText() for e in soup.find_all("p", {"class": "text"})]

But this doesn't include the emojis. I then tried removing .getText() and keeping just the elements:

result = list(soup.find_all("p", {"class": "text"}))

This made me realize that the emojis on this website are in the alt attribute of img tags:

<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>

So what I want to do is:

  • getText() for each p with class text
  • But also get the alt of each img with class=emoji

And keep the text and the emojis together as one sentence.

Is there any way to do this?

Any help would be appreciated.

How about the following, which returns a tuple of the targeted data for each p? I just used your example p element twice as the input for this test:

from bs4 import BeautifulSoup

s = """
<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
"""

soup = BeautifulSoup(s, 'lxml')

elements = soup.find_all('p', {'class': 'text'})
print(list(map(lambda e: (e.getText(), e.find('img', {'class': 'emoji'})['alt']), elements)))

Result:

[('I love the night!', '🌟'), ('I love the night!', '🌟')]
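Note that e.find(...) returns None when a p has no emoji image, so indexing ['alt'] on it would raise a TypeError. A minimal sketch of a more tolerant variant follows; the helper name text_and_emojis, the second sample paragraph, and the use of the built-in "html.parser" (instead of lxml) are assumptions made for illustration only:

```python
from bs4 import BeautifulSoup

html = """
<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the day<span>!</span></p>
"""

# "html.parser" ships with Python; "lxml" should behave the same for this input
soup = BeautifulSoup(html, "html.parser")


def text_and_emojis(p):
    """Hypothetical helper: return (text, [alt, ...]) for one <p> element.

    The list is simply empty when the paragraph has no emoji images.
    """
    alts = [img.get("alt", "") for img in p.find_all("img", class_="emoji")]
    return p.get_text(), alts


pairs = [text_and_emojis(p) for p in soup.find_all("p", class_="text")]
print(pairs)
```

This still returns the text and the emojis separately, so the position of each emoji within the sentence is lost.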

If the img.emoji elements are optional, you can try the following, which also preserves the position of each emoji:

from bs4 import BeautifulSoup

html = '''<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the day<span>!</span></p>
<p class="text">I love the music<img alt="🌟" class="emoji" src="etc"/> <img alt="🔊" class="emoji" src="etc"/><span>!</span></p>
'''

soup = BeautifulSoup(html, 'lxml')

result = []
for p in soup.find_all('p', {'class': 'text'}):
    # Swap each emoji image for its alt text in place, so its position is kept
    for em in p.select('img.emoji'):
        em.replace_with(em['alt'])
    result.append(p.getText())

print(result)

Results:

['I love the night🌟!', 'I love the day!', 'I love the music🌟 🔊!']
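The loop above modifies the parsed tree, since each img is replaced by a string. If the soup needs to stay untouched, one possible alternative is to walk the descendants of each p and build the sentence by hand. This is only a sketch: the function name text_with_emojis is hypothetical, and "html.parser" is assumed here in place of lxml.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

html = '''<p class="text">I love the night<img alt="🌟" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the day<span>!</span></p>
<p class="text">I love the music<img alt="🌟" class="emoji" src="etc"/> <img alt="🔊" class="emoji" src="etc"/><span>!</span></p>
'''


def text_with_emojis(node):
    """Hypothetical helper: join text nodes and emoji alt texts in document order."""
    parts = []
    for child in node.descendants:
        if isinstance(child, Comment):
            # Skip HTML comments, as get_text() does by default
            continue
        if isinstance(child, NavigableString):
            parts.append(str(child))
        elif (isinstance(child, Tag) and child.name == "img"
              and "emoji" in child.get("class", [])):
            # Emit the alt text where the image sits; the tree is not modified
            parts.append(child.get("alt", ""))
    return "".join(parts)


soup = BeautifulSoup(html, "html.parser")
results = [text_with_emojis(p) for p in soup.find_all("p", class_="text")]
print(results)
```

Because nothing is replaced, the img tags remain available afterwards, for example to read their src attributes.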

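Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.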

 