
BeautifulSoup4 can't extract only text from a tag

I am trying to extract the title, description and URL from every item in an XML file, but I am having trouble extracting the text of the description tag without the tags inside it.

Here is my code:

import urllib.request
from bs4 import BeautifulSoup


def read_xml(url):
    """reads xml string from url"""

    with urllib.request.urlopen(url) as source:
        html=source.read()

    return BeautifulSoup(html,'xml')

def read_content(html_file):
    """reads title,description and url from xml file"""

    content={'title':[],'description':[],'url':[]}

    item_lines=html_file.find_all('item')


    #item_lines is a list of the content within <item></item> tags
    for item in item_lines:
        content['title'].append(item.title.string)
        content['description'].append(item.description.text[:50]+"..")
        content['url'].append(item.link.text)

    return content

soup=read_xml('https://www.gamespot.com/feeds/game-news/')

content=read_content(soup)

# iterate over the dict returned by read_content (display_content was never defined)
for values in content.values():
    print(values)
    print("\n")

And this is the output (only showing the first elements of the lists):

['Fortnite Guide: Week 2 Secret Battle Banner Location (Season 6 Hunting Party Challenge)', 'Getting Away With Crime In Red Dead Redemption 2 Is Tricky', "This Is How Red Dead Redemption 2's Cores, Health, And Stats Work", "Red Dead Redemption 2: Here's How The Horses ...]

['<p>Season 6 of <a href="https://www.gamespot.com/f..', '<p><a href="https://www.gamespot.com/red-dead-rede..', '<p>In terms of scale, scope, gameplay systems, and..', '<p>One of the key areas of <a href="https://www.ga..', '<p>Week 2 of <a href="https://www.gamespot.com/for..', '<p>Forza Horizon is back for another year, and tha..', '<p>From all that we\'ve seen of ...]


['https://www.gamespot.com/articles/fortnite-guide-week-2-secret-battle-banner-locatio/1100-6462272/', 'https://www.gamespot.com/articles/getting-away-with-crime-in-red-dead-redemption-2-i/1100-6462203/', 'https://www.gamespot.com/articles/this-is-how-red-dead-redemption-2s-cores-health-an/1100-6462201/', ...]

As you can see, the second list still contains p and a tags, which I am not able to get rid of. I tried .get_text(), .string, .text and .descendants, and looked for a solution in the documentation, but most of the time the output is the same. I also don't want to remove those tags manually, because the program should work for any XML document.

I would really appreciate it if you could help me with this or point me in the right direction.

The description holds HTML markup stored as text, so the XML parser hands it back as a plain string with the tags still in it. Parse that string a second time with BeautifulSoup and extract the text from the result.

desc = BeautifulSoup(item.description.text, 'html.parser')
content['description'].append(desc.text[:50]+"..")
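The same idea can be sketched as a small standalone helper. This is only an illustration: the function name description_to_text, the limit parameter and the sample string are assumptions made up for this example, and the sample stands in for what item.description.text returns in the question.

```python
from bs4 import BeautifulSoup


def description_to_text(raw_description, limit=50):
    """Return a plain-text preview of a description that contains HTML.

    raw_description is assumed to be the string the XML parser returned,
    i.e. markup such as <p> and <a> is still present as literal text.
    """
    # Second pass: parse the embedded markup with the built-in HTML parser.
    fragment = BeautifulSoup(raw_description, "html.parser")
    # get_text() concatenates all text nodes; split/join collapses whitespace.
    text = " ".join(fragment.get_text().split())
    # Only add the ".." marker when the text was actually cut.
    return text if len(text) <= limit else text[:limit] + ".."


# Hypothetical description, shaped like the ones in the question's output.
raw = '<p>Season 6 of <a href="https://example.com/f">Fortnite</a> is under way.</p>'
print(description_to_text(raw))
```

Inside read_content this would be called as description_to_text(item.description.text). Unlike the two lines above, it appends ".." only when the text was really truncated, which is a design choice of this sketch and not part of the original answer.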

If that feels too complicated, you can strip the tags with a regular expression instead. I would not personally recommend it, though, because your text may contain ordinary content that matches the same pattern.

import re
# re.sub returns a plain str, so slice it directly (a str has no .text attribute)
desc = re.sub(r"<.*?>", "", item.description.text, flags=re.DOTALL)
content['description'].append(desc[:50]+"..")

The pattern <.*?> matches every HTML tag, and re.sub replaces each match with an empty string.

Hope this helps! Cheers!
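To make the caveat about the regular-expression approach concrete, here is a minimal sketch. The helper name strip_tags_with_regex and both sample strings are made up for illustration; the second string shows ordinary text being removed because it happens to sit between a "<" and a ">".

```python
import re

# Lazy match from "<" to the next ">", across newlines if necessary.
TAG_PATTERN = re.compile(r"<.*?>", flags=re.DOTALL)


def strip_tags_with_regex(raw):
    """Remove everything that looks like a tag, without parsing the HTML."""
    return TAG_PATTERN.sub("", raw)


# Works as intended on simple markup.
clean = strip_tags_with_regex('<p>Week 2 of <a href="https://example.com">Fortnite</a></p>')
print(clean)

# Pitfall: the text between "<" and ">" is not a tag, but it is removed anyway.
damaged = strip_tags_with_regex("<p>Scores where 3 < 5 and 7 > 6 count</p>")
print(damaged)
```

In the second call the words "5 and 7" disappear along with the real tags. A regular expression also leaves character entities such as &amp;amp; undecoded, which is another reason to prefer the parser-based approach.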
