
BeautifulSoup doesn't find tag (case sensitive / insensitive issue)

I have an RSS feed I'm trying to parse with bs4. Each item in the feed has this structure, and as far as I can see, all tags are always present.

<item>
    <title>From impeachment, to pandemic, to riots, wildfires and killer hornets, 2020 is proving to be a doozy</title>
    <link>https://www.washingtontimes.com/news/2020/jun/2/from-impeachment-to-pandemic-to-riots-wildfires-an/?utm_source=RSS_Feed&amp;utm_medium=RSS</link>
    <description>&lt;p&gt;Can 2020 get crazier? You bet. We&amp;rsquo;re just getting started.&lt;/p&gt; &lt;p&gt;If you wrote a screenplay of what&amp;rsquo;s happened so far in 2020 and gave it to Hollywood producers, they&amp;rsquo;d laugh you right out of the room.&lt;/p&gt; &lt;p&gt;&amp;ldquo;So, your movie,&amp;rdquo; they&amp;rsquo;d say, &amp;ldquo;has the president of the United States being impeached, ...</description>
    <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Joseph Curl</dc:creator>
    <pubDate>Tue, 02 Jun 2020 16:51:24 -0400</pubDate>
    <guid>https://www.washingtontimes.com/news/2020/jun/2/from-impeachment-to-pandemic-to-riots-wildfires-an/?utm_source=RSS_Feed&amp;utm_medium=RSS</guid>
</item>

I'm trying to parse each item tag with the following code:

from bs4 import BeautifulSoup

# resp (the feed body), curs and conn (DB cursor and connection) are
# assumed to be defined earlier in the script.
xml = BeautifulSoup(resp, "html.parser")
for item in xml.find_all("item"):
    curs.execute(
        "INSERT INTO rss_items (title, link, description, dc_creator, pub_date, guid) "
        "VALUES (%s, %s, %s, %s, %s, %s)",
        (
            item.find("title").text,
            item.find("link").text,
            item.find("description").text,
            item.find("dc:creator").text,
            item.find("pubDate").text,
            item.find("guid").text,
        ),
    )
    conn.commit()

I know bs4 is reading the feed properly, because if I make the body of the loop a simple print(item.find("title").text), the title of each item tag is printed. Yet when I run this code on my server, I get the following error:

Traceback (most recent call last):
  File "inserts.py", line 21, in <module>
    item.find("pubDate").text,
AttributeError: 'NoneType' object has no attribute 'text'

Why does this error occur, and why only for the child tag pubDate, while all previous item.find calls seem to be successful?

Use lowercase letters instead of camelCase in the element name, as in:

item.find("pubdate")

With the example you provided, this solves the issue. The other tag names in the item are already lowercase, which is why only the pubDate lookup fails.
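
To see the behavior in isolation, here is a minimal sketch that parses a trimmed-down copy of the item with "html.parser" and tries both spellings. The title text is a placeholder, and the variable names (rss, mixed, lower) are only for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down <item>; the title is a placeholder, not real feed data.
rss = """
<item>
    <title>Example title</title>
    <pubDate>Tue, 02 Jun 2020 16:51:24 -0400</pubDate>
</item>
"""

soup = BeautifulSoup(rss, "html.parser")
item = soup.find("item")

# html.parser stores the tag as "pubdate", so the mixed-case name matches nothing.
mixed = item.find("pubDate")
# The lowercase name matches the stored tag.
lower = item.find("pubdate")

print(mixed)
print(lower.text)
```

Calling .text on the first result is what raises the AttributeError in the question, since find returns None when nothing matches.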

According to the documentation:

"Because HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. That is, the markup <TAG></TAG> is converted to <tag></tag>. If you want to preserve mixed-case or uppercase tags and attributes, you'll need to parse the document as XML."

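If you would rather keep the original pubDate spelling, the alternative the documentation points to is parsing the feed as XML. The following is a sketch under the assumption that lxml is installed, since bs4's "xml" parser depends on it. The sample item is trimmed down and its title is a placeholder.

```python
from bs4 import BeautifulSoup

# Trimmed-down <item>; the title is a placeholder, not real feed data.
rss = (
    "<item>"
    "<title>Example title</title>"
    "<pubDate>Tue, 02 Jun 2020 16:51:24 -0400</pubDate>"
    "</item>"
)

# The "xml" feature selects bs4's XML parser (assumes lxml is installed).
soup = BeautifulSoup(rss, "xml")
item = soup.find("item")

# Tag names keep their original case, so the mixed-case lookup now matches...
pub_date = item.find("pubDate").text
# ...and the lowercase lookup is the one that finds nothing.
lowercase_hit = item.find("pubdate")

print(pub_date)
print(lowercase_hit)
```

With this parser the lookups in the question's loop can stay as written, but any code that relied on lowercase names would need the original casing instead.
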