
Can't find xml tag using beautiful soup and python

I'm trying to pull the image:title tag matching specific keywords from an XML page. The keywords work fine if I just search on loc tags. Code below:

        print("Searching for product...")
        keywordLinkFound = False
        while keywordLinkFound is False:
            html = self.driver.page_source
            soup = BeautifulSoup(html, 'xml')
            try:
                regexp = "%s.*%s|%s.%s" % (keyword1, keyword2, keyword2, keyword1)
                keywordLink = soup.find('image:title', text=re.compile(regexp))
                print(keywordLink)
                return keywordLink
            except AttributeError:
                print("Product not found on site, retrying...")
                time.sleep(monitorDelay)
                self.driver.refresh()
            break

Here is the XML that I'm parsing:

<url>
<loc>
   https://packershoes.com/products/copy-of-adidas-predator-accelerator-trainer
</loc>
<lastmod>2018-11-24T08:22:42-05:00</lastmod>
<changefreq>daily</changefreq>
<image:image>
    <image:loc>
    https://cdn.shopify.com/s/files/1/0208/5268/products/adidas_Yung-1_B37616_side.jpg?v=1537395620
    </image:loc>
    <image:title>ADIDAS YUNG-1 "CLOUD WHITE"</image:title>
</image:image>
</url>

It seems that I am unable to get to the image:title tag.

This finds the text within the <image:title> tag:

soup.findAll('image')[0].findAll('title')[0].text

or you can do

soup.image.title.text

with the output:

'ADIDAS YUNG-1 "CLOUD WHITE"'

You should use the built-in methods in BeautifulSoup (see the documentation) instead of regular expressions. The benefit of using BeautifulSoup for parsing HTML or XML is that you can take advantage of the structured form of the language.
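The keyword check from the question can also be done without a regular expression once the title text has been extracted. Here is a minimal sketch, assuming the titles are already plain strings. `title_matches` is a hypothetical helper, and the second title is made up for illustration:

```python
def title_matches(title, keywords):
    """Return True if every keyword occurs in title, ignoring case and order."""
    lowered = title.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


titles = [
    'ADIDAS YUNG-1 "CLOUD WHITE"',
    "ADIDAS PREDATOR ACCELERATOR TRAINER",  # made-up title for illustration
]

# Keep only the titles that mention both keywords, in either order.
matches = [t for t in titles if title_matches(t, ["yung", "adidas"])]
print(matches)
```

Because the check is order-independent, there is no need for the two-way alternation that the regular expression in the question tries to build.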

Edit

Here is the complete working code:

from bs4 import BeautifulSoup

html = """
<url>
<loc>
   https://packershoes.com/products/copy-of-adidas-predator-accelerator-trainer
</loc>
<lastmod>2018-11-24T08:22:42-05:00</lastmod>
<changefreq>daily</changefreq>
<image:image>
    <image:loc>
    https://cdn.shopify.com/s/files/1/0208/5268/products/adidas_Yung-1_B37616_side.jpg?v=1537395620
    </image:loc>
    <image:title>ADIDAS YUNG-1 "CLOUD WHITE"</image:title>
</image:image>
</url>
"""

soup = BeautifulSoup(html, 'xml')
soup.image.title.text

Output:

'ADIDAS YUNG-1 "CLOUD WHITE"'
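If the full sitemap declares its namespaces on the root element (the snippet above omits that part, so this is an assumption), a namespace-aware lookup with the standard library's xml.etree.ElementTree is another option. This sketch swaps BeautifulSoup out for ElementTree. The `<urlset>` wrapper and the namespace URIs are those of the standard sitemap and image-sitemap schemas, not something shown in the question:

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map for the lookups below. These URIs are assumed to match
# the declarations on the real sitemap's root element.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
}

xml_text = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>
      https://packershoes.com/products/copy-of-adidas-predator-accelerator-trainer
    </loc>
    <image:image>
      <image:title>ADIDAS YUNG-1 "CLOUD WHITE"</image:title>
    </image:image>
  </url>
</urlset>
"""

root = ET.fromstring(xml_text)

pairs = []
for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
    # findtext returns None when no matching element exists; it does not raise.
    title = url.findtext("image:image/image:title", namespaces=NS)
    if title is not None:
        pairs.append((loc, title))

print(pairs)
```

As with BeautifulSoup's `find`, a missing element yields `None` rather than an exception, so the result needs an explicit `None` check instead of an `except AttributeError` handler.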
