Can't find XML tag using Beautiful Soup and Python
I'm trying to pull the image:title tag with specific keywords from an XML page. The keywords work fine if I just search on loc tags. Code below:
print("Searching for product...")
keywordLinkFound = False
while keywordLinkFound is False:
    html = self.driver.page_source
    soup = BeautifulSoup(html, 'xml')
    try:
        regexp = "%s.*%s|%s.%s" % (keyword1, keyword2, keyword2, keyword1)
        keywordLink = soup.find('image:title', text=re.compile(regexp))
        print(keywordLink)
        return keywordLink
    except AttributeError:
        print("Product not found on site, retrying...")
        time.sleep(monitorDelay)
        self.driver.refresh()
    break
Here is the XML that I'm parsing:
<url>
  <loc>
    https://packershoes.com/products/copy-of-adidas-predator-accelerator-trainer
  </loc>
  <lastmod>2018-11-24T08:22:42-05:00</lastmod>
  <changefreq>daily</changefreq>
  <image:image>
    <image:loc>
      https://cdn.shopify.com/s/files/1/0208/5268/products/adidas_Yung-1_B37616_side.jpg?v=1537395620
    </image:loc>
    <image:title>ADIDAS YUNG-1 "CLOUD WHITE"</image:title>
  </image:image>
</url>
It seems that I am unable to get to the image:title tag.
This finds the text within the <image:title> tag:
soup.findAll('image')[0].findAll('title')[0].text
or you can do:
soup.image.title.text
with the output:
'ADIDAS YUNG-1 "CLOUD WHITE"'
You should use the built-in methods in BeautifulSoup (see its documentation) instead of regular expressions. The benefit of using BeautifulSoup for parsing HTML or XML is that you can take advantage of the structured form of the language.
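As a rough illustration of walking the document structure rather than pattern-matching on raw text, here is a minimal sketch that uses the standard library's xml.etree.ElementTree in place of BeautifulSoup (a swapped-in technique, not what this answer uses). It assumes the full sitemap declares the image prefix on its root element with the Google image-sitemap namespace URI. The fragment in the question does not show that declaration, so the urlset wrapper and the image_titles helper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: the real sitemap root declares the "image" prefix like this.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://packershoes.com/products/copy-of-adidas-predator-accelerator-trainer</loc>
    <image:image>
      <image:title>ADIDAS YUNG-1 "CLOUD WHITE"</image:title>
    </image:image>
  </url>
</urlset>"""

# Prefix-to-URI map used only for lookups; "sm" is an arbitrary local alias.
NAMESPACES = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
}


def image_titles(xml_text):
    """Return (loc, title) pairs for every <url> that has an <image:title>."""
    root = ET.fromstring(xml_text)
    pairs = []
    for url in root.findall("sm:url", NAMESPACES):
        loc = url.findtext("sm:loc", default="", namespaces=NAMESPACES).strip()
        title = url.findtext("image:image/image:title", namespaces=NAMESPACES)
        if title is not None:
            pairs.append((loc, title.strip()))
    return pairs


print(image_titles(SITEMAP_XML))
```

Returning the loc alongside the title keeps the product link available once a matching title is found, which seems to be what the question is ultimately after.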
Edit
Here is the complete working code:
from bs4 import BeautifulSoup
html = """
<url>
<loc>
https://packershoes.com/products/copy-of-adidas-predator-accelerator-trainer
</loc>
<lastmod>2018-11-24T08:22:42-05:00</lastmod>
<changefreq>daily</changefreq>
<image:image>
<image:loc>
https://cdn.shopify.com/s/files/1/0208/5268/products/adidas_Yung-1_B37616_side.jpg?v=1537395620
</image:loc>
<image:title>ADIDAS YUNG-1 "CLOUD WHITE"</image:title>
</image:image>
</url>
"""
soup = BeautifulSoup(html, 'xml')
soup.image.title.text
Output:
'ADIDAS YUNG-1 "CLOUD WHITE"'
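Once the title text has been extracted, the keyword check from the question can be applied to plain strings. The sketch below is one possible way to do that with the standard re module. The helper names, the keyword values, and the second title in the sample list are made up for illustration. Note that the question's pattern uses a single "." in its second alternative, which allows only one character between the keywords; this sketch uses ".*" in both alternatives, which may or may not be what was intended.

```python
import re


def build_keyword_pattern(keyword1, keyword2):
    """Compile a pattern that matches both keywords in either order."""
    # re.escape keeps characters such as "-" or "." in a keyword literal.
    k1, k2 = re.escape(keyword1), re.escape(keyword2)
    return re.compile("%s.*%s|%s.*%s" % (k1, k2, k2, k1), re.IGNORECASE)


def filter_titles(titles, keyword1, keyword2):
    """Return only the titles that contain both keywords."""
    pattern = build_keyword_pattern(keyword1, keyword2)
    return [title for title in titles if pattern.search(title)]


# Hypothetical sample data: the first title comes from the question's XML,
# the second is invented to show a non-match.
titles = ['ADIDAS YUNG-1 "CLOUD WHITE"', "ADIDAS PREDATOR ACCELERATOR TRAINER"]
print(filter_titles(titles, "yung", "adidas"))
# prints ['ADIDAS YUNG-1 "CLOUD WHITE"']
```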