BeautifulSoup returns some weird text for the <a> tag

I'm new to web scraping and I'm trying to scrape data from this auction website. However, I run into a strange problem when trying to get the text of the anchor tag.

Here's the HTML:

<div class="mt50">
  <div class="head_011">
    <a id="item_event_title" href="https://www.storyltd.com/auction/auction.aspx?eid=4158">NO RESERVE AUCTION OF MODERN AND CONTEMPORARY ART  (16-17 APRIL 2019)</a>
  </div>
</div>

Here's my code:

auction_info = LTD_work_soup.find('a', id = 'item_event_title').text
print(auction_info)

This prints out "Back To Auction Catalogue" instead of 'NO RESERVE AUCTION OF MODERN AND CONTEMPORARY ART (16-17 APRIL 2019)', which is what I expect.

Here's the link to the page.

Thank you.

Here is how you can extract 'NO RESERVE AUCTION OF MODERN AND CONTEMPORARY ART (16-17 APRIL 2019)' from the webpage:

from bs4 import BeautifulSoup
import requests

# The auction title is read from a hidden <input>, not from the anchor text
page_link = 'https://www.storyltd.com/auction/item.aspx?eid=4158&lotno=2'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
title_input = page_content.find("input", attrs={"id": "hdnAuctionTitle"})
print(title_input["value"])

Output:

NO RESERVE AUCTION OF MODERN AND CONTEMPORARY ART  (16-17 APRIL 2019)

If you inspect page_content, you will find that this text appears in the value attribute of an <input> tag (id hdnAuctionTitle) rather than in the text of the anchor.
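If you are not sure where a piece of text lives in the HTML you downloaded, one way to find out is to search every tag's own text and attribute values for it. The sketch below does that on a small inline snippet. The snippet is a hypothetical, simplified stand-in for what requests receives (the real page's markup may differ), and find_holders is a helper name made up for this example, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the downloaded HTML (an assumption;
# the real page contains much more markup and may be structured differently).
html = """
<div class="head_011">
  <a id="item_event_title" href="#">Back To Auction Catalogue</a>
</div>
<input type="hidden" id="hdnAuctionTitle" value="NO RESERVE AUCTION OF MODERN AND CONTEMPORARY ART (16-17 APRIL 2019)">
"""


def find_holders(soup, needle):
    """Return (tag name, id, location) for every place `needle` occurs."""
    hits = []
    for tag in soup.find_all(True):
        # Attribute values: multi-valued attributes such as class are lists.
        for name, value in tag.attrs.items():
            text = " ".join(value) if isinstance(value, list) else str(value)
            if needle in text:
                hits.append((tag.name, tag.get("id"), f"attribute {name}"))
        # Text that belongs directly to this tag, not to its descendants.
        own_text = "".join(tag.find_all(string=True, recursive=False))
        if needle in own_text:
            hits.append((tag.name, tag.get("id"), "text"))
    return hits


soup = BeautifulSoup(html, "html.parser")
needle = "NO RESERVE AUCTION"

for tag_name, tag_id, location in find_holders(soup, needle):
    print(f"<{tag_name} id={tag_id!r}> holds the text in its {location}")

# The anchor itself only carries the link label in this snippet.
print(soup.find("a", id="item_event_title").get_text())
```

On this snippet the only hit is the value attribute of the input tag, while the anchor's text is just the link label, which matches what the question observed.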

I hope it helps!
