How to extract time class from HTML in python?

I have a piece of HTML that I parsed with BeautifulSoup in Python, but I am unable to retrieve the desired time tag from it.

The parsed HTML is called K:

<time class="dtstart" datetime="05 December 201710:30 AM GMT" id="x-event-date" xcdate="1512469800950">
<a class="action pull-right print-cat" data-href="/en/aus/2017/some-url-data-l17407.html" data-modalid="catalogueModal" data-toggle="modal" href="/en/auctions/ecatalogue/lot.print.L17407.html" style="display: none;">Print My Catalogue (0)</a>
<ul class="breadcrumb inline">
<li>
<a href="/en/aus/2017/some-url-data-l17407.html"><span class="active">Smartphone and watches</span></a>
</li>
</ul>
</time>    

I can extract all tags except time:

K.a :
<a class="action pull-right print-cat" data-href="/en/aus/2017/some-url-data-l17407.html" data-modalid="catalogueModal" data-toggle="modal" href="/en/auctions/ecatalogue/lot.print.L17407.html" style="display: none;">Print My Catalogue (0)</a>

K.li:
<li>
<a href="/en/aus/2017/some-url-data-l17407.html"><span class="active">Smartphone and watches</span></a>
</li>

K.time:
Nothing prints

I have also tried the following solution:

K.find('time', {'class':'dtstart'})
Nothing prints

K.find('a', {'class':'action pull-right print-cat'})
<a class="action pull-right print-cat" data-href="/en/aus/2017/some-url-data-l17407.html" data-modalid="catalogueModal" data-toggle="modal" href="/en/auctions/ecatalogue/lot.print.L17407.html" style="display: none;">Print My Catalogue (0)</a>

When we inspect K we see the following:

Signature:      K(*args, **kwargs)
Type:           Tag
String form:   
<time class="dtstart" datetime="05 December 201710:30 AM GMT" id="x-event-date" xcdate="1512469800950">
<a class="action pull-right print-cat" data-href="/en/aus/2017/some-url-data-l17407.html" data-modalid="catalogueModal" data-toggle="modal" href="/en/auctions/ecatalogue/lot.print.L17407.html" style="display: none;">Print My Catalogue (0)</a>
<ul class="breadcrumb inline">
<li>
<a href="/en/aus/2017/some-url-data-l17407.html"><span class="active">Smartphone and watches</span></a>
</li>
</ul>
</time>  
Length:         5
File:           ~/.local/lib/python3.6/site-packages/bs4/element.py
Source:    

How is it possible that the time tag isn't being extracted?

You need to double-check the HTML your script is actually receiving. A minimal example using the HTML from your question makes it clear that bs4 can get a time tag.

from bs4 import BeautifulSoup

html_string = """<time class="dtstart" datetime="05 December 201710:30 AM GMT" id="x-event-date" xcdate="1512469800950">
<a class="action pull-right print-cat" data-href="/en/aus/2017/some-url-data-l17407.html" data-modalid="catalogueModal" data-toggle="modal" href="/en/auctions/ecatalogue/lot.print.L17407.html" style="display: none;">Print My Catalogue (0)</a>
<ul class="breadcrumb inline">
<li>
<a href="/en/aus/2017/some-url-data-l17407.html"><span class="active">Smartphone and watches</span></a>
</li>
</ul>
</time>"""

k = BeautifulSoup(html_string, features="lxml")
print(k.time.attrs)

OUTPUT

{'class': ['dtstart'], 'datetime': '05 December 201710:30 AM GMT', 'id': 'x-event-date', 'xcdate': '1512469800950'}
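One thing worth checking, as a sketch rather than a confirmed diagnosis: the inspection output in the question reports `Type: Tag` with a string form that starts at `<time ...>`, which suggests K may itself be the time Tag rather than a soup that contains it. `Tag.find()` and the `K.time` shortcut search only the descendants of a tag, so a tag never matches itself. The example below reproduces that behaviour with a trimmed copy of the HTML (the shortened markup and the `html.parser` choice are assumptions made to keep it self-contained).

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question (illustrative only).
html_string = """<time class="dtstart" datetime="05 December 201710:30 AM GMT" id="x-event-date" xcdate="1512469800950">
<a class="action" href="/en/auctions/ecatalogue/lot.print.L17407.html">Print My Catalogue (0)</a>
</time>"""

soup = BeautifulSoup(html_string, "html.parser")

# On the whole document, <time> is a descendant, so the lookup succeeds.
K = soup.time
print(type(K).__name__, K.name)  # Tag time

# On the <time> Tag itself, .time and .find() look only at descendants,
# so the tag never matches itself and both calls return None.
print(K.time)
print(K.find("time", {"class": "dtstart"}))

# Children of K remain reachable, which matches what the question reports.
print(K.a["href"])
```

If `K.name` prints `time` in the real script, the empty results for `K.time` and `K.find('time', ...)` are expected behaviour and not a sign of malformed HTML.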

I am still unsure why I was not able to retrieve it in the first place, but Chris Doyle paved the way to success. We can simply re-soup it (parse the string form of K again) and get the desired result:

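from bs4 import BeautifulSoup as soup  # assumed alias; the original snippet calls soup(...) without showing its import
# Re-parse the string form of K, then read the datetime attribute of the time tag
Date = soup(str(K), "html.parser").time.attrs["datetime"]
print(Date)

#Output
05 December 201710:30 AM GMT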
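A possible alternative to re-parsing, assuming K is already the `<time>` Tag (as the `Type: Tag` inspection output suggests): the attributes can be read from K directly. In the sketch below, K is rebuilt from a shortened version of the markup purely so the example runs on its own; the stand-in HTML and the variable names `date`, `date_or_none`, and `reparsed` are illustrative.

```python
from bs4 import BeautifulSoup

html_string = (
    '<time class="dtstart" datetime="05 December 201710:30 AM GMT" '
    'id="x-event-date" xcdate="1512469800950"></time>'
)

# Stand-in for the K from the question: a Tag whose own name is "time".
K = BeautifulSoup(html_string, "html.parser").time

# Attributes live on the Tag itself, so no second parse is required.
date = K["datetime"]              # raises KeyError if the attribute is missing
date_or_none = K.get("datetime")  # returns None if the attribute is missing

# Re-parsing also works, because <time> becomes a descendant of the new soup.
reparsed = BeautifulSoup(str(K), "html.parser").time["datetime"]

print(date)
print(date == date_or_none == reparsed)
```

Under that assumption, all three lookups yield the same string, and the direct `K["datetime"]` form avoids serialising and parsing the markup a second time.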
