I have this html:
<td class="name-td alLeft bordR">13.10.2016, Thu<span class="sp">|</span>17:00</td>
I want to get the date (13.10.2016) and the time (17:00).
This is what I'm doing:
t = lxml.html.parse(url)
nextMatchDate = t.findall(".//td[@class='bordR']")[count].text
But I'm getting an error:
IndexError: list index out of range
I think it happens because there are HTML tags inside the <td> tag.
Could you help me, please?
The problem is in the way you check for the bordR class. class is a multi-valued, space-delimited attribute, so you have to account for the other classes on the element. In XPath you should be using "contains":
.//td[contains(@class, 'bordR')]
Or, more reliably, add "concat" to the partial match check so that only the whole class name matches and not a substring of a longer class name.
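As a rough sketch of the difference between the three checks, the snippet below runs them side by side. The wrapping table and the second cell with class "bordRight" are made up for illustration and are not part of the original page. Note that contains() and concat() are XPath functions, so the sketch calls .xpath(); lxml's .findall() only understands a limited path subset and does not accept them.

```python
from lxml.html import fromstring

# The cell from the question plus a hypothetical decoy cell whose class
# merely contains the substring "bordR".
html = (
    '<table><tr>'
    '<td class="name-td alLeft bordR">13.10.2016, Thu<span class="sp">|</span>17:00</td>'
    '<td class="bordRight">decoy</td>'
    '</tr></table>'
)
root = fromstring(html)

# Exact attribute comparison: matches only class="bordR" and nothing else,
# which is why the original lookup returned an empty list.
exact = root.findall(".//td[@class='bordR']")

# Substring match: finds the real cell, but also the decoy.
loose = root.xpath(".//td[contains(@class, 'bordR')]")

# Whole-word match: pad the class list with spaces, then look for ' bordR '.
strict = root.xpath(
    ".//td[contains(concat(' ', normalize-space(@class), ' '), ' bordR ')]"
)

print(len(exact), len(loose), len(strict))
```

With this sample markup the exact comparison finds nothing, the substring match finds both cells, and the padded check finds only the cell from the question.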
Once you've located the element, you can use the .text_content() method to get the complete text, including the text of all its children:
In [1]: from lxml.html import fromstring
In [2]: data = '<td class="name-td alLeft bordR">13.10.2016, Thu<span class="sp">|</span>17:00</td>'
In [3]: td = fromstring(data)
In [4]: print(td.text_content())
13.10.2016, Thu|17:00
To take it a step further, you can load the date string into a datetime object:
In [5]: from datetime import datetime
In [6]: datetime.strptime(td.text_content(), "%d.%m.%Y, %a|%H:%M")
Out[6]: datetime.datetime(2016, 10, 13, 17, 0)
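Since the question asks for the date and the time as separate values, one possible follow-up is to format them back out of the parsed object. This is only a sketch: the variable names are illustrative, and %a relies on the weekday abbreviation matching the current locale (English here).

```python
from datetime import datetime

# The string produced by td.text_content() for the cell in the question.
raw = "13.10.2016, Thu|17:00"

parsed = datetime.strptime(raw, "%d.%m.%Y, %a|%H:%M")

# Format the two pieces separately, in the same style as the source page.
date_text = parsed.strftime("%d.%m.%Y")
time_text = parsed.strftime("%H:%M")

print(date_text, time_text)
```

Keeping the datetime object around is usually more useful than the strings themselves, for example when sorting or comparing matches.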
There's a method called .itertext() that:
Iterates over the text content of a subtree.
So if you have the td element in a variable td, you can do this:
>>> text = list(td.itertext()); text
['13.10.2016, Thu', '|', '17:00']
>>> date, time = text[0].split(',')[0], text[-1]
>>> datetime_text = '{} at {}'.format(date, time)
>>> datetime_text
'13.10.2016 at 17:00'
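Putting the pieces together, a minimal end-to-end sketch might look like the following. The function name match_date_and_time, its index parameter, and the wrapping table in the sample are assumptions made for illustration; the length check is there so that a missing cell returns None instead of raising the IndexError from the question.

```python
from lxml.html import fromstring


def match_date_and_time(html, index=0):
    """Return (date, time) from the index-th 'bordR' cell, or None if absent."""
    root = fromstring(html)
    cells = root.xpath(
        ".//td[contains(concat(' ', normalize-space(@class), ' '), ' bordR ')]"
    )
    if not 0 <= index < len(cells):
        return None  # no such cell, so there is nothing to index into
    # Collect the non-empty text pieces of the cell, including child elements.
    parts = [piece.strip() for piece in cells[index].itertext() if piece.strip()]
    # The first piece is "date, weekday" and the last piece is the time.
    return parts[0].split(',')[0], parts[-1]


sample = (
    '<table><tr>'
    '<td class="name-td alLeft bordR">13.10.2016, Thu<span class="sp">|</span>17:00</td>'
    '</tr></table>'
)
result = match_date_and_time(sample)
missing = match_date_and_time(sample, index=5)
print(result, missing)
```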