I'm new to Python and would like to use lxml with XPath to get a video link from a web page. What I have now is:
import urllib2
from lxml import etree
url=u"http://hkdramas.se/fashion-war-%E6%BD%AE%E6%B5%81%E6%95%99%E4%B8%BB-episode-20/"
xpath=u"//script[contains(.,'label:\"360p\"')]"
html=urllib2.urlopen(url).read()
selector=etree.HTML(html)
get=selector.xpath(xpath)
print get
I've checked type() of get, which shows it's a list, but when I print get, it unexpectedly shows [<Element script at 0x2a34b88>]. What does that mean, and how can I extract the actual URL of the video instead of the script Element?
Finally, I figured out why I had this problem, thanks @unutbu.
xpath=u"//script[contains(.,'label:\"360p\"')]"
should be
xpath=u"//script[contains(.,'label:\"360p\"')]//text()"
The added text() makes the query return only the text under the selected element, not the elements themselves. Note the //: it is there so the query also picks up text when the selected element has many sub-elements.
selector.xpath(xpath) returns a list of tags (or more accurately, Elements). When you print a list of objects, Python shows the repr of each object. <Element script at 0x2a34b88> is the repr of the script Element.
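To make the difference between an Element, its text, and a text() query concrete, here is a minimal sketch. The HTML string and the example.com URL are hypothetical stand-ins for the real page; only the shape of the script content (a label:"360p" entry) is taken from the question.

```python
from lxml import etree

# Hypothetical stand-in for the real page; only the script layout matters here.
html = (
    '<html><body>'
    '<script>var sources = [{file:"http://example.com/sd.mp4",label:"360p"}];</script>'
    '</body></html>'
)
root = etree.HTML(html)
xpath = "//script[contains(., 'label:\"360p\"')]"

elements = root.xpath(xpath)        # a list of Element objects
print(elements)                     # printing the list shows each Element's repr
script_text = elements[0].text      # the text inside the <script> tag
print(script_text)

texts = root.xpath(xpath + "//text()")  # a list of strings instead of Elements
print(texts)
```

Either elt.text or the //text() form gives you the raw JavaScript as a string; neither parses it, so the URL still has to be pulled out of that string separately.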
If elt is the script Element, then elt.text will return the text inside the <script> tag, but you'll need to use something else (besides lxml) to extract the url from the text. You could, for example, use the regex pattern r'"(http[^"]+)"' to search for text which begins with "http and continues until another double quote, ", is found:
import re
import lxml.html as LH
url = u"http://hkdramas.se/fashion-war-%E6%BD%AE%E6%B5%81%E6%95%99%E4%B8%BB-episode-20/"
xpath = u"""//script[contains(.,'label:"360p"')]"""
root = LH.parse(url)
for elt in root.xpath(xpath):
    for url in re.findall(r'"(http[^"]+)"', elt.text):
        print(url)
yields
http://hkdramas.se/wp-content/plugins/BSplugin-version-1.2/lib/grab.php?link1=NS71jbj8NVNANTN7N0Nq7Y7FjeN0NojTN47HNcN77_Nhjh7INm7ONLNijCNc7-7UN_NXNCjcNYjeNwNF7uNQNA7dNvNm7-Nr7vNW7-NtjN72N4jVNCN8NfN-NANm7l7rNP7ff5aa877861da31d8cc9dd087d6ce2417fb1308a676a771b787adbffbaa4a0bffNfNHjtj-N6NDNg7HjLND7F7fjMj.jVjKN1N-jMj7NXj7jNNyjTNwjgjmji7INANtNONsN2NvN6jMNaNTNdNlNON8j7N~NEjO7lNyN.jQNaNuN1NYNjjzNnNENUNmNm7Z707dNaNTNFN0N6N8N.NRNuN_7dNtjhjJN-jmNZNpjjNo7fNHjTNNNSNLjMNqNUjN7IN7NPNfNENKN3jT7dNs&link2=
http://hkdramas.se/wp-content/plugins/BSplugin-version-1.2/lib/grab.php?link1=NvNeNVN4N276Nz7JNSjz7lNLNvNV7Ij3Nx7FNn7.Ni7FNU76NDNMN.NqNkNo7QNKNINiNhjPNJjmNKjPNGN.No7B7BNC7Y7B7B7lN67tjb7JNJNT7rNANrNBN7N6Nt7lN1ND0ba06b7bac4bab5fbb42dbff6c27647ea71b4f725a0c73f175eadf3b459424edN0NBNvNZj77wNL7Wj_j_71NnN0jpNfjPNqNvjDN.jEN4NRNDjijejmjXNINqNijEjENKNfNdN3jiNDNOjcNyN4NwNzN4NqNlNqNAjDNQNBN0Nk7a7Rj8NXN_NiN6NFNmNmNLNwNm7YN7j77vNfNpNljw7HjENRjmNMjVNLNEjq7BN0NON57JNyNyjpN8Nbjz7lN-NfNYNMN.7IjD7.NQ&link2=
Note that you do not need to import urllib2. You can pass a url directly to LH.parse.
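The regex step can be tried offline, without fetching the page. The sketch below runs the same pattern against a made-up script string; the sample text and the example.com URLs are assumptions for illustration, not the real page content.

```python
import re

# Hypothetical script text standing in for elt.text.
script_text = (
    'var sources = [{file:"http://example.com/hd.mp4",label:"720p"},'
    '{file:"http://example.com/sd.mp4",label:"360p"}];'
)

# A double quote, then "http", then everything up to the next double quote.
urls = re.findall(r'"(http[^"]+)"', script_text)
for found in urls:
    print(found)
```

Quoted strings that do not start with http, such as "720p" and "360p", are skipped because the pattern requires http right after the opening quote.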
To get only the url which is followed by the string '360p', you could use
for url in re.findall(r'"(http[^"]+).*360p"', elt.text):
    print(url)
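One caveat with the 360p pattern: .* is greedy and does not cross newlines, so it picks the right url when each source sits on its own line, but it can pick an earlier url if several sources share a line. The sketch below shows both cases on made-up text, plus a stricter pattern. The file:"...",label:"..." layout and the example.com URLs are assumptions, not confirmed from the real page.

```python
import re

loose = r'"(http[^"]+).*360p"'                        # pattern from the answer
strict = r'"(http[^"]+)"\s*,\s*label\s*:\s*"360p"'    # assumes file,label order

# Hypothetical text: one source per line.
multi_line = (
    '{file:"http://example.com/hd.mp4",label:"720p"},\n'
    '{file:"http://example.com/sd.mp4",label:"360p"}'
)
# Hypothetical text: the same sources on a single line.
single_line = multi_line.replace("\n", "")

loose_multi = re.findall(loose, multi_line)
loose_single = re.findall(loose, single_line)
strict_single = re.findall(strict, single_line)

print(loose_multi)     # the 360p url
print(loose_single)    # the 720p url, because .* runs on to the later 360p"
print(strict_single)   # the 360p url
```

If the page puts the label before the url, or uses different key names, the stricter pattern would need to be adjusted to match.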