
Using Python lxml + XPath to get videos from a page: I get a list but can't print out the result?

I'm new to Python and would like to use lxml + XPath to get a video link from a web page. What I have now is:

import urllib2
from lxml import etree

url=u"http://hkdramas.se/fashion-war-%E6%BD%AE%E6%B5%81%E6%95%99%E4%B8%BB-episode-20/"
xpath=u"//script[contains(.,'label:\"360p\"')]"

html=urllib2.urlopen(url).read()
selector=etree.HTML(html)
get=selector.xpath(xpath)

print get

I've checked type() of get, which tells me it's a list, but when I print get, it shows the unexpected [<Element script at 0x2a34b88>]. What does that mean? And how can I extract the actual URL of the video instead of the Element script?


Finally, I figured out why I had this problem. Thanks, @unutbu.

xpath=u"//script[contains(.,'label:\"360p\"')]"

should be

xpath=u"//script[contains(.,'label:\"360p\"')]//text()"

The added text() makes the expression return only the text under the selected element, not the element itself. Note the //: it keeps the expression working when the selected element has many sub-elements, because it collects the text of all of its descendants.
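The difference between the two expressions can be seen on a small inline document. This is only a sketch: it assumes lxml is installed, and SAMPLE_HTML with its example.com URL is a made-up stand-in for the real page.

```python
from lxml import etree

# Hypothetical stand-in for the real page; the script bodies are made up.
SAMPLE_HTML = """
<html><body>
<script>var a = 1;</script>
<script>sources: [{file:"http://example.com/v360.mp4", label:"360p"}]</script>
</body></html>
"""

selector = etree.HTML(SAMPLE_HTML)

# Without text(): the result is a list of Element objects.
elements = selector.xpath("""//script[contains(., 'label:"360p"')]""")
# With //text(): the result is a list of strings.
texts = selector.xpath("""//script[contains(., 'label:"360p"')]//text()""")

print(elements)
print(texts)
```

The second list can be printed or searched directly, which is why adding //text() made the output readable.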

selector.xpath(xpath) returns a list of tags (or more accurately, Elements). When you print a list of objects, Python shows the repr of each object in it. <Element script at 0x2a34b88> is the repr of the script Element.
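A small sketch of this behavior, assuming lxml is installed. The inline HTML string and the names root and found are invented for illustration, and the memory address in the repr will differ on every run.

```python
from lxml import etree

# Hypothetical one-script document used only for illustration.
root = etree.HTML("<html><body><script>var x = 1;</script></body></html>")
found = root.xpath("//script")

print(type(found))  # the XPath result is a plain Python list
print(found)        # printing the list shows each Element's repr

for elt in found:
    # The tag name and the text inside the tag are ordinary attributes.
    print(elt.tag, repr(elt.text))
```

Printing the list only shows where each Element lives in memory; the useful data is reached through attributes such as elt.tag and elt.text.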

If elt is the script Element, then elt.text will return the text inside the <script> tag, but you'll need to use something else (besides lxml) to extract the URL from the text. You could, for example, use the regex pattern r'"(http[^"]+)"' to search for text which begins with "http and continues until another double quote, ", is found:

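import re
import lxml.html as LH

url = u"http://hkdramas.se/fashion-war-%E6%BD%AE%E6%B5%81%E6%95%99%E4%B8%BB-episode-20/"
# Select the <script> element whose content mentions the 360p label.
xpath = u"""//script[contains(.,'label:"360p"')]"""
# LH.parse accepts a URL and fetches the page itself.
root = LH.parse(url)
for elt in root.xpath(xpath):
    # Capture every double-quoted string that starts with http.
    for video_url in re.findall(r'"(http[^"]+)"', elt.text):
        print(video_url)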

yields

http://hkdramas.se/wp-content/plugins/BSplugin-version-1.2/lib/grab.php?link1=NS71jbj8NVNANTN7N0Nq7Y7FjeN0NojTN47HNcN77_Nhjh7INm7ONLNijCNc7-7UN_NXNCjcNYjeNwNF7uNQNA7dNvNm7-Nr7vNW7-NtjN72N4jVNCN8NfN-NANm7l7rNP7ff5aa877861da31d8cc9dd087d6ce2417fb1308a676a771b787adbffbaa4a0bffNfNHjtj-N6NDNg7HjLND7F7fjMj.jVjKN1N-jMj7NXj7jNNyjTNwjgjmji7INANtNONsN2NvN6jMNaNTNdNlNON8j7N~NEjO7lNyN.jQNaNuN1NYNjjzNnNENUNmNm7Z707dNaNTNFN0N6N8N.NRNuN_7dNtjhjJN-jmNZNpjjNo7fNHjTNNNSNLjMNqNUjN7IN7NPNfNENKN3jT7dNs&link2=
http://hkdramas.se/wp-content/plugins/BSplugin-version-1.2/lib/grab.php?link1=NvNeNVN4N276Nz7JNSjz7lNLNvNV7Ij3Nx7FNn7.Ni7FNU76NDNMN.NqNkNo7QNKNINiNhjPNJjmNKjPNGN.No7B7BNC7Y7B7B7lN67tjb7JNJNT7rNANrNBN7N6Nt7lN1ND0ba06b7bac4bab5fbb42dbff6c27647ea71b4f725a0c73f175eadf3b459424edN0NBNvNZj77wNL7Wj_j_71NnN0jpNfjPNqNvjDN.jEN4NRNDjijejmjXNINqNijEjENKNfNdN3jiNDNOjcNyN4NwNzN4NqNlNqNAjDNQNBN0Nk7a7Rj8NXN_NiN6NFNmNmNLNwNm7YN7j77vNfNpNljw7HjENRjmNMjVNLNEjq7BN0NON57JNyNyjpN8Nbjz7lN-NfNYNMN.7IjD7.NQ&link2=

Note that you do not need to import urllib2. You can pass a URL directly to LH.parse.
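The regex step can also be tried without any network access. The sketch below uses only the standard library; script_text is a made-up example of what a player script might contain, and extract_urls is a hypothetical helper, not part of lxml or re.

```python
import re

# Made-up script body; the real page's script will look different.
script_text = '''
jwplayer("player").setup({
    sources: [
        {file:"http://example.com/grab.php?link1=abc", label:"360p"},
        {file:"http://example.com/grab.php?link1=def", label:"720p"}
    ]
});
'''

# Same pattern as above: a double quote, then http, up to the next double quote.
URL_PATTERN = re.compile(r'"(http[^"]+)"')


def extract_urls(text):
    """Return every double-quoted string starting with http; None gives []."""
    return URL_PATTERN.findall(text or "")


print(extract_urls(script_text))
print(extract_urls(None))
```

Treating None as an empty string is a defensive choice: elt.text is None for a <script> tag that has no inline content, and re.findall raises a TypeError when given None.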


To get only the URL which is followed by the string '360p', you could use

for url in re.findall(r'"(http[^"]+).*360p"', elt.text):
    print(url)
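One caveat worth checking on your own page: the .* in this pattern is greedy and matches anything except a newline, so it relies on each source sitting on its own line. The sketch below compares it with a stricter variant on made-up data. STRICT is a hypothetical pattern that assumes the label follows the URL directly, in the form file:"URL", label:"360p"; that layout is an assumption, not something confirmed from the real page.

```python
import re

# Pattern from the answer: a URL, then anything on the same line, then 360p".
LOOSE = re.compile(r'"(http[^"]+).*360p"')
# Hypothetical stricter variant: the label must come right after the URL.
STRICT = re.compile(r'"(http[^"]+)"\s*,\s*label\s*:\s*"360p"')

# Made-up script text with one source per line.
multi_line = '''
{file:"http://example.com/low.mp4", label:"360p"},
{file:"http://example.com/high.mp4", label:"720p"}
'''
# The same made-up sources on a single line, with the 360p entry last.
single_line = ('{file:"http://example.com/high.mp4", label:"720p"},'
               '{file:"http://example.com/low.mp4", label:"360p"}')

print(LOOSE.findall(multi_line))
print(STRICT.findall(multi_line))
print(LOOSE.findall(single_line))
print(STRICT.findall(single_line))
```

With one source per line, both patterns return the 360p URL. On the single-line sample the looser pattern returns the 720p URL instead, because .* runs across the first entry until it reaches 360p" at the end of the line.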
