What is wrong with my xpath expression?

I want to extract all the links inside td elements whose class is u-ctitle.

import os
import urllib
import lxml.html
down='http://v.163.com/special/opencourse/bianchengdaolun.html'
file=urllib.urlopen(down).read()
root=lxml.html.document_fromstring(file)
namelist=root.xpath('//td[@class="u-ctitle"]/a')
len(namelist)

The output is [], even though the page contains many td elements whose class is "u-ctitle"; you can see them with Firebug. Why can't I extract them?

My Python version is 2.7.9.

Renaming the variable file to something else does not help.

Your XPath is correct. The problem lies elsewhere.

If you examine the HTML, you will see the following meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=GBK" />
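The declared charset can also be read from the raw bytes instead of by eye. The following is a minimal sketch, not part of the original answer: the helper name declared_charset and the 2048-byte search window are assumptions made for illustration, and the regular expression only covers simple meta tags like the one above.

```python
import re

def declared_charset(raw, window=2048):
    """Return the charset named in the first part of raw HTML bytes, or None.

    Hypothetical helper: it only looks for a literal `charset=...` token
    near the top of the document and does not parse the HTML.
    """
    match = re.search(br'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)',
                      raw[:window], re.IGNORECASE)
    return match.group(1).decode("ascii") if match else None

# The meta tag from the page, as bytes.
head = b'<meta http-equiv="Content-Type" content="text/html; charset=GBK" />'
print(declared_charset(head))
```

This only reports what the page claims about itself; the claim can be wrong.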

And in this code:

file=urllib.urlopen(down).read()
root=lxml.html.document_fromstring(file)

file is actually a byte string, so the decoding from GBK-encoded bytes to a Unicode string happens inside the document_fromstring function.

The problem is that the HTML is not actually encoded in GBK, so lxml decodes it incorrectly, which leads to loss of data. You can confirm this by decoding the bytes yourself:

>>> file.decode('gbk')
Traceback (most recent call last):
  File "down.py", line 9, in <module>
    file.decode('gbk')
UnicodeDecodeError: 'gbk' codec can't decode bytes in position 7247-7248: illegal multibyte sequence
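One way to narrow down the real encoding is to try a short list of candidate codecs with strict decoding and keep the first one that does not raise. The sketch below is written for Python 3 and is only an illustration: the helper name first_working_codec, the candidate list, and the sample bytes are all made up here, and the sample stands in for the downloaded page.

```python
def first_working_codec(data, candidates=("gbk", "gb18030", "utf-8")):
    """Return (codec_name, text) for the first candidate that decodes strictly.

    Hypothetical helper: the candidate list is an assumption, not something
    taken from the page being scraped.
    """
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate codecs could decode the data")

# Made-up sample: U+20000 exists in GB18030 but has no GBK encoding,
# so strict GBK decoding fails on it just like in the traceback above.
sample = '<td class="u-ctitle">\u4e2d\u6587 \U00020000</td>'.encode("gb18030")

codec, text = first_working_codec(sample)
print(codec)
```

A successful strict decode does not prove the codec is the right one, so it is still worth checking that the decoded text looks sensible.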

After some trial and error, we find that the actual encoding is GB18030. To make the script work, you need to decode the bytes manually:

root=lxml.html.document_fromstring(file.decode('GB18030'))
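Put together, the fix can be sketched as a small function that decodes first and parses second. This is an illustration written for Python 3, not the original script: the function name extract_links, the sample markup, and the example.com URL are assumptions, and in the real script the bytes would come from the urlopen(down).read() call in the question.

```python
import lxml.html

def extract_links(raw, encoding="gb18030"):
    """Decode raw HTML bytes explicitly, then return the href of every
    <a> that is a direct child of a td with class "u-ctitle"."""
    text = raw.decode(encoding)                 # decode ourselves, not lxml
    root = lxml.html.document_fromstring(text)  # parse the Unicode string
    return root.xpath('//td[@class="u-ctitle"]/a/@href')

# Made-up page that declares GBK, like the real one, but is encoded as GB18030.
sample = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=GBK" />'
    '</head><body><table><tr>'
    '<td class="u-ctitle"><a href="http://example.com/1">\u7b2c1\u8bfe</a></td>'
    '</tr></table></body></html>'
).encode("gb18030")

print(extract_links(sample))
```

Because the parser receives an already decoded string, the charset declared in the meta tag no longer influences how the bytes are interpreted.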
