What is wrong with my XPath expression?

Question

I want to extract all the links inside td elements whose class is u-ctitle.

import os
import urllib
import lxml.html
down='http://v.163.com/special/opencourse/bianchengdaolun.html'
file=urllib.urlopen(down).read()
root=lxml.html.document_fromstring(file)
namelist=root.xpath('//td[@class="u-ctitle"]/a')
len(namelist)

namelist is an empty list ([]). The page contains many td elements whose class is "u-ctitle", and Firebug finds them, so why can't I extract them?

My Python version is 2.7.9.

Renaming the variable file to something else does not help.

Answer 1

Your XPath is correct. The problem lies elsewhere.

If you examine the HTML, you will see the following meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=GBK" />
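The declared charset can also be read from the raw bytes programmatically. The following is a minimal sketch: the helper name declared_charset and the regular expression are illustrative assumptions, not part of lxml or urllib, and the pattern only covers the common charset=... form.

```python
import re

# Hypothetical helper: matches the first "charset=..." declaration in raw HTML bytes.
_CHARSET_RE = re.compile(br'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)

def declared_charset(raw_html):
    """Return the charset named in the HTML as a str, or None if none is declared."""
    match = _CHARSET_RE.search(raw_html)
    return match.group(1).decode('ascii') if match else None

sample = b'<meta http-equiv="Content-Type" content="text/html; charset=GBK" />'
print(declared_charset(sample))  # GBK
```

This only reports what the page claims about itself; it does not check that the bytes really are in that encoding.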

And in this code from the question:

file=urllib.urlopen(down).read()
root=lxml.html.document_fromstring(file)

file is actually a byte string, so the decoding from GBK-encoded bytes to a Unicode string happens inside the document_fromstring method.

The problem is that the page is not actually encoded in GBK, so lxml decodes it incorrectly, which leads to loss of data. A strict decode shows the mismatch:

>>> file.decode('gbk')
Traceback (most recent call last):
  File "down.py", line 9, in <module>
    file.decode('gbk')
UnicodeDecodeError: 'gbk' codec can't decode bytes in position 7247-7248: illegal multibyte sequence
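The same kind of failure can be reproduced offline. The sketch below uses made-up data, not the actual page content, and is written for Python 3: it encodes a character that GB18030 can represent but GBK cannot, then checks which codec accepts the bytes under strict error handling. The helper name can_decode is an assumption for illustration.

```python
# Illustrative data only: U+20000 is a CJK Extension B character, which
# GB18030 encodes as four bytes and GBK cannot represent at all.
text = u'\u4e2d\u6587 \U00020000'
raw = text.encode('gb18030')

def can_decode(data, encoding):
    """Return True if data decodes under encoding with strict error handling."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

print(can_decode(raw, 'gbk'))      # False
print(can_decode(raw, 'gb18030'))  # True
```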

After some trial and error, we can find that the actual encoding is GB18030. To make the script work, you need to decode the bytes manually:

root=lxml.html.document_fromstring(file.decode('GB18030'))
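The trial and error can be automated. This is a minimal sketch rather than anything lxml provides: the helper name first_working_encoding and the candidate list are assumptions chosen for illustration, and the sample bytes are made up. It is written for Python 3 and tries each candidate with strict decoding, returning the first one that succeeds.

```python
def first_working_encoding(data, candidates=('utf-8', 'gbk', 'gb18030')):
    """Return (encoding, text) for the first candidate that decodes data strictly.

    Raises ValueError if no candidate can decode the data.
    """
    for encoding in candidates:
        try:
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings could decode the data')

# Made-up sample: GB18030 bytes containing a character that GBK lacks.
sample = u'\u7f16\u7a0b\u5bfc\u8bba \U00020000'.encode('gb18030')
encoding, text = first_working_encoding(sample)
print(encoding)  # gb18030
```

A successful strict decode shows only that the bytes are consistent with that encoding, not that the guess is right, so the candidate order matters. The decoded text can then be passed to lxml.html.document_fromstring in place of the raw bytes.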

Answer 1: accepted, 2017-01-26 14:42:53
