Trying to get text from a certain part of a website using lxml.html

Question

I have some Python code that is supposed to get the HTML from a certain part of a website, using the XPath of the HTML tag's location.

def wordorigins(word):
    pageopen = lxml.html.fromstring("http://www.merriam-webster.com/dictionary/" + str(word))
    pbody = pageopen.xpath("/html/body/div[1]/div/div[4]/div/div[1]/main/article/div[5]/div[3]/div[1]/div/p[1]")
    etybody = lxml.html.fromstring(pbody)
    etytxt = etybody.xpath('text()')
    etytxt = etytxt.replace("<em>", "")
    etytxt = etytxt.replace("</em>", "")
    return etytxt

This code raises the following error about expecting a string or a buffer:

Traceback (most recent call last):
  File "mott.py", line 47, in <module>
    print wordorigins(x)
  File "mott.py", line 30, in wordorigins
    etybody = lxml.html.fromstring(pbody)
  File "/usr/lib/python2.7/site-packages/lxml/html/__init__.py", line 866, in fromstring
    is_full_html = _looks_like_full_html_unicode(html)
TypeError: expected string or buffer

Thoughts?

Answer 1

The xpath() method returns a list of results, whereas fromstring() expects a string.

But you don't need to reparse that part of the document. Just use what you've already found:

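import urllib2  # Python 2, as in the traceback; on Python 3 use urllib.request
import lxml.html

def wordorigins(word):
    # fromstring() parses markup, not a URL, so fetch the page first
    url = "http://www.merriam-webster.com/dictionary/" + str(word)
    pageopen = lxml.html.fromstring(urllib2.urlopen(url).read())
    pbody = pageopen.xpath("/html/body/div[1]/div/div[4]/div/div[1]/main/article/div[5]/div[3]/div[1]/div/p[1]")[0]
    etytxt = pbody.text_content()
    etytxt = etytxt.replace("<em>", "")
    etytxt = etytxt.replace("</em>", "")
    return etytxt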

Note that I'm using the text_content() method instead of xpath("text()").
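
To make the difference concrete, here is a minimal sketch run on an inline snippet rather than the live page. The markup in SNIPPET is made up for illustration and is not the real Merriam-Webster structure; the code is written for Python 3.

```python
import lxml.html

# Hypothetical markup standing in for the etymology paragraph (an assumption,
# not the actual page content).
SNIPPET = "<div><p>Middle English, from <em>Latin</em> origo</p></div>"

root = lxml.html.fromstring(SNIPPET)

# xpath() returns a list, even when only one element matches.
matches = root.xpath(".//p")
first_p = matches[0]

# xpath("text()") gives a list of the paragraph's direct text nodes only,
# so the text inside <em> is left out.
text_nodes = first_p.xpath("text()")

# text_content() gives one string with the text of all descendants.
full_text = first_p.text_content()

print(type(matches).__name__)
print(text_nodes)
print(full_text)
```

So xpath("text()") would still leave you with a list to join, and it would drop the words wrapped in child tags, while text_content() hands back the whole paragraph as a single string.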

Answer 2

As mentioned in @alecxe's answer, the xpath() method returns a list of matched elements in this case, hence the error when you tried to pass that list to lxml.html.fromstring(). Another thing to note is that neither XPath's text() nor lxml's text_content() method will ever return a string containing tags such as <em></em>. Both return text only, with any tags already stripped, so the two replace() lines are not needed. You can simply use text_content() or XPath's string() function (instead of text()):

......
# either of the following lines should be enough
etytxt = pbody[0].xpath('string()')
etytxt = pbody[0].text_content()
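
A small sketch of this on an inline snippet, assuming Python 3; the markup in SNIPPET is hypothetical and only stands in for the real paragraph. It also shows lxml.html.tostring() for contrast, which is the call that does return markup.

```python
import lxml.html

# Hypothetical markup (an assumption, not the real page content).
SNIPPET = "<p>from <em>Latin</em> <em>origo</em>, beginning</p>"
p = lxml.html.fromstring(SNIPPET)

# Both calls return the text of the element and its descendants, without tags.
via_string = p.xpath("string()")
via_text_content = p.text_content()

# The replace() calls have nothing to remove, so the text is unchanged.
stripped = via_text_content.replace("<em>", "").replace("</em>", "")

# Serializing the element is what yields the markup, tags included.
markup = lxml.html.tostring(p, encoding="unicode")

print(via_string)
print(stripped == via_text_content)
print(markup)
```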

Answer 1
1 Accepted 2016-05-06 05:30:37

Answer 2
1 2016-05-06 06:20:44
