
Scraping data with Python LXML XPath

I am trying to parse a website for

blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah 

(there are many of these, and I want all of them in some tokenized form). The problem is that "a  href" actually has two spaces, not just one (there are some "a href" links with one space that I do NOT want to retrieve), so using tree.xpath('//a/@href') doesn't quite work. Does anyone have any suggestions on what to do?

Thanks!

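I don't know lxml, but you can definitely use BeautifulSoup: find all the <a> tags on the page, then loop over them and check whether each <a href=...> matches your regex pattern. If it matches, scrape the URL.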

This code works as expected:

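from lxml import etree

source = "file:///path/to/file.html"  # can be an http URL too
# HTMLParser tolerates real-world HTML that is not well-formed XML
doc = etree.parse(source, etree.HTMLParser())

# Print the href of the first <a> element in the document
print(doc.xpath('//a/@href')[0])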

Edit: AFAIK it's not possible to do what you want with lxml.

You can use a regular expression instead.

"(there are some that are "a href" with one space that I do NOT want to retrieve)"

I think this means that you only want to locate elements where there is more than one space between the a and the href. XML allows any amount of whitespace between the tag name and an attribute (spaces, tabs and newlines are all allowed). That whitespace is discarded when the text is parsed and the document tree is created. lxml and XPath work with node objects in the document tree, not with the original text that was parsed to build the tree.
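To illustrate the point, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (an assumption made so the snippet has no third-party dependency; lxml exposes the same basic element API). The SAMPLE markup is hypothetical.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Hypothetical sample: the first link has two spaces before href, the second has one.
SAMPLE = '<root><a  href="wanted" title="t">x</a><a href="unwanted">y</a></root>'

root = ET.fromstring(SAMPLE)
links = root.findall(".//a")

# Both elements expose href in exactly the same way; the original
# spacing between "a" and "href" is not stored on the node.
hrefs = [a.get("href") for a in links]
print(hrefs)
```

Once the tree exists, nothing on either element records how many spaces preceded the attribute, which is why an XPath expression cannot tell the two links apart.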

One option is to use regular expressions to find the text you want. But really, since this is perfectly valid XML/HTML, why bother to remove a few spaces?
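A rough sketch of the regular-expression route, run against the raw page source rather than the parsed tree. The HTML string, the pattern and the helper name two_space_hrefs are all made up for illustration, and the pattern assumes double-quoted href values that directly follow the tag name.

```python
import re

# Hypothetical raw page source (not parsed into a tree).
HTML = '''blahblahblah
<a  href="http://example.com/wanted" title="NOT THIS">text</a>
blahblahblah
<a href="http://example.com/unwanted">text</a>
'''

# "<a", then exactly two spaces, then href="..."; the group captures the URL.
TWO_SPACE_HREF = re.compile(r'<a {2}href="([^"]*)"')

def two_space_hrefs(source):
    """Return href values of <a> tags written with two spaces before href."""
    return TWO_SPACE_HREF.findall(source)

wanted = two_space_hrefs(HTML)
print(wanted)
```

If "two or more whitespace characters" is closer to what you need, `\s{2,}` in place of ` {2}` would be the obvious variation.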

Use an XPath expression to find all the nodes, then iterate through those nodes looking for a match. You can obtain a string representation of a node with:

etree.tostring(node)
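A small sketch of that loop, again using the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (an assumption; with lxml you would select the nodes with doc.xpath('//a') instead). The SAMPLE markup is hypothetical.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Hypothetical sample: two spaces before the first href, one before the second.
SAMPLE = '<root><a  href="wanted">x</a><a href="unwanted">y</a></root>'
root = ET.fromstring(SAMPLE)

# Serialize each <a> node back to text; encoding="unicode" returns str, not bytes.
serialized = [ET.tostring(node, encoding="unicode") for node in root.iter("a")]
for text in serialized:
    print(text)
```

One caveat worth checking in your own setup: the string is regenerated from the tree, so as far as I can tell it comes out with a single space before href for both links. It is handy for matching on attribute values or text, but probably not for recovering the original spacing.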

For further reference: http://lxml.de/tutorial.html#elements-carry-attributes-as-a-dict
