Extract the content of an HTML file

Question

I've got an HTML file which looks like this (simplified):

<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>

What I'd like to extract is the content of table class="main"; in other words, I'd like to write exactly what is shown above to a file. Note that the example is simplified: around the <table> tags there are many others. I tried to extract the content using the following code:

import lxml.html

root = lxml.html.parse('www.test.xyz').getroot()

for empty in root.xpath('//*[self::b or self::i][not(node())]'):
    empty.getparent().remove(empty)

tables = root.cssselect('table.main')

The above code works, but one part of the output appears twice. This is the result of the code:

<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>

So the problem is that the inner table appears a second time at the end. Why does this happen, and how can it be avoided?

paul t., also a Stack Overflow user, told me to use "root.xpath('//table[@class="main" and not(.//table[@class="main"])]')". That code prints exactly the part I get twice.

I hope the problem is described clearly enough. Thanks for any help and suggestions :)

Answer 1

You want to select every table with class "main" that is not itself a descendant of another table with class "main". Otherwise a nested table is returned once inside its parent and once on its own.
This seems to work fine:

root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')
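To see why the nested table shows up twice, and what the ancestor test changes, here is a minimal sketch. The SAMPLE string is a hypothetical stand-in for the real page (with tr/td added so the nesting is valid HTML), and the variable names are illustrative only. Note that @class="main" matches the attribute value exactly, whereas the CSS selector table.main also matches elements carrying additional classes.

```python
import lxml.html

# Hypothetical sample mirroring the simplified markup from the question.
SAMPLE = """
<html><body>
<table class="main"><tr><td>
Here is some text.
<table class="main"><tr><td>Here is another text which ends right here.</td></tr></table>
Here are also some words...
</td></tr></table>
</body></html>
"""

root = lxml.html.fromstring(SAMPLE)

# Every table with class "main", nested ones included. The inner table is
# matched on its own and is also serialized as part of the outer table.
all_tables = root.xpath('//table[@class="main"]')

# Only tables with no "main" table among their ancestors: the outermost ones.
outer_tables = root.xpath(
    '//table[@class="main" and not(ancestor::table[@class="main"])]'
)

# Only tables that contain no other "main" table: the innermost ones.
inner_tables = root.xpath(
    '//table[@class="main" and not(.//table[@class="main"])]'
)

# Serialize the outermost tables; each nested table appears once, in place.
html_out = "".join(
    lxml.html.tostring(table, encoding="unicode", with_tail=False)
    for table in outer_tables
)
```

The string in html_out can then be written to a file with the usual open(...).write(...) call. The not(.//table[...]) variant from the question selects the opposite end of the nesting, which is why it printed only the duplicated inner part.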

Answer 1: accepted, 2013-08-31 17:08:14