Extract the content of an HTML file

I've got an HTML file which looks like this (simplified):

<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>

What I'd like to extract is the content of <table class="main">, that is, I'd like to write exactly what is shown above to a file. Note that the example is simplified: in the real file there are many other tags around these tags. I tried to extract the content using the following code:

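import lxml.html

# Parse the document and get its root element
root = lxml.html.parse('www.test.xyz').getroot()

# Remove empty <b> and <i> elements
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
    empty.getparent().remove(empty)

# Select every <table class="main">, including nested ones
tables = root.cssselect('table.main')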

The above code works, but one part of the output appears twice. The result of the code is:

<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>

So the problem is that the inner table appears a second time at the end. Why does this happen, and how can it be avoided?

paul t., another Stack Overflow user, told me to use root.xpath('//table[@class="main" and not(.//table[@class="main"])]'). That expression returns exactly the part that appears twice, i.e. only the inner table.

I hope the problem is described clearly enough. Thanks for any help and suggestions :)

You want to select every table with class "main" that is not itself a descendant of another table with class "main". The nested table is already part of the outer table's markup, so selecting it separately is what makes it show up twice.
This seems to work fine:

root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')
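To compare the three selections discussed here, the following is a minimal sketch rather than the asker's actual setup. It assumes a small inline sample (the `SAMPLE` string, with `<tr>`/`<td>` wrappers added so the markup is conventional) and a hypothetical helper named `serialize`; only `lxml.html.fromstring`, `xpath`, and `lxml.html.tostring` come from lxml itself.

```python
import lxml.html

# Assumed sample, modelled on the question: one table.main nested in another.
SAMPLE = """
<html><body>
<p>Unrelated content before.</p>
<table class="main"><tr><td>
Here is some text.
<table class="main"><tr><td>Here is another text which ends right here.</td></tr></table>
Here are also some words...
</td></tr></table>
</body></html>
"""

root = lxml.html.fromstring(SAMPLE)

# Every table.main, outer and nested alike.
all_tables = root.xpath('//table[@class="main"]')

# Only tables without a table.main ancestor: the outermost ones.
outer_tables = root.xpath(
    '//table[@class="main" and not(ancestor::table[@class="main"])]'
)

# Only tables without a table.main descendant: the innermost ones.
inner_tables = root.xpath(
    '//table[@class="main" and not(.//table[@class="main"])]'
)


def serialize(tables):
    """Hypothetical helper: join the markup of each selected table."""
    return "\n".join(
        lxml.html.tostring(t, encoding="unicode", with_tail=False)
        for t in tables
    )


outer_html = serialize(outer_tables)
print(len(all_tables), len(outer_tables), len(inner_tables))
print(outer_html)
```

Serializing `all_tables` repeats the nested table, because it is emitted once inside its parent and once on its own, while `outer_html` contains it only once. Writing `outer_html` to a file opened with `encoding="utf-8"` should then give the output the question asks for. Note that `@class="main"` matches only when the attribute is exactly `main`, whereas the CSS selector `table.main` also matches elements that carry additional classes.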
