
Easy way to get data between tags of xml or html files in python?

I am using Python and need to find and retrieve all character data between tags:

<tag>I need this stuff</tag>

I then want to output the found data to another file. I am just looking for a very easy and efficient way to do this.

Could you post a quick code snippet to show how easy it is to use? I am having a bit of trouble understanding the parsers.

Without external modules, e.g.:

>>> myhtml = """ <tag>I need this stuff</tag>
... blah blah
... <tag>I need this stuff too
... </tag>
... blah blah """
>>> for item in myhtml.split("</tag>"):
...     if "<tag>" in item:
...         print(item[item.find("<tag>") + len("<tag>"):])
...
I need this stuff
I need this stuff too
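Another standard-library option, as a quick sketch, is a non-greedy regular expression. This is fine for simple, well-formed input like the example above, but a real parser is safer for arbitrary HTML:

```python
import re

myhtml = """ <tag>I need this stuff</tag>
blah blah
<tag>I need this stuff too
</tag>
blah blah """

# re.DOTALL lets . match newlines, so multi-line tag bodies are captured too;
# the non-greedy (.*?) stops at the first closing tag
matches = re.findall(r"<tag>(.*?)</tag>", myhtml, re.DOTALL)
for text in matches:
    print(text)
```

Note the second match keeps the newline that sits before its closing tag, since the capture is verbatim.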

Beautiful Soup is a wonderful HTML/XML parser for Python:

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

  1. Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
  2. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
  3. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.

I quite like parsing into an element tree and then using element.text and element.tail.

It also has XPath-like searching:

>>> from xml.etree.ElementTree import ElementTree
>>> tree = ElementTree()
>>> tree.parse("index.xhtml")
<Element html at b7d3f1ec>
>>> p = tree.find("body/p")     # Finds first occurrence of tag p in body
>>> p
<Element p at 8416e0c>
>>> p.text
"Some text in the Paragraph"
>>> links = list(p.iter("a"))   # Collects all links under the paragraph
>>> links
[<Element a at b7d4f9ec>, <Element a at b7d4fb0c>]
>>> for i in links:             # Iterates through all found links
...     i.attrib["target"] = "blank"
...
>>> tree.write("output.xhtml")
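The same approach can answer the original question directly. A minimal sketch with the standard library's xml.etree.ElementTree, collecting the text of every <tag> element and writing it to another file (the output filename here is just an example):

```python
import xml.etree.ElementTree as ET

# a small inline document standing in for the real input file
doc = "<root><tag>I need this stuff</tag><other/><tag>this too</tag></root>"
root = ET.fromstring(doc)

# iter("tag") walks the whole tree and yields every <tag> element
found = [el.text for el in root.iter("tag")]

# write the collected text to another file, one item per line
with open("output.txt", "w") as f:
    f.write("\n".join(found))
```

For a file on disk, replace ET.fromstring(doc) with ET.parse("input.xml").getroot().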

This is how I am doing it:

    (myhtml.split('<tag>')[1]).split('</tag>')[0]

Tell me if it worked!

Use XPath and lxml:

from lxml import etree

# read the page into memory as a string; etree.HTML expects markup, not a file object
with open("pageToParse.html", "r") as pageInMemory:
    parsedPage = etree.HTML(pageInMemory.read())

# //tag//text() returns every text node inside any <tag> element
yourListOfText = parsedPage.xpath("//tag//text()")

with open("savedFile", "w") as saveFile:
    saveFile.writelines(yourListOfText)

Faster than Beautiful Soup.

If you want to test out your XPaths, I find Firefox's XPather extremely helpful.

Further notes:

def value_tag(s):
    # drop everything up to and including the opening tag's '>'
    i = s.index('>')
    s = s[i+1:]
    # keep everything before the closing tag's '<'
    i = s.index('<')
    s = s[:i]
    return s
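For reference, here is that helper applied to a single-element string (the function is repeated so the snippet runs on its own):

```python
def value_tag(s):
    # drop everything up to and including the opening tag's '>'
    i = s.index('>')
    s = s[i+1:]
    # keep everything before the closing tag's '<'
    i = s.index('<')
    return s[:i]

print(value_tag("<tag>I need this stuff</tag>"))
```

Note it only extracts the first tag's contents and raises ValueError if no '>' or '<' is present.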
