Easy way to get data between tags of xml or html files in python?

Question

I am using Python and need to find and retrieve all character data between tags:

<tag>I need this stuff</tag>

I then want to output the found data to another file. I am just looking for a very easy and efficient way to do this.

If you can, please post a quick code snippet that shows the ease of use, because I am having a bit of trouble understanding the parsers.

Answer 1

Without external modules, e.g.:

>>> myhtml = """ <tag>I need this stuff</tag>
... blah blah
... <tag>I need this stuff too
... </tag>
... blah blah """
>>> for item in myhtml.split("</tag>"):
...     if "<tag>" in item:
...         # keep what follows the opening tag; strip() drops the trailing newline
...         print(item[item.find("<tag>") + len("<tag>"):].strip())
...
I need this stuff
I need this stuff too
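The question also asks for the found data to be written to another file. Here is a minimal sketch of that step, which swaps the split loop for a regular expression from the standard `re` module. It assumes the tag has no attributes and is never nested; the helper name `extract_tag_text` and the output file name `found.txt` are made up for the example.

```python
import re

def extract_tag_text(markup, tag="tag"):
    """Return the stripped text between every <tag>...</tag> pair.

    Assumes the tag has no attributes and is never nested.
    """
    # Non-greedy match; re.DOTALL lets "." span line breaks inside the tag.
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL)
    return [match.strip() for match in pattern.findall(markup)]

myhtml = """ <tag>I need this stuff</tag>
blah blah
<tag>I need this stuff too
</tag>
blah blah """

found = extract_tag_text(myhtml)

# Write one result per line to a hypothetical output file.
with open("found.txt", "w") as out:
    out.write("\n".join(found) + "\n")
```

For anything beyond simple, regular markup, a real parser is the safer choice.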

Answer 2

Beautiful Soup is a wonderful HTML/XML parser for Python:

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
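Beautiful Soup is a third-party package, so it has to be installed separately. As a rough illustration of the same tolerant-parser idea using only the standard library, here is a minimal sketch built on `html.parser.HTMLParser`. This is not Beautiful Soup's API; the class name `TagTextCollector` and the sample markup are made up for the example.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0        # > 0 while we are inside the wanted tag
        self.results = []     # one string per outermost occurrence
        self._chunks = []     # text pieces of the occurrence being read

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._chunks).strip())
                self._chunks = []

    def handle_data(self, data):
        if self.depth > 0:
            self._chunks.append(data)

collector = TagTextCollector("tag")
# The unclosed <b> is deliberate: the parser does not choke on it.
collector.feed("<tag>I need this stuff</tag> blah <tag>and <b>this</tag>")
collector.close()
print(collector.results)
```

The collected strings in `collector.results` can then be written to a file like any other list of strings.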

Answer 3

I quite like parsing into an ElementTree and then using element.text and element.tail.

It also has XPath-like searching:

>>> from xml.etree.ElementTree import ElementTree
>>> tree = ElementTree()
>>> tree.parse("index.xhtml")
<Element 'html' at 0xb7d3f1ec>
>>> p = tree.find("body/p")     # Finds first occurrence of tag p in body
>>> p
<Element 'p' at 0x8416e0c>
>>> p.text
'Some text in the Paragraph'
>>> links = list(p.iter("a"))   # List of all links (getiterator() was removed in Python 3.9)
>>> links
[<Element 'a' at 0xb7d4f9ec>, <Element 'a' at 0xb7d4fb0c>]
>>> for i in links:             # Iterates through all found links
...     i.attrib["target"] = "blank"
...
>>> tree.write("output.xhtml")
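Applied to the original question, the same module can collect the text of every matching element and write it out. This is a minimal sketch, assuming the input is well-formed XML; the sample document and the output file name `tag_text.txt` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; it must be well-formed XML for ElementTree.
doc = """<root>
  <tag>I need this stuff</tag>
  <other>blah blah</other>
  <tag>I need <b>this</b> stuff too</tag>
</root>"""

root = ET.fromstring(doc)

# iter("tag") visits every <tag> element at any depth;
# itertext() also picks up text inside child elements such as <b>.
texts = ["".join(element.itertext()).strip() for element in root.iter("tag")]

# Write one result per line to a hypothetical output file.
with open("tag_text.txt", "w") as out:
    for text in texts:
        out.write(text + "\n")
```

Using `itertext()` rather than `element.text` alone matters when the tag contains child elements, because `element.text` stops at the first child.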

Answer 4

This is how I am doing it:

    (myhtml.split('<tag>')[1]).split('</tag>')[0]

Tell me if it worked!
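Note that this one-liner returns only the first match and raises an IndexError when the opening tag is missing. A minimal sketch of a more forgiving variant, using `str.partition`, might look like this (the function name `first_between` is made up for the example):

```python
def first_between(text, start="<tag>", end="</tag>"):
    """Return the text between the first start/end pair, or None if absent."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None          # opening marker missing
    value, found_end, _ = rest.partition(end)
    if not found_end:
        return None          # closing marker missing
    return value

print(first_between("x <tag>I need this stuff</tag> y"))  # I need this stuff
print(first_between("no tags here"))                      # None
```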

Answer 5

Use XPath and lxml:

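from lxml import etree

# etree.HTML() expects the markup as a string, not a file object
with open("pageToParse.html", "r") as pageInMemory:
    parsedPage = etree.HTML(pageInMemory.read())

# every text node found anywhere inside a <tag> element
yourListOfText = parsedPage.xpath("//tag//text()")

# the with blocks close both files automatically
with open("savedFile", "w") as saveFile:
    saveFile.writelines(yourListOfText)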

This is faster than Beautiful Soup.

If you want to test your XPath expressions, I find Firefox's XPather extension extremely helpful.

Answer 6

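def value_tag(s):
    """Return the text between the first '>' and the next '<' in s.

    Raises ValueError if either character is missing.
    """
    i = s.index('>')      # end of the opening tag
    s = s[i+1:]           # drop everything up to and including '>'
    i = s.index('<')      # start of the closing tag
    s = s[:i]             # keep only the text before it
    return s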

6 answers

solution1
7 ACCEPTED 2010-01-20 00:00:54

solution2
2 2010-01-19 23:10:44

solution3
2 2010-01-19 23:11:59

solution4
1 2017-08-16 09:45:37

solution5
0 2010-01-20 06:15:05

solution6
0 2017-05-16 23:05:05
