ElementTree text mixed with tags

Question

Imagine the following text:

<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>

How would I parse this with the etree interface? Given the description element, the .text property returns only the text before the first <b> tag ("the thing "). The .getchildren() method returns the <b> elements, but not the rest of the text.

Many thanks!

Answer 1

Use the .text_content() method, which lxml.html elements provide. Working sample using lxml.html:

from lxml.html import fromstring   

data = """
<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>
"""

tree = fromstring(data)

print(tree.xpath("//description")[0].text_content().strip())

Prints:

the thing stuff is very important for various reasons, notably other things.
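
For comparison, the same flattened string can probably be obtained with the standard library's xml.etree.ElementTree by joining itertext(). This is only a sketch, assuming the input is well-formed XML; the variable names are illustrative and not from the original answer.

```python
import xml.etree.ElementTree as ET

data = """
<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>
"""

# ET.fromstring returns the root element, here <description>
description = ET.fromstring(data)

# itertext() yields the text and tail strings of the subtree in document order
flat = "".join(description.itertext()).strip()
print(flat)
```

Unlike lxml.html, the standard-library parser rejects malformed markup, so this variant only suits input that is valid XML.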

I forgot to specify one thing though, sorry. My ideal parsed version would contain a list of subsections: [normal("the thing"), bold("stuff"), normal("....")]. Is that possible with the lxml.html library?

Assuming you'll have only text nodes and b elements inside a description:

for item in tree.xpath("//description/*|//description/text()"):
    # text() results are str subclasses (basestring on Python 2); everything else is a <b> element
    print([item.strip(), 'normal'] if isinstance(item, str) else [item.text, 'bold'])

Prints:

['the thing', 'normal']
['stuff', 'bold']
['is very important for various reasons, notably', 'normal']
['other things', 'bold']
['.', 'normal']
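
The reason .text alone is not enough is that ElementTree stores mixed content in two places: text before the first child sits on the parent's .text, and text after each child's closing tag sits on that child's .tail. A minimal sketch with the standard-library xml.etree.ElementTree follows. It assumes every child of description is a <b> element, and split_segments is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

def split_segments(elem):
    """Return [text, kind] pairs for elem's mixed content, in document order."""
    segments = []
    # Text before the first child lives on the parent's .text
    if elem.text and elem.text.strip():
        segments.append([elem.text.strip(), "normal"])
    for child in elem:
        # Assumption: every child is a <b>; flatten its content in case of nesting
        segments.append(["".join(child.itertext()).strip(), "bold"])
        # Text after the child's closing tag lives on the child's .tail
        if child.tail and child.tail.strip():
            segments.append([child.tail.strip(), "normal"])
    return segments

description = ET.fromstring(
    "<description>the thing <b>stuff</b> is very important for various "
    "reasons, notably <b>other things</b>.</description>"
)
segments = split_segments(description)
for segment in segments:
    print(segment)
```

If other inline tags can appear, checking child.tag instead of assuming "bold" would be the natural extension.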

1 answer

solution1
1 ACCEPTED 2015-12-16 18:12:16
