
XPATH - how to get inner text data littered with <br> tags?

I have HTML text like this

<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>


<othertag>
 data
</othertag>
<moretag>
 data
</moretag>

I'm trying to query the following with XPATH

//p//text() | //othertag//text() | //moretag//text()

which gives me text that is split at each <br> tag, like this:

('This is some important data','Even this is data','this is useful too','othertag text data','moretag text data')

I'd like it as a single string:

('This is some important data Even this is data this is useful too')

because I'll be querying other elements with the XPath union operator (|), and it's very important that this text content stays properly separated per element.

How can I do this?

If this is impossible, can I at least get the inner HTML of <p> somehow, so that I can store it as text like this:

This is some important data<br>Even this is data<br>this is useful too

I'm using lxml.html in Python 2.7

Update

Based on your edit, maybe you can use the XPath string() function. For example:

>>> doc.xpath('string(//p)')
'\n    This is some important data\n    \n    Even this is data\n    \n    this is useful too\n  '
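
Note that string(//p) converts only the first matching node, so on its own it does not keep several elements apart. One way to get one clean string per element is to select the elements first and then evaluate normalize-space(string(.)) relative to each of them. The following is a minimal sketch rather than a definitive solution; it assumes the markup is wrapped in a <div> (added here only so that the fragment has a single root) and is parsed with lxml.html.fromstring:

```python
import lxml.html

# Sample markup from the question; the outer <div> is an assumption,
# added only so that the fragment has a single root element.
raw = """<div>
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
 data
</othertag>
<moretag>
 data
</moretag>
</div>"""

doc = lxml.html.fromstring(raw)

# Select the elements themselves (not their text nodes), then build
# one whitespace-normalized string per element.
elements = doc.xpath('//p | //othertag | //moretag')
texts = [el.xpath('normalize-space(string(.))') for el in elements]
```

Each entry in texts then corresponds to exactly one matched element, so the text of <p> is no longer split at the <br> tags.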

(original answer follows)

If you're getting back the text you want in multiple pieces:

('This is some important data','Even this is data','this is useful too')

Why not just join those pieces?

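>>> doc.xpath('//p/text()')
['\n    This is some important data\n    ', '\n    Even this is data\n    ', '\n    this is useful too\n  ']
>>> ' '.join(doc.xpath('//p/text()'))
'\n    This is some important data\n     \n    Even this is data\n     \n    this is useful too\n  '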

You can even get rid of the line breaks:

>>> ' '.join(x.strip() for x in doc.xpath('//p/text()'))
'This is some important data Even this is data this is useful too'

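If you wanted the "inner html" of the p element, you could call lxml.etree.tostring on all of its children. Note that this leaves out the text before the first <br>, which lxml stores in the element's .text attribute rather than in a child node: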

>>> ''.join(etree.tostring(x) for x in doc.xpath('//p')[0].getchildren())
'<br/>\n    Even this is data\n    <br/>\n    this is useful too\n  '
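
To get the markup in the form asked for in the question, the leading text has to be added back in front of the serialized children. A possible sketch follows; inner_html is a hypothetical helper name, and it uses lxml.html.tostring, which serializes as HTML so that <br> is not rewritten as <br/>:

```python
import lxml.html

def inner_html(element):
    """Return the markup inside `element`, without its own start/end tags.

    Hypothetical helper for illustration, not part of lxml.
    """
    # Text before the first child lives in element.text (may be None).
    parts = [element.text or '']
    for child in element:
        # tostring() includes each child's tail text by default.
        parts.append(lxml.html.tostring(child, encoding='unicode'))
    return ''.join(parts)

p = lxml.html.fromstring(
    '<p>This is some important data<br>Even this is data<br>this is useful too</p>'
)
result = inner_html(p)
```

With markup that contains line breaks and indentation, as in the question, the result keeps that whitespace as well, so it may still need stripping.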

NB: All of these examples assume:

>>> from lxml import etree
>>> doc = etree.parse(open('myfile.html'),
...    parser=etree.HTMLParser())

You can also expose your own functions in XPath:

import lxml.html, lxml.etree

raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''

doc = lxml.html.fromstring(raw_doc)
ns = lxml.etree.FunctionNamespace(None)

def cat(context, a):
    return [''.join(a)]
ns['cat'] = cat

print repr(doc.xpath('cat(//p/text())'))

which prints

'\nThis is some important data\n\nEven this is data\n\nthis is useful too\n'

You can perform the transformations however you like using this method.
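
For instance, the function could collapse the whitespace before returning. The sketch below is only a variation on the example above: the name joined is hypothetical, and the function is registered globally through FunctionNamespace(None) in the same way as cat:

```python
import lxml.etree
import lxml.html

raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''

def joined(context, nodes):
    # `nodes` is the list of text nodes selected by the XPath argument.
    # Concatenate them, then collapse every run of whitespace to one space.
    return ' '.join(''.join(nodes).split())

# Register the function in the default (empty) XPath function namespace.
ns = lxml.etree.FunctionNamespace(None)
ns['joined'] = joined

doc = lxml.html.fromstring(raw_doc)
result = doc.xpath('joined(//p/text())')
```

Because the function returns a plain string instead of a list, the XPath call yields a single string value.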
