I have HTML text like this:
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
data
</othertag>
<moretag>
data
</moretag>
I'm trying to query it with the following XPath:
//p//text() | //othertag//text() | //moretag//text()
which gives me text that is broken at each <br> tag, like this:
('This is some important data','Even this is data','this is useful too','othertag text data','moretag text data')
I'd like it as a complete string,
('This is some important data Even this is data this is useful too')
because I'll be querying other elements with the XPath union operator (|), and it's very important that this text content is divided properly.
How can I do this?
If this is impossible, can I at least get the inner HTML of <p> somehow, so that I can store it as text like this:
This is some important data<br>Even this is data<br>this is useful too
I'm using lxml.html in Python 2.7.
Update
Based on your edit, maybe you can use the XPath string() function. For example:
>>> doc.xpath('string(//p)')
'\n This is some important data\n \n Even this is data\n \n this is useful too\n '
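If the embedded newlines are unwanted, XPath 1.0 also has normalize-space(), which takes the string value of its argument, trims it, and collapses each run of whitespace into a single space. Here is a minimal sketch, assuming the markup from the question is held in a string (the name raw_doc is only illustrative) and parsed with lxml.html.fromstring. It is written to run on Python 3 as well.

```python
import lxml.html

# Assumed input: the markup from the question, held in a string.
raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
data
</othertag>
'''
doc = lxml.html.fromstring(raw_doc)

# normalize-space() converts the first matching <p> to its string value,
# trims it, and collapses every run of whitespace (newlines included)
# into a single space.
text = doc.xpath('normalize-space(//p)')
print(text)
```

Like string(), this only looks at the first node of the node-set, so it suits a single element rather than a union of several.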
(original answer follows)
If you're getting back the text you want in multiple pieces:
('This is some important data','Even this is data','this is useful too')
Why not just join those pieces?
>>> ' '.join(doc.xpath('//p/text()'))
'\n This is some important data\n  \n Even this is data\n  \n this is useful too\n '
You can even get rid of the line breaks:
>>> ' '.join(x.strip() for x in doc.xpath('//p/text()'))
'This is some important data Even this is data this is useful too'
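To keep the text of several different elements properly divided, one option is to take the union over the elements themselves rather than over their text nodes, and then join the text inside each element separately. The following is only a sketch under that assumption. raw_doc is an illustrative name, and the code is written to run on Python 3 as well.

```python
import lxml.html

# Assumed input: the markup from the question, held in a string.
raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
data
</othertag>
<moretag>
data
</moretag>
'''
doc = lxml.html.fromstring(raw_doc)

# Union over elements (not text nodes): one result per element,
# returned in document order.
elements = doc.xpath('//p | //othertag | //moretag')

# For each element, join all text nodes below it, then collapse
# whitespace so the <br> boundaries become single spaces.
texts = [' '.join(''.join(el.itertext()).split()) for el in elements]
print(texts)
```

Each entry of texts then corresponds to exactly one matched element, so the <br> tags no longer split the paragraph into separate results.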
If you wanted the "inner html" of the p element, you could call lxml.etree.tostring on all of its children (the text before the first child lives in the element's .text and is not included):
>>> ''.join(etree.tostring(x) for x in doc.xpath('//p')[0].getchildren())
'<br/>\n Even this is data\n <br/>\n this is useful too\n '
NB: All of these examples assume:
>>> from lxml import etree
>>> doc = etree.parse(open('myfile.html'),
... parser=etree.HTMLParser())
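For the inner HTML case, the text before the first <br> has to be added back by hand, because it is stored on the p element itself rather than on any child. A minimal sketch of one way to do that follows. inner_html is a hypothetical helper, not part of lxml, and the input string is an assumed compact version of the question's markup.

```python
import lxml.html
from lxml import etree

# Assumed input: a compact version of the <p> from the question.
doc = lxml.html.fromstring(
    '<div><p>This is some important data<br>Even this is data'
    '<br>this is useful too</p></div>'
)

def inner_html(element):
    """Hypothetical helper: serialize an element's content without its own tags."""
    # Text before the first child is stored in element.text.
    parts = [element.text or '']
    # Each child is serialized together with its tail text
    # (with_tail defaults to True in etree.tostring).
    parts.extend(
        etree.tostring(child, method='html', encoding='unicode')
        for child in element
    )
    return ''.join(parts)

inner = inner_html(doc.xpath('//p')[0])
print(inner)
```

Passing method='html' keeps <br> in its HTML form instead of the XML-style <br/>.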
You can also expose your own functions in XPath:
import lxml.html, lxml.etree
raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''
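doc = lxml.html.fromstring(raw_doc)
# Register cat() in the default (unprefixed) XPath function namespace.
ns = lxml.etree.FunctionNamespace(None)
def cat(context, a):
    # a is the list of text nodes selected by the XPath argument
    return [''.join(a)]
ns['cat'] = cat
print(repr(doc.xpath('cat(//p/text())')))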
which prints
'\nThis is some important data\n\nEven this is data\n\nthis is useful too\n'
You can perform the transformations however you like using this method.
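As one illustration of such a transformation, the registered function can also collapse whitespace before returning. This is only a sketch: squash is a made-up name, and the function returns a plain string instead of a list, on the assumption that lxml hands a string return value back as an XPath string result.

```python
import lxml.etree
import lxml.html

doc = lxml.html.fromstring('''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
''')

def squash(context, nodes):
    # nodes is the list of text nodes selected by the XPath argument.
    # Join them, then collapse every run of whitespace to one space.
    return ' '.join(''.join(nodes).split())

# Register squash() in the default (unprefixed) XPath function namespace.
ns = lxml.etree.FunctionNamespace(None)
ns['squash'] = squash

result = doc.xpath('squash(//p/text())')
print(repr(result))
```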