XPATH - how to get inner text data littered with <br> tags?

I have HTML text like this:

<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>


<othertag>
 data
</othertag>
<moretag>
 data
</moretag>

I'm trying to query it with the following XPath:

//p//text() | //othertag//text() | //moretag//text()

which gives me text that is broken at each <br> tag, like this:

('This is some important data','Even this is data','this is useful too','othetag text data','moretag text data')

I'd like it as one complete string:

('This is some important data Even this is data this is useful too')

because I'll be querying other elements using the | (union) XPath operator, and it's very important that this text content is properly divided per element.

How can I do this?

If this is impossible, can I at least get the inner HTML of <p> somehow, so that I can store it as text like this?

This is some important data<br>Even this is data<br>this is useful too

I'm using lxml.html in Python 2.7.

Update

Based on your edit, maybe you can use the XPath string() function. For example:

>>> doc.xpath('string(//p)')
'\n    This is some important data\n    \n    Even this is data\n    \n    this is useful too\n  '
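The string() result above still carries the newlines and indentation from the source. As a minimal sketch (written in Python 3 syntax, and assuming the same <p> markup as in the question), wrapping the call in the standard XPath 1.0 function normalize-space() should trim the ends and collapse each whitespace run to a single space:

```python
import lxml.html

# Sample markup copied from the question.
raw_doc = """
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
"""

doc = lxml.html.fromstring(raw_doc)

# string(//p) takes the string value of the first <p>;
# normalize-space() strips it and collapses internal whitespace runs.
text = doc.xpath('normalize-space(string(//p))')
print(repr(text))
```

Note that string() only looks at the first node of a node-set, so this form handles one element per query.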

(original answer follows)

If you're getting back the text you want in multiple pieces:

('This is some important data','Even this is data','this is useful too')

Why not just join those pieces?

>>> ' '.join(doc.xpath('//p/text()'))
'\n    This is some important data\n     \n    Even this is data\n     \n    this is useful too\n  '

You can even get rid of the line breaks:

>>> ' '.join(x.strip() for x in doc.xpath('//p/text()'))
'This is some important data Even this is data this is useful too'
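Since the question combines several tags with the | operator and needs the text kept separate per element, one option is to select the elements themselves rather than their text nodes, then flatten each one in Python. The following is a sketch in Python 3 syntax, not the answer's own method: the wrapping <div> is an assumption added so the fragment has a single root, and text_content() is the lxml.html element method that returns all text beneath an element.

```python
import lxml.html

# Markup from the question, wrapped in a <div> (an assumption for this sketch).
raw_doc = """
<div>
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
 data
</othertag>
<moretag>
 data
</moretag>
</div>
"""

doc = lxml.html.fromstring(raw_doc)

# Select whole elements, so each result maps to exactly one output string.
elements = doc.xpath('//p | //othertag | //moretag')

# text_content() gathers every text node under the element;
# split() + join() collapses the whitespace left around the <br> tags.
texts = [' '.join(el.text_content().split()) for el in elements]
print(texts)
```

Each entry in texts then corresponds to one matched element, in document order, so the <br> tags no longer split the <p> content into separate items.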

If you wanted the "inner html" of the p element, you could call lxml.etree.tostring on all of its children:

>>> ''.join(etree.tostring(x) for x in doc.xpath('//p')[0].getchildren())
'<br/>\n    Even this is data\n    <br/>\n    this is useful too\n  '
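The output above starts at the first <br/> because the text before the first child is stored on the parent's .text attribute, not on any child. A hedged sketch in Python 3 syntax that prepends it (the single-line markup is a simplified version of the question's <p>, chosen so the result is easy to read):

```python
import lxml.html
from lxml import etree

# Simplified single-line version of the <p> from the question.
p = lxml.html.fromstring(
    '<p>This is some important data<br>Even this is data<br>this is useful too</p>'
)

# p.text holds the text before the first child; tostring() on each child
# includes that child's tail text (the text following it) by default.
inner_html = (p.text or '') + ''.join(
    etree.tostring(child, encoding='unicode', method='html') for child in p
)
print(inner_html)
```

Passing method='html' serializes the line break as <br> rather than <br/>, and encoding='unicode' makes tostring() return str instead of bytes in Python 3.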

NB: All of these examples assume:

>>> from lxml import etree
>>> doc = etree.parse(open('myfile.html'),
...    parser=etree.HTMLParser())

You can also expose your own functions in XPath:

import lxml.html, lxml.etree

raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''

doc = lxml.html.fromstring(raw_doc)
ns = lxml.etree.FunctionNamespace(None)

def cat(context, a):
    return [''.join(a)]
ns['cat'] = cat

print repr(doc.xpath('cat(//p/text())'))

which prints

'\nThis is some important data\n\nEven this is data\n\nthis is useful too\n'

You can perform the transformations however you like using this method.
