XPath: how to get inner text data that is broken up by <br> tags?

Question

I have HTML text like this:

<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>


<othertag>
 data
</othertag>
<moretag>
 data
</moretag>

I'm trying to query the following with XPath:

//p//text() | //othertag//text() | //moretag//text()

which gives me text that is broken at each <br> tag, like this:

('This is some important data','Even this is data','this is useful too','othertag text data','moretag text data')

I'd like it as a complete string,

('This is some important data Even this is data this is useful too')

because I'll be querying other elements using the | (union) XPath operator, and it's very important that this text content is properly divided.

How can I do this?

If this is impossible, can I at least get the inner HTML of <p> somehow, so that I can store it textually as

This is some important data<br>Even this is data<br>this is useful too

I'm using lxml.html in Python 2.7.

Answer 1

Update

Based on your edit, maybe you can use the XPath string() function. For example:

>>> doc.xpath('string(//p)')
'\n    This is some important data\n    \n    Even this is data\n    \n    this is useful too\n  '
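
If the surrounding whitespace in that result is unwanted, XPath 1.0 also has normalize-space(), which trims the string and collapses internal runs of whitespace. Note that string() applied to a node-set uses only the first node in document order. The following is a minimal sketch rather than part of the original answer: it assumes the question's markup wrapped in a <div>, uses the illustrative names raw, root and text, and is written in Python 3 syntax.

```python
import lxml.html

# Illustrative copy of the question's markup, wrapped in a <div> so the
# fragment has a single root element (an assumption for this sketch).
raw = """<div>
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag> data </othertag>
</div>"""

root = lxml.html.fromstring(raw)

# string(//p) takes the string value of the first <p> only;
# normalize-space() then trims it and collapses internal whitespace.
text = root.xpath('normalize-space(string(//p))')
print(text)
```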

(original answer follows)

If you're getting back the text you want in multiple pieces:

('This is some important data','Even this is data','this is useful too')

Why not just join those pieces?

>>> ' '.join(doc.xpath('//p/text()'))
'\n    This is some important data\n     \n    Even this is data\n     \n    this is useful too\n  '

You can even get rid of the line breaks:

>>> ' '.join(x.strip() for x in doc.xpath('//p/text()'))
'This is some important data Even this is data this is useful too'
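
Because the question combines several elements with the | operator, one way to keep each element's text together is to select the elements themselves and build one string per element. This is a sketch under assumptions, not the answerer's code: the markup is the question's sample wrapped in a <div>, the names raw, root, elements and texts are illustrative, and it uses Python 3 syntax with lxml.html's text_content().

```python
import lxml.html

# Illustrative copy of the question's markup inside a single root <div>.
raw = """<div>
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
 data
</othertag>
<moretag>
 data
</moretag>
</div>"""

root = lxml.html.fromstring(raw)

# Select the elements (not their text nodes), so each result stays whole.
elements = root.xpath('//p | //othertag | //moretag')

# text_content() gathers all descendant text; split/join collapses whitespace.
texts = [' '.join(el.text_content().split()) for el in elements]
print(texts)
```

Each entry in texts then corresponds to exactly one matched element, in document order.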

If you wanted the "inner html" of the p element, you could call lxml.etree.tostring on all of its children:

>>> ''.join(etree.tostring(x) for x in doc.xpath('//p')[0].getchildren())
'<br/>\n    Even this is data\n    <br/>\n    this is useful too\n  '
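
The output above starts at the first <br/> because the text before the first child lives in the element's .text attribute, not in any child. A possible variant that includes it, and serializes with method='html' so that <br> is not written as <br/>, might look like the sketch below. It assumes a compact one-line version of the question's <p> and Python 3 syntax; the names raw, root, p and inner_html are illustrative.

```python
import lxml.html
from lxml import etree

# Compact, illustrative version of the question's paragraph.
raw = ("<div><p>This is some important data<br>"
       "Even this is data<br>this is useful too</p></div>")
root = lxml.html.fromstring(raw)
p = root.xpath('//p')[0]

# p.text is the text before the first child. Each child is serialized
# together with its tail, i.e. the text that follows it inside <p>.
inner_html = (p.text or '') + ''.join(
    etree.tostring(child, encoding='unicode', method='html') for child in p
)
print(inner_html)
```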

NB: All of these examples assume:

>>> from lxml import etree
>>> doc = etree.parse(open('myfile.html'),
...    parser=etree.HTMLParser())

Answer 2

You can also expose your own functions in XPath:

import lxml.html, lxml.etree

raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''

doc = lxml.html.fromstring(raw_doc)
ns = lxml.etree.FunctionNamespace(None)

def cat(context, a):
    return [''.join(a)]
ns['cat'] = cat

print repr(doc.xpath('cat(//p/text())'))

which prints

'\nThis is some important data\n\nEven this is data\n\nthis is useful too\n'

You can perform the transformations however you like using this method.
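
As one example of such a transformation, the registered function could also collapse whitespace while joining. The sketch below is an assumption-laden variant of cat, not code from the answer: squash is a hypothetical name, it returns a plain string instead of a list, and it is written in Python 3 syntax.

```python
import lxml.html
from lxml import etree

doc = lxml.html.fromstring(
    "<p>\nThis is some important data\n<br>\n"
    "Even this is data\n<br>\nthis is useful too\n</p>"
)

def squash(context, nodes):
    """Hypothetical helper: join text nodes and collapse all whitespace."""
    return ' '.join(' '.join(nodes).split())

# Register the function in the default (empty) function namespace,
# the same way cat is registered above.
ns = etree.FunctionNamespace(None)
ns['squash'] = squash

result = doc.xpath('squash(//p/text())')
print(result)
```

Registering under FunctionNamespace(None) makes the function visible to every XPath evaluation in the process, so a distinctive name helps avoid clashes.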

Answer 1
2 2015-07-27 14:09:56

Answer 2
2 2015-07-27 14:24:27
