XPATH - How to get the HTML data inside a tag containing <br> tags?

Question

This question has been asked before.

This is the HTML data:

<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>


<othertag>
 othertag data
</othertag>
<moretag>
 moretag data
</moretag>
....
repeating n times
....

My goal is to extract the data inside <p> without it being split by the <br> tags, along with the other data.

This is my query:

//p//text() | //othertag//text() | //moretag//text()

This gave:

('This is is some important data', 'even this data', 'this is useful too',
'othertag data','moretag data')

Notice above that the <p> tag's text data has been split in the output.

I'd want it formatted as a proper unit, like below:

('This is is some important data even this data this is useful too',
'othertag data','moretag data')

If that is impossible, can I at least get it this way?

('This is is some important <br> data even this data <br> this is useful too',
'othertag data','moretag data')

I cannot use a join statement because it would be hard to selectively join variable list values at variable indexes (no one can predict how many <br> tags there will be, so the data may get split a variable number of times).

My attempts (with help from other users):

string(//p//text()) | //othertag//text() | //moretag//text()

The above query gives an XPath error.

This one as well:

import lxml.html, lxml.etree

ns = lxml.etree.FunctionNamespace(None)

def cat(context, a):
    return ''.join(a)
ns['cat'] = cat

This query also gave an InvalidType error:

cat(//p//text()) | //othertag//text() | //moretag//text()

I'm using Python 2.7.

Answer 1

If you are open to using other libraries, you can use BeautifulSoup for this.

Demo:

>>> s = """<p>
... This is some important data
... <br>
... Even this is data
... <br>
... this is useful too
... </p>
...
...
... <othertag>
...  othertag data
... </othertag>
... <moretag>
...  moretag data
... </moretag>"""

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s,'html.parser')

>>> soup.find('p').text
'\nThis is some important data\n\nEven this is data\n\nthis is useful too\n'

>>> print(soup.find('p').text)

This is some important data

Even this is data

this is useful too
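
If the embedded newlines are unwanted, one possible variation (a sketch, not part of the original answer) is to let BeautifulSoup strip and join the text fragments itself. The variable names `html_doc` and `results` are made up for illustration, and only the three tag names from the question are assumed:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html_doc = """<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>

<othertag>
 othertag data
</othertag>
<moretag>
 moretag data
</moretag>"""

soup = BeautifulSoup(html_doc, "html.parser")

# get_text(" ", strip=True) strips every text fragment, drops the empty
# ones, and joins the rest with a single space, so the <br> tags no
# longer split the <p> text into separate items.
results = [
    element.get_text(" ", strip=True)
    for element in soup.find_all(["p", "othertag", "moretag"])
]
print(results)
```

Each matched element yields exactly one string, no matter how many <br> tags it contains, which avoids the variable-index join problem described in the question.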

Answer 2

You can try using the following custom XPath function:

Demo code:

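import lxml.html, lxml.etree

source = '''your html here'''
doc = lxml.html.fromstring(source)

# Register the function in the default (prefix-less) XPath function namespace.
ns = lxml.etree.FunctionNamespace(None)

def cat(context, elements):
    # Join all descendant text nodes of each element into one string per element.
    return [''.join(e.xpath('.//text()')) for e in elements]
ns['concat-texts'] = cat

# The parenthesized print form runs on both Python 2.7 and Python 3.
print(repr(doc.xpath('concat-texts(//p) | //othertag//text() | //moretag//text()')))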

Sample HTML input:

source = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>

<p>
foo
<br>
bar
<br>
baz
</p>

<othertag>
 othertag data
</othertag>
<moretag>
 moretag data
</moretag>
'''

Output:

['\nThis is some important data\n\nEven this is data\n\nthis is useful too\n', '\nfoo\n\nbar\n\nbaz\n', '\n othertag data\n', '\n moretag data\n']

Answer 3

I know this comes late, but somebody might still find it useful. The way I got it working was by replacing the br tags in the original HTML. The content was a bytes object, so it had to be decoded and re-encoded, but it worked like a charm:

from lxml import html
import requests

page = requests.get("the website you are getting the html from")
# page.content is bytes: decode it, replace each <br> with a space, then re-encode.
content = page.content.decode('utf-8').replace("<br>", " ").encode('utf-8')
tree = html.fromstring(content)

After this, //p//text() returned 'This is is some important data even this data this is useful too', which is what you wanted.
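
A plain replace of "<br>" misses spellings such as <br/>, <br /> or <BR>. The following is a minimal offline sketch of the same idea, assuming a regular expression is acceptable for this narrow substitution; the names `raw`, `cleaned`, `parts` and `joined` are illustrative, and the sample bytes stand in for the downloaded page:

```python
import re
from lxml import html

# Stand-in for page.content (bytes); the markup is the sample from the question.
raw = b"""<p>
This is some important data
<br>
Even this is data
<br />
this is useful too
</p>

<othertag>
 othertag data
</othertag>
<moretag>
 moretag data
</moretag>"""

# Replace every <br> variant (<br>, <br/>, <br />, any letter case) with a space.
cleaned = re.sub(r"<br\s*/?>", " ", raw.decode("utf-8"), flags=re.IGNORECASE)
tree = html.fromstring(cleaned)

# With the <br> elements gone, the <p> holds a single text node.
parts = tree.xpath("//p//text()")

# The text node still carries the original newlines, so collapse whitespace.
joined = " ".join(parts[0].split())
print(joined)
```

The trade-off is that the line-break information is discarded before parsing, so this only fits cases where the breaks carry no meaning.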

Answer 4

You say: "I'd want it formatted as a proper unit like below,

('This is is some important data even this data this is useful too', 'othertag data','moretag data')"

But actually, XPath does not do formatting. You're suggesting that you want a sequence of three strings returned; the formatting is done later.

You're using Python, which I assume means that you are using XPath 1.0. In XPath 1.0, there is no such thing as a sequence of three strings. You could return three nodes (the p, othertag, and moretag nodes), and then extracting the string values of these nodes becomes a Python problem rather than an XPath problem. Or you could return the three strings in three separate calls: for example, string(//p) would give you the string value of the first p element.
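
Both options can be sketched with lxml roughly as follows. This is an illustration under the assumption that the document looks like the sample in the question; the variable names are invented here, and normalize-space() is a standard XPath 1.0 function used only to tidy the whitespace:

```python
import lxml.html

# Sample markup copied from the question.
source = """<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>

<othertag>
 othertag data
</othertag>
<moretag>
 moretag data
</moretag>"""

doc = lxml.html.fromstring(source)

# Option 1: a single XPath call returns a node-set of elements in
# document order; turning each node into a string is then done in Python.
nodes = doc.xpath("//p | //othertag | //moretag")

# normalize-space(.) takes the string value of the context node, trims it,
# and collapses every run of whitespace into one space.
values = [str(node.xpath("normalize-space(.)")) for node in nodes]
print(values)

# Option 2: string() applied to a node-set uses only its first node, so
# this is the (un-normalized) string value of the first <p> element.
first_p = str(doc.xpath("string(//p)"))
print(repr(first_p))
```

Because string() returns a string rather than a node-set, it cannot be combined with the | operator, which in XPath 1.0 only accepts node-sets on both sides. That is consistent with the errors reported in the question.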

In your question you say the data is repeated. But you don't say which data is repeated. I don't have a clear picture of what your real source document looks like. That's probably why the answers to your question, including mine, are so incomplete.

4 answers

Answer 1
1 2015-07-28 07:07:38

Answer 2
1 Accepted 2015-07-28 07:21:35

Answer 3
1 2020-08-18 12:43:36

Answer 4
0 2015-07-28 11:23:30
