Concatenating successive text nodes of XML/HTML in XPath

Question

I have this HTML fragment:

<tr>
    <th scope="row" style="text-align:left;">Appeared in</th>
    <td class="" style="">1972<sup id="cite_ref-dottcl_2_2-0" class="reference"><a href="#cite_note-dottcl_2-2"><span>[</span>2<span>]</span></a></sup></td>
</tr>
<tr>
    <th scope="row" style="text-align:left;">Usual 
<a href="/wiki/Filename_extension" title="Filename extension">filename extensions</a>
    </th>
    <td class="" style="">.h .c</td>

</tr>

I am using the expression //th//text() to parse it.

The problem is that it returns ['Appeared in', 'Usual', 'filename extensions'].

What I want is ['Appeared in', 'Usual filename extensions'].

Answer 1

You need XPath 2.0 for this, which most XML libraries for scripting languages (including Scrapy) do not support.

If you can use a more capable XPath processor (XQuery 1.0 and newer also qualify, since they include at least XPath 2.0 as a subset), use:

//th/data()

/data() is equivalent to /data(.), which calls the function on the current context item. The zero-argument form was added in XPath 3.0; with a strict XPath 2.0 processor, write //th/data(.) instead.

`data()` vs `text()`

text() is not a function call but a node test, so //text() is a step that adds every text node to the result sequence individually. data(), by contrast, is a function that aggregates all data for the current context item (here, each <th/> individually).
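To make the difference concrete, here is a minimal sketch in Python. It uses the standard library's xml.etree.ElementTree as a stand-in, which is an assumption on my part: it is not an XPath 2.0 processor, and it only works here because the fragment is well-formed once wrapped in a <table> root. The <td> cells are simplified, and itertext() is used to imitate both behaviours.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, wrapped in a <table> root (an assumption,
# added only so that the markup parses as well-formed XML).
FRAGMENT = """<table>
<tr>
    <th scope="row" style="text-align:left;">Appeared in</th>
    <td>1972</td>
</tr>
<tr>
    <th scope="row" style="text-align:left;">Usual 
<a href="/wiki/Filename_extension" title="Filename extension">filename extensions</a>
    </th>
    <td>.h .c</td>
</tr>
</table>"""

root = ET.fromstring(FRAGMENT)
headers = root.findall(".//th")

# Like //th//text(): every text node is its own item in one flat list.
# Whitespace-only nodes are dropped and the rest stripped for readability.
text_nodes = [t.strip() for th in headers for t in th.itertext() if t.strip()]

# Like data() applied to each <th>: one string per element, with the text
# of all descendants joined together (whitespace left untouched).
string_values = ["".join(th.itertext()) for th in headers]

print(text_nodes)          # ['Appeared in', 'Usual', 'filename extensions']
print(len(string_values))  # 2
```

The second list has one entry per header cell, which is the shape the question asks for. Its second entry still contains the line break and indentation from the markup.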

XPath 1.0 limitations

In XPath 1.0 there is no way to call a function that concatenates strings for each table header element individually: function calls in axis steps are not supported, and neither are explicit loops of the kind XPath 2.0 offers.
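A common way around this limitation is to move the loop into the host language: select the <th> elements first, then build one string per element. The sketch below illustrates that idea with Python's standard library. The helper name header_labels and the shortened sample markup are hypothetical, and ElementTree only accepts well-formed XML, so real-world HTML would need an HTML-aware parser instead.

```python
import xml.etree.ElementTree as ET


def header_labels(markup: str) -> list:
    """Return one whitespace-normalized label per <th> element (hypothetical helper)."""
    root = ET.fromstring(markup)
    labels = []
    for th in root.iter("th"):        # step 1: select the nodes, in document order
        raw = "".join(th.itertext())  # step 2: join the text of this element only
        # Collapse runs of whitespace and trim the ends,
        # similar in effect to XPath's normalize-space().
        labels.append(" ".join(raw.split()))
    return labels


# Shortened version of the question's fragment, wrapped in <table>.
sample = (
    '<table><tr><th>Appeared in</th><td>1972</td></tr>'
    '<tr><th>Usual \n<a href="/wiki/Filename_extension">filename extensions</a>\n    </th>'
    '<td>.h .c</td></tr></table>'
)

print(header_labels(sample))  # ['Appeared in', 'Usual filename extensions']
```

The same two-step pattern should carry over to any XPath 1.0 library that lets you evaluate a second expression, or read the string value, relative to each selected node.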

Answer 2

I will probably get downvoted for parsing HTML with regex, but it can't be helped:

$html = '<tr>
    <th scope="row" style="text-align:left;">Appeared in</th>
    <td class="" style="">1972<sup id="cite_ref-dottcl_2_2-0" class="reference"><a href="#cite_note-dottcl_2-2"><span>[</span>2<span>]</span></a></sup></td>
</tr>
<tr>
    <th scope="row" style="text-align:left;">Usual 
<a href="/wiki/Filename_extension" title="Filename extension">filename extensions</a>
    </th>
    <td class="" style="">.h .c</td>

</tr>';

$html = str_replace(["\r", "\n"], '', $html); // Remove line breaks
preg_match_all('#<th[^>]*>(.*?)</th>#is', $html, $m); // Capture what is between each pair of <th> tags

$result = array_map('strip_tags', $m[1]); // Get rid of the HTML tags inside each match
print_r($result); // Print the results

Output:

Array
(
    [0] => Appeared in
    [1] => Usual filename extensions    
)

Answer 1: score 2, accepted, 2013-06-02 19:17:40

Answer 2: score 0, 2013-06-02 18:49:49
