Joining descendant text nodes of XML/HTML in XPath

I have this HTML fragment:

<tr>
    <th scope="row" style="text-align:left;">Appeared in</th>
    <td class="" style="">1972<sup id="cite_ref-dottcl_2_2-0" class="reference"><a href="#cite_note-dottcl_2-2"><span>[</span>2<span>]</span></a></sup></td>
</tr>
<tr>
    <th scope="row" style="text-align:left;">Usual 
<a href="/wiki/Filename_extension" title="Filename extension">filename extensions</a>
    </th>
    <td class="" style="">.h .c</td>

</tr>

I am using the //th//text() expression to parse it.

The problem is that it returns ['Appeared in', 'Usual', 'filename extensions'].

What I want is ['Appeared in', 'Usual filename extensions'].

Doing this in a single expression requires XPath 2.0, which most XML libraries for scripting languages (including the one scrapy uses) do not support.

If you can use a more capable XPath processor (XQuery 1.0 and newer also qualify, since they all include at least XPath 2.0 as a subset), use:

//th/data()

/data() is equivalent to /data(.), which calls the function on the current context item. If your processor rejects the zero-argument form, write //th/data(.) explicitly.

data() vs. text()

text() is not a function call but a node test, so //text() is a step that adds every text node to the result sequence individually. data(), by contrast, is a function that aggregates all data for the current context item (here: each <th/> individually).
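The difference can be seen without an XPath 2.0 processor. The following is a minimal sketch using Python's standard xml.etree.ElementTree, which is not what scrapy uses and supports only a small XPath subset; itertext() stands in for the text-node step and joining its items stands in for the aggregated string value. The shortened th element is taken from the question.

```python
import xml.etree.ElementTree as ET

# Second header cell from the question, parsed as well-formed XML.
th = ET.fromstring(
    '<th scope="row">Usual \n'
    '<a href="/wiki/Filename_extension">filename extensions</a>\n'
    '    </th>'
)

# Roughly what a text() step sees: one item per text node under <th>.
text_nodes = list(th.itertext())

# Roughly what an aggregating function yields: one concatenated string.
string_value = "".join(text_nodes)

print(text_nodes)
print(repr(string_value))
```

Note that the aggregated string still contains the original line breaks and indentation, so some whitespace clean-up is usually needed afterwards.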

XPath 1.0 limitations

XPath 1.0 offers no way to call a string-concatenating function for each table header element individually: function calls in axis steps are not supported, nor are explicit loops such as those available in XPath 2.0.
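A common way around this limitation is to move the loop into the host language: select the th elements first, then compute one string per element. The sketch below assumes the fragment is well-formed XML and uses only Python's standard library; the header_texts helper and the table wrapper element are made up for illustration. With lxml or scrapy, the equivalent would likely be to evaluate normalize-space(.) relative to each selected th.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<tr>
    <th scope="row" style="text-align:left;">Appeared in</th>
    <td>1972</td>
</tr>
<tr>
    <th scope="row" style="text-align:left;">Usual
<a href="/wiki/Filename_extension" title="Filename extension">filename extensions</a>
    </th>
    <td>.h .c</td>
</tr>
"""


def header_texts(fragment):
    """Return one whitespace-normalized string per <th> element."""
    # Wrap the rows in a single root so they parse as one document.
    root = ET.fromstring("<table>" + fragment + "</table>")
    result = []
    for th in root.iter("th"):
        # Join all descendant text, then collapse whitespace runs,
        # similar in spirit to XPath's normalize-space(.).
        joined = "".join(th.itertext())
        result.append(" ".join(joined.split()))
    return result


print(header_texts(FRAGMENT))  # ['Appeared in', 'Usual filename extensions']
```

Real-world HTML is often not well-formed XML, in which case an HTML-aware parser would be needed in place of ElementTree.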

Ah, I will get downvoted for parsing HTML with regex, but it can't be helped:

$html = '<tr>
    <th scope="row" style="text-align:left;">Appeared in</th>
    <td class="" style="">1972<sup id="cite_ref-dottcl_2_2-0" class="reference"><a href="#cite_note-dottcl_2-2"><span>[</span>2<span>]</span></a></sup></td>
</tr>
<tr>
    <th scope="row" style="text-align:left;">Usual 
<a href="/wiki/Filename_extension" title="Filename extension">filename extensions</a>
    </th>
    <td class="" style="">.h .c</td>

</tr>';

$html = str_replace(["\r", "\n"], '', $html); // Remove line breaks
preg_match_all('#<th[^>]*>(.*?)</th>#is', $html, $m); // Capture what is between the th tags

$result = array_map('strip_tags', $m[1]); // Get rid of HTML tags
print_r($result); // Print the results

Output:

Array
(
    [0] => Appeared in
    [1] => Usual filename extensions    
)
