Joining descendant text nodes of XML/HTML in XPath
I have this HTML fragment:
<tr>
<th scope="row" style="text-align:left;">Appeared in</th>
<td class="" style="">1972<sup id="cite_ref-dottcl_2_2-0" class="reference"><a href="#cite_note-dottcl_2-2"><span>[</span>2<span>]</span></a></sup></td>
</tr>
<tr>
<th scope="row" style="text-align:left;">Usual
<a href="/wiki/Filename_extension" title="Filename extension">filename extensions</a>
</th>
<td class="" style="">.h .c</td>
</tr>
I am using the expression //th//text() to parse it.
The problem is that it returns ['Appeared in', 'Usual', 'filename extensions'].
What I want is ['Appeared in', 'Usual filename extensions'].
You need XPath 2.0 for this, which most XML libraries for scripting languages (including scrapy) do not support.
If you can use a more capable XPath processor (XQuery 1.0 and newer also work, since they include at least XPath 2.0 as a subset), use:
//th/data()
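/data() is equivalent to /data(.), which calls the function on the current context item. Note that the zero-argument form data() was only added in XPath 3.0; a strict XPath 2.0 processor needs the explicit form //th/data(.).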
data() vs text()
text() is not a function call but a node test, so //text() is an axis step that adds every text node to the result sequence individually. data(), by contrast, is a function that aggregates all data for the current context item (here: each <th/> individually).
In XPath 1.0, there is no way to call a string-concatenating function for each table header element individually: function calls in axis steps are not supported, nor are explicit loops of the kind XPath 2.0 offers.
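When only an XPath 1.0 library is available, a common workaround is to select the th elements and join each element's descendant text in the host language. The following is a minimal sketch in Python using only the standard library's xml.etree.ElementTree rather than scrapy (an assumption made to keep it self-contained). The helper names text_nodes and joined_text are hypothetical, the td contents are trimmed, and the fragment is wrapped in a table element so that it parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

# Header rows from the question, wrapped in <table> so the fragment is
# well-formed XML (an assumption; real-world HTML often is not).
FRAGMENT = """<table>
<tr>
<th scope="row" style="text-align:left;">Appeared in</th>
<td class="" style="">1972</td>
</tr>
<tr>
<th scope="row" style="text-align:left;">Usual
<a href="/wiki/Filename_extension" title="Filename extension">filename extensions</a>
</th>
<td class="" style="">.h .c</td>
</tr>
</table>"""

root = ET.fromstring(FRAGMENT)


def text_nodes(root):
    """Mimic //th//text(): every non-blank text node is a separate item."""
    return [
        text.strip()
        for th in root.iter("th")
        for text in th.itertext()
        if text.strip()
    ]


def joined_text(root):
    """Join all descendant text per <th>, collapsing runs of whitespace."""
    return [" ".join("".join(th.itertext()).split()) for th in root.iter("th")]


print(text_nodes(root))   # ['Appeared in', 'Usual', 'filename extensions']
print(joined_text(root))  # ['Appeared in', 'Usual filename extensions']
```

The whitespace collapsing matters here because the second header contains a line break between "Usual" and the link text.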
I will probably get downvoted for parsing HTML with regex, but it can't be helped:
$html = '<tr>
<th scope="row" style="text-align:left;">Appeared in</th>
<td class="" style="">1972<sup id="cite_ref-dottcl_2_2-0" class="reference"><a href="#cite_note-dottcl_2-2"><span>[</span>2<span>]</span></a></sup></td>
</tr>
<tr>
<th scope="row" style="text-align:left;">Usual
<a href="/wiki/Filename_extension" title="Filename extension">filename extensions</a>
</th>
<td class="" style="">.h .c</td>
</tr>';
$html = preg_replace('/\s+/', ' ', $html); // Collapse new lines and indentation into single spaces so words stay separated
preg_match_all('#<th[^>]*>(.*?)</th>#is', $html, $m); // Match what's between the th tags
$result = array_map(function ($s) { return trim(strip_tags($s)); }, $m[1]); // Get rid of HTML tags and surrounding whitespace
print_r($result); // Print the results
Output:
Array
(
[0] => Appeared in
[1] => Usual filename extensions
)
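For readers working in Python (as the question's scrapy context suggests), the same three steps can be sketched with the standard re module: collapse whitespace, capture what sits between the th tags, then strip the remaining tags. This is only an illustrative port under the same caveats as any regex-based HTML handling. The function name th_texts is hypothetical, and the tag-stripping pattern is a crude stand-in for PHP's strip_tags.

```python
import re

# Header rows from the question (td contents trimmed for brevity).
HTML = """<tr>
<th scope="row" style="text-align:left;">Appeared in</th>
<td class="" style="">1972</td>
</tr>
<tr>
<th scope="row" style="text-align:left;">Usual
<a href="/wiki/Filename_extension" title="Filename extension">filename extensions</a>
</th>
<td class="" style="">.h .c</td>
</tr>"""


def th_texts(html):
    """Return the tag-free text of every <th> element, one string per element."""
    # Collapse new lines and indentation into single spaces.
    flat = re.sub(r"\s+", " ", html)
    # Capture what's between each <th ...> and </th>; \b avoids matching <thead>.
    cells = re.findall(r"<th\b[^>]*>(.*?)</th>", flat, flags=re.I | re.S)
    # Drop any remaining tags and trim surrounding whitespace.
    return [re.sub(r"<[^>]+>", "", cell).strip() for cell in cells]


print(th_texts(HTML))  # ['Appeared in', 'Usual filename extensions']
```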