XPath expression for regex-like matching?

I want to find div elements in an HTML document whose id follows a certain pattern. I want to match this regex pattern:

foo_([[:digit:]]{1,8})

using XPath. What is the XPath equivalent of the above pattern?

I'm stuck at //div[@id="foo_ and then what? Could someone complete this into a valid expression?

EDIT

Sorry, I think I have to elaborate. Actually it's not foo_, it's post_message_.

By the way, I use mechanize/nokogiri (Ruby).

Here's the snippet:

html_doc = Nokogiri::HTML(open(myfile))
message_div =  html_doc.xpath('//div[substring(@id,13) = "post_message_" and substring-after(@id, "post_message_") => 0 and substring-after(@id, "post_message_") <= 99999999]') 

It still fails. Error message:

Couldn't evaluate expression ' //div[substring(@id,13) = "post_message_" and substring-after(@id, "post_message_") => 0 and substring-after(@id, "post_message_") <= 99999999] ' (Nokogiri::XML::XPath::SyntaxError)

How about this (updated):

XPath 1.0:

"//div[substring-before(@id, '_') = 'foo' 
       and substring-after(@id, '_') >= 0 
       and substring-after(@id, '_') <= 99999999]"

Edit #2: The OP made a change to the question. The following, even more reduced XPath 1.0 expression works for me:

"//div[substring(@id, 1, 13) = 'post_message_' 
       and substring(@id, 14) >= 0 
       and substring(@id, 14) <= 99999999]"
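For readers more at home in Ruby than in XPath: substring() counts positions from 1, and its two-argument form returns everything from that position to the end of the string. A rough plain-Ruby analogy for integer positions follows; the sample id is made up for illustration.

```ruby
id = 'post_message_12345'   # made-up sample id

prefix       = id[0, 13]    # like substring(@id, 1, 13): the first 13 characters
suffix       = id[13..]     # like substring(@id, 14): from the 14th character on
tail_from_13 = id[12..]     # like substring(@id, 13): from the 13th character on

puts prefix        # post_message_
puts suffix        # 12345
puts tail_from_13  # _12345
```

This suggests why substring(@id, 13) in the question's attempt could never equal "post_message_": it yields the tail of the id, not its first 13 characters.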

XPath 2.0 has a convenient matches() function:

"//div[matches(@id, '^foo_\d{1,8}$')]"

Apart from the better portability, I would expect the numerical expression (XPath 1.0 style) to perform better than the regex test, though this would only become noticeable when processing large data sets.


Original version of the answer:

"//div[substring-before(@id, '_') = 'foo' 
       and number(substring-after(@id, '_')) = substring-after(@id, '_') 
       and number(substring-after(@id, '_')) &gt;= 0 
       and number(substring-after(@id, '_')) &lt;= 99999999]"

The use of the number() function is unnecessary, because the mathematical comparison operators coerce their arguments to numbers implicitly. Any non-number becomes NaN, and the greater-than/less-than tests then fail.
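The coercion rule described above can be imitated in plain Ruby to see which id suffixes pass the range test. This is only a rough sketch: xpath_number and in_range? are hypothetical helper names, and the regex approximates XPath 1.0's number syntax rather than reproducing any particular XPath engine.

```ruby
# Rough imitation of how XPath 1.0 converts a string to a number:
# optional surrounding whitespace, optional minus sign, decimal digits;
# anything else becomes NaN. (Hypothetical helper, not a Nokogiri API.)
def xpath_number(str)
  s = str.strip
  s.match?(/\A-?(\d+(\.\d*)?|\.\d+)\z/) ? s.to_f : Float::NAN
end

# Mirrors: substring(@id, 14) >= 0 and substring(@id, 14) <= 99999999
# Every comparison involving NaN is false, so non-numeric suffixes are rejected.
def in_range?(suffix)
  n = xpath_number(suffix)
  n >= 0 && n <= 99_999_999
end

%w[12345 abc 123456789 1.5].each do |suffix|
  puts "#{suffix.inspect}: #{in_range?(suffix)}"
end
```

Note that a suffix such as 1.5 passes the numeric test even though it is not a plain run of digits, so the range check is slightly looser than the regex.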

I also removed the encoding of the angle brackets, since this is an XML requirement, not an XPath requirement.

As already pointed out, in XPath 2.0 it would be good to use its standard regex capabilities with a function like the matches() function.

One possible XPath 1.0 solution:

//div[starts-with(@id, 'post_message_')
    and
      string-length(@id) = 21
    and
      translate(substring-after(@id, 'post_message_'), 
                '0123456789', 
                ''
                )
     =
      ''
      ] 

Do note the following:

  1. The use of the standard XPath function starts-with().

  2. The use of the standard XPath function string-length().

  3. The use of the standard XPath function substring-after().

  4. The use of the standard XPath function translate().
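To make the four points above concrete, here is a plain-Ruby imitation of what each XPath function contributes to the test. digits_only_id? is a hypothetical helper, not part of Nokogiri; it only mirrors the logic of the expression. Since 'post_message_' is 13 characters long, the length test of 21 accepts exactly eight digits.

```ruby
PREFIX = 'post_message_'

# Hypothetical helper mirroring the XPath 1.0 expression step by step.
def digits_only_id?(id, digits: 8)
  return false unless id.start_with?(PREFIX)               # starts-with()
  return false unless id.length == PREFIX.length + digits  # string-length() = 21
  suffix = id[PREFIX.length..]                              # substring-after()
  suffix.delete('0123456789').empty?                        # translate(..., '0123456789', '') = ''
end

%w[post_message_12345678 post_message_1234567 post_message_1234567x].each do |id|
  puts "#{id}: #{digits_only_id?(id)}"
end
```

To accept one to eight digits instead, the length test would presumably have to become a range check (greater than 13 and at most 21) rather than an equality.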

Or use the XPath function matches(string, pattern).

  <xsl:if test="matches(name(.),'foo_')">

Unfortunately it's not regex, but it might be enough, unless you have other foo_ tags you don't need; in that case I guess you can add a few more "if" checks to filter them out.

Nikkou makes this very simple and readable:

doc.search('div').attr_matches('id', /post_message_\d{1,8}/)
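One thing worth checking with a regex like the one above is anchoring. The sketch below uses only plain Ruby regexes; whether Nikkou's attr_matches anchors the pattern itself has not been verified here, so treat it as an assumption that it behaves like an ordinary unanchored match. The sample ids are made up.

```ruby
UNANCHORED = /post_message_\d{1,8}/       # the pattern as written above
ANCHORED   = /\Apost_message_\d{1,8}\z/   # must span the whole id

ids = %w[post_message_42 post_message_123456789 xpost_message_7 post_message_]

ids.each do |id|
  puts format('%-24s unanchored=%-5s anchored=%s',
              id, UNANCHORED.match?(id), ANCHORED.match?(id))
end
```

Without \A and \z, an id with nine digits or with extra leading characters still matches, because the pattern only has to occur somewhere inside the string.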
