Regex "\d+" selector selecting digits one by one
I've created a small sample of the string that needs to be filtered:
https://regex101.com/r/PvXRiC/1
I would like to get the "61" from the HTML below:
<p class="b-list__count__number">
<span>61</span>/
<span>18786</span>
</p>
As you can see from my example, the "([\\d+])" selector matches 6 and 1 as separate matches.
Is there any way I can get the "61" in a single match?
Your regex does not work because `.*` is a greedy dot pattern that matches the whole line at once and then starts backtracking, giving back characters one at a time until the subsequent subpatterns can match. Thus, only the last digit lands in the second capturing group, since `\\d+` is satisfied by a single digit.
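To make the backtracking concrete, here is a minimal sketch that runs the greedy pattern against the sample markup from the question. The class name `GreedyDemo` and the helper `SecondGroup` are made up for this illustration, and the pattern uses `(\d+)` as the second group, as described above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Sample markup from the question; "\n" is a real line feed here.
    public const string Html =
        "<p class=\"b-list__count__number\">\n<span>61</span>/\n<span>18786</span>\n</p>";

    // Hypothetical helper: returns the text of capture group 2,
    // or an empty string when the pattern does not match.
    public static string SecondGroup(string pattern)
    {
        Match m = Regex.Match(Html, pattern);
        return m.Success ? m.Groups[2].Value : "";
    }

    public static void Main()
    {
        // Greedy .* first consumes "61</span>/" (the rest of the line), then
        // backtracks only until \d+ can match, which happens at the "1".
        string greedy = @"(<p class=""b-list__count__number"">\n<span>.*)(\d+)";
        Console.WriteLine(SecondGroup(greedy)); // prints 1
    }
}
```

Group 1 ends up holding everything up to and including the "6", which is why the "6" never reaches group 2.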
Although you may fix the issue by just making `.*` lazy with `.*?`, or with a safer `[^<]*?`, you should not use regex to parse HTML.
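If you do want the regex-only fix, the two variants mentioned above can be sketched as follows. Again, `LazyDemo` and `SecondGroup` are names invented for this example, and the second group is written as `(\d+)`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyDemo
{
    // Sample markup from the question; "\n" is a real line feed here.
    public const string Html =
        "<p class=\"b-list__count__number\">\n<span>61</span>/\n<span>18786</span>\n</p>";

    // Hypothetical helper: returns the text of capture group 2,
    // or an empty string when the pattern does not match.
    public static string SecondGroup(string pattern)
    {
        Match m = Regex.Match(Html, pattern);
        return m.Success ? m.Groups[2].Value : "";
    }

    public static void Main()
    {
        // Lazy .*? starts by matching nothing, so \d+ gets first pick at "61".
        string lazy = @"(<p class=""b-list__count__number"">\n<span>.*?)(\d+)";
        // [^<]*? is also lazy, and additionally can never run past a "<".
        string negated = @"(<p class=""b-list__count__number"">\n<span>[^<]*?)(\d+)";

        Console.WriteLine(SecondGroup(lazy));    // prints 61
        Console.WriteLine(SecondGroup(negated)); // prints 61
    }
}
```

The `[^<]*?` form is the more defensive of the two: if the first span held no digits, it would fail to match rather than wander into a later tag in search of one.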
Use HtmlAgilityPack, for example:
var html = "<p class=\"b-list__count__number\">\n<span>61</span>/\n<span>18786</span>\n</p>";
HtmlAgilityPack.HtmlDocument hap;
Uri uriResult;
if (Uri.TryCreate(html, UriKind.Absolute, out uriResult))
{ // html is a URL
    var doc = new HtmlAgilityPack.HtmlWeb();
    hap = doc.Load(uriResult.AbsoluteUri);
}
else
{ // html is a string
    hap = new HtmlAgilityPack.HtmlDocument();
    hap.LoadHtml(html);
}
// Select the <p> element by its class attribute.
var node = hap.DocumentNode.SelectSingleNode("//p[@class='b-list__count__number']");
if (node != null)
{
    // ".//span" is relative to node; a bare "//span" would search from the document root.
    var span = node.SelectSingleNode(".//span");
    if (span != null)
    {
        Console.Write(span.InnerText); // => 61
    }
}
The `//p[@class='b-list__count__number']` is an XPath expression that gets a p node whose class attribute has the value b-list__count__number. The `node.SelectSingleNode(".//span")` call gets the first span node inside the p node found, and `InnerText` returns its text. The leading dot matters: without it, `//span` is evaluated from the document root and returns the first span in the whole document, which only happens to be the right one in this sample.
The problem in your regex `(<p class="b-list__count__number">\\n<span>.*)([\\d+])` is that `.*` is greedy and also takes all the digits except the last one. You can use `[^\\d]*` instead to stop at the first digit. Note also that `[\\d+]` is a character class that matches a single character (a digit or a plus sign), so the square brackets have to go as well:
(<p class="b-list__count__number">\n<span>[^\d]*)(\d+)
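A quick way to check this pattern is the sketch below, which runs it against the sample markup and also shows what happens when the square brackets from the question are kept. `NegatedDigitDemo` and `SecondGroup` are illustrative names, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedDigitDemo
{
    // Sample markup from the question; "\n" is a real line feed here.
    public const string Html =
        "<p class=\"b-list__count__number\">\n<span>61</span>/\n<span>18786</span>\n</p>";

    // Hypothetical helper: returns the text of capture group 2,
    // or an empty string when the pattern does not match.
    public static string SecondGroup(string pattern)
    {
        Match m = Regex.Match(Html, pattern);
        return m.Success ? m.Groups[2].Value : "";
    }

    public static void Main()
    {
        // [^\d]* stops in front of the first digit, leaving "61" to (\d+).
        string withQuantifier = @"(<p class=""b-list__count__number"">\n<span>[^\d]*)(\d+)";
        // With the brackets kept, [\d+] is a one-character class: digit or "+".
        string withCharClass = @"(<p class=""b-list__count__number"">\n<span>[^\d]*)([\d+])";

        Console.WriteLine(SecondGroup(withQuantifier)); // prints 61
        Console.WriteLine(SecondGroup(withCharClass));  // prints 6
    }
}
```

One caveat: `[^\d]` matches any non-digit, including `<` and line feeds, so if the first span held no digits this pattern would keep going and pick up digits from a later element.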