How can I use a regex to match the content of an HTML tag when the tag is repeated?
Suppose I have these tags:
<td class="cit-borderleft cit-data">437</td>
<td class="cit-borderleft cit-data">394</td>
<td class="cit-borderleft cit-data">12</td>
<td class="cit-borderleft cit-data">**12**</td>
But I need to match the number 12 in the last tag. I am using the regex
"<td class=\"cit-borderleft cit-data\">(.*?)</td>"
but it matches all four tags.
Don't use regex. Use a proper XML/HTML parser such as jsoup. If you simply want to get the text of the last td element with the classes cit-borderleft cit-data, you can use:
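// Requires the jsoup library on the classpath
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html =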
"<table>" +
"<td class=\"cit-borderleft cit-data\">437</td>\r\n" +
"<td class=\"cit-borderleft cit-data\">394</td>\r\n" +
"<td class=\"cit-borderleft cit-data\">12</td>\r\n" +
"<td class=\"cit-borderleft cit-data\">**12**</td>" +
"</table>";
Document doc = Jsoup.parse(html);
Element last = doc.select("td.cit-borderleft.cit-data").last();
System.out.println(last.text());
Output:
**12**
If you then want to remove the asterisks, simply call replace("*","") on that string and you will get a new string without them.
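As a minimal sketch of that last step, using only the standard library: the class and method names below are hypothetical, and the input string stands in for whatever last.text() returned.

```java
public class StripAsterisks {
    // Removes every literal '*' from the given text.
    // String.replace treats both arguments as plain text, not as a regex,
    // so the asterisk needs no escaping here.
    static String stripAsterisks(String text) {
        return text.replace("*", "");
    }

    public static void main(String[] args) {
        String cellText = "**12**"; // stand-in for the text of the last cell
        System.out.println(stripAsterisks(cellText)); // prints 12
    }
}
```

Strings are immutable in Java, so replace returns a new string and leaves the original untouched.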
Try this:
<td class=\"cit-borderleft cit-data\">\*\*(.*?)\*\*<\/td>
Or, more simply:
\*\*(\d+)\*\*
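For illustration, here is roughly how the simpler pattern could be applied with java.util.regex. This is a sketch under the assumption that the wanted number is the only one wrapped in double asterisks; the class and method names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StarredNumber {
    // \*\*(\d+)\*\* : two literal asterisks, one or more digits (group 1),
    // then two more literal asterisks. Backslashes are doubled for the Java string.
    private static final Pattern STARRED = Pattern.compile("\\*\\*(\\d+)\\*\\*");

    // Returns the first number wrapped in double asterisks, or null if there is none.
    static String firstStarredNumber(String html) {
        Matcher m = STARRED.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html =
            "<td class=\"cit-borderleft cit-data\">437</td>\n" +
            "<td class=\"cit-borderleft cit-data\">394</td>\n" +
            "<td class=\"cit-borderleft cit-data\">12</td>\n" +
            "<td class=\"cit-borderleft cit-data\">**12**</td>";
        System.out.println(firstStarredNumber(html)); // prints 12
    }
}
```

Note that this pattern ignores the td tag entirely, so it would also match a starred number anywhere else in the page.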
Based on your attempt:
<td class=\"cit-borderleft cit-data\">(.*?)<\/td>(?![\s\S]*<\/td>)
I added this part, which rejects a match if any other </td> follows it:
(?![\s\S]*<\/td>)
(?! # Negative Look-Ahead
[\s\S] # Character in [\s\S] Character Class
* # (zero or more)(greedy)
< # "<"
\/ # "/"
td> # "td>"
) # End of Negative Look-Ahead
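A sketch of how this lookahead pattern might be used from Java follows. The class and method names are hypothetical, and the sample input assumes each td sits on its own line, as in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastCellRegex {
    // The original pattern plus a negative lookahead: a match is accepted only
    // if no further </td> appears anywhere after it, i.e. it is the last cell.
    private static final Pattern LAST_CELL = Pattern.compile(
        "<td class=\"cit-borderleft cit-data\">(.*?)</td>(?![\\s\\S]*</td>)");

    // Returns the content of the last matching cell, or null if there is none.
    static String lastCell(String html) {
        Matcher m = LAST_CELL.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html =
            "<td class=\"cit-borderleft cit-data\">437</td>\n" +
            "<td class=\"cit-borderleft cit-data\">394</td>\n" +
            "<td class=\"cit-borderleft cit-data\">12</td>\n" +
            "<td class=\"cit-borderleft cit-data\">**12**</td>";
        System.out.println(lastCell(html)); // prints **12**
    }
}
```

One caveat: this relies on . not matching line terminators by default. If all the cells were on a single line, the lazy (.*?) could stretch across several cells to reach the final </td>, and group 1 would then contain more than the last cell's text.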