Find exact text containing hyphen in html with jsoup

I have an HTML document in which I need to find exact matches in the document text that may or may not contain a hyphen. I'm using Java and Jsoup.

The HTML document could for example have the following:

<li>some text ABCDE some text</li>
<li>some text ABCDE-kriterierna some text</li>

or

<li>ABCDE</li>
<li>ABCDE-kriterierna</li>

I have a list of input strings that I need to match against the text in the HTML document. Two of these input strings could be "ABCDE" and "ABCDE-kriterierna". I need a way with Jsoup, or regex, to match these input words exactly. That is, "ABCDE-kriterierna" should only find the second list element, not the first, and the input word "ABCDE" should only find the first list element, not the second.

For the input word "ABCDE-kriterierna" it's no problem. This Jsoup CSS selector will only find the second list element:

:containsOwn(ABCDE-kriterierna)

The problem is that I can't find a regex or selector for the input word "ABCDE" that finds only the first list element. I can't use the regex \\sABCDE\\s since I can't assume surrounding spaces. I have tried the following, but they all also find "ABCDE-kriterierna":

:matchesOwn(\bABCDE\b)
:containsOwn(ABCDE)
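To illustrate why the \b attempt also hits the hyphenated word: a hyphen is not a word character, so a word boundary exists between "E" and "-". The sketch below reproduces this with plain java.util.regex rather than Jsoup; the class and method names are made up for the example, and it assumes that :matchesOwn performs a find-style search over the element's own text.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Jsoup.
class WordBoundaryDemo {
    // Returns true if the regex matches anywhere inside the text
    // (a find-style search, assumed to mirror what :matchesOwn does).
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String plain = "some text ABCDE some text";
        String hyphenated = "some text ABCDE-kriterierna some text";

        // Both lines print true: '-' is a non-word character,
        // so \b matches right after the final 'E' in either text.
        System.out.println(found("\\bABCDE\\b", plain));
        System.out.println(found("\\bABCDE\\b", hyphenated));
    }
}
```

So \b alone cannot tell "ABCDE" apart from "ABCDE-kriterierna"; the hyphen has to be excluded explicitly.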

Any ideas? Please help...

I can't assume surrounding spaces since ABCDE could be the only text in an element

With that condition in mind, there are two cases to handle:

  1. ABCDE is a word surrounded by whitespace. For example: <li>some text ABCDE some text</li>

  2. ABCDE is the only word in the list tag, with no whitespace. For example: <li>ABCDE</li>

Regex: (?<=[>\\s])ABCDE(?=[<\\s])

Explanation:

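(?<=[>\\s]) is a lookbehind that requires a preceding > (the closing angle bracket of the opening li tag) or \\s (a whitespace character).

ABCDE matches the literal word.

(?=[<\\s]) is a lookahead that requires a following < (the opening angle bracket of the closing li tag) or \\s (a whitespace character).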
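As a rough check of the pattern above, here is a small sketch using java.util.regex on the raw markup from the question. Note that the lookarounds rely on the literal < and > characters, so the pattern is applied to the HTML string itself. For text that has already been extracted from an element (where no angle brackets remain), the sketch also includes an assumed variant, (?<!\\S)ABCDE(?!\\S), which is not from the answer; it accepts start or end of text in place of the brackets. The class and field names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class ExactTokenDemo {
    // Pattern from the answer: intended for raw HTML markup.
    static final Pattern RAW_HTML =
            Pattern.compile("(?<=[>\\s])ABCDE(?=[<\\s])");

    // Assumed variant for tag-free text: no non-whitespace character
    // may directly precede or follow the word.
    static final Pattern TEXT_ONLY =
            Pattern.compile("(?<!\\S)ABCDE(?!\\S)");

    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] markup = {
            "<li>some text ABCDE some text</li>",
            "<li>some text ABCDE-kriterierna some text</li>",
            "<li>ABCDE</li>",
            "<li>ABCDE-kriterierna</li>"
        };
        for (String html : markup) {
            System.out.println(html + " -> " + found(RAW_HTML, html));
        }

        String[] texts = { "ABCDE", "ABCDE-kriterierna" };
        for (String text : texts) {
            System.out.println(text + " -> " + found(TEXT_ONLY, text));
        }
    }
}
```

Both patterns treat only whitespace (or a tag or text edge) as a delimiter, so a word directly followed by punctuation, such as "ABCDE.", would not match; the character classes would need extending if that case matters.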
