I'm trying to parse an html document with Jsoup to get all heading tags. In addition I need to group the heading tags as [h1] [h2] etc...
hh = doc.select("h[0-6]");
but this give me an empty array.
Your selector means h-Tag with attribute "0-6" here - not a regex. But you can combine multiple selectors instead: hh = doc.select("h0, h1, h2, h3, h4, h5, h6");
.
Grouping: do you need a group with all h-Tags + a group for each h1, h2, ... tag or only a group for each h1, h2, ... tag?
Here's an example how you can do this:
// Group of all h-Tags
Elements hTags = doc.select("h1, h2, h3, h4, h5, h6");
// Group of all h1-Tags
Elements h1Tags = hTags.select("h1");
// Group of all h2-Tags
Elements h2Tags = hTags.select("h2");
// ... etc.
If you want a group for each h1, h2, ... tag you can drop first selector and replace hTags
with doc
in the others.
Use doc.select("h1,h2,h3,h4,h5,h6") to get all heading tags. Use doc.select("h1") to get each of those tags separately. See the various things you can do with a select statement in http://preciselyconcise.com/apis_and_installations/jsoup/j_selector.php
Here is a Scala version of the answer that uses Ammonite's syntax to specify the Maven coordinates for Jsoup:
import $ivy.`org.jsoup:jsoup:1.11.3`
val html = scala.io.Source.fromURL("https://scalacourses.com").mkString
val doc = org.jsoup.Jsoup.parse(html)
doc.select("h1, h2, h3, h4, h5, h6, h7").eachText()
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.