
Jsoup: get all heading tags

I'm trying to parse an HTML document with Jsoup to get all heading tags. In addition, I need to group the heading tags by level: [h1], [h2], etc.

     hh = doc.select("h[0-6]");

But this gives me an empty result.

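Your selector means "an h tag with an attribute named 0-6"; the square brackets are CSS attribute syntax, not a regex character class. You can combine multiple selectors instead: hh = doc.select("h1, h2, h3, h4, h5, h6"); (HTML headings only go from h1 to h6, so there is no h0).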
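If regex-style matching is what you're after, one possible approach is to match on the tag name in plain Java rather than in the selector. This is only a sketch: the class name HeadingRegexFilter, the helper headingsByRegex, and the sample HTML are made up for illustration, and it assumes a Jsoup version where Elements can be streamed like any Java list.

```java
import java.util.List;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HeadingRegexFilter {

    // Hypothetical helper: keeps only elements whose tag name is h1..h6.
    static List<Element> headingsByRegex(Document doc) {
        return doc.getAllElements().stream()
                // String.matches tests the whole tag name, so "head" and "html" are rejected
                .filter(el -> el.tagName().matches("h[1-6]"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample markup, assumed for this sketch
        Document doc = Jsoup.parse("<h1>Title</h1><p>text</p><h3>Detail</h3>");

        // Regex applied to the tag name instead of inside a CSS selector
        for (Element heading : headingsByRegex(doc)) {
            System.out.println(heading.tagName() + ": " + heading.text());
        }
    }
}
```

This walks every element in the document, so for simple cases the combined selector is shorter; the filter is mainly useful when the pattern is not a fixed list of tags.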

Grouping: do you need one group with all heading tags plus a group for each level (h1, h2, ...), or only a group for each level?

Here's an example of how you can do this:

// Group of all h-Tags
Elements hTags = doc.select("h1, h2, h3, h4, h5, h6");

// Group of all h1-Tags
Elements h1Tags = hTags.select("h1");
// Group of all h2-Tags
Elements h2Tags = hTags.select("h2");
// ... etc.

If you only want a group for each level (h1, h2, ...), you can drop the first selector and replace hTags with doc in the others.
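As an alternative to one variable per level, the grouping could also be built in a single pass into a map keyed by tag name. The following is a minimal sketch, not part of the original answer: the class name HeadingGroups, the helper groupHeadings, and the sample HTML are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HeadingGroups {

    // Hypothetical helper: maps a tag name ("h1".."h6") to the headings of that level.
    static Map<String, Elements> groupHeadings(Document doc) {
        // LinkedHashMap keeps the levels in the order they first appear in the document
        Map<String, Elements> groups = new LinkedHashMap<>();
        for (Element heading : doc.select("h1, h2, h3, h4, h5, h6")) {
            // tagName() returns the element name, e.g. "h2"
            groups.computeIfAbsent(heading.tagName(), tag -> new Elements()).add(heading);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sample markup, assumed for this sketch
        String html = "<h1>Title</h1><h2>First</h2><p>text</p><h2>Second</h2><h3>Detail</h3>";
        Document doc = Jsoup.parse(html);

        // Print each level followed by the text of its headings
        groupHeadings(doc).forEach((tag, headings) ->
                System.out.println(tag + " -> " + headings.eachText()));
    }
}
```

Levels that do not occur in the document are simply absent from the map, so callers should check for a missing key rather than expect an empty group.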

Use doc.select("h1,h2,h3,h4,h5,h6") to get all heading tags. Use doc.select("h1"), doc.select("h2"), and so on to get each level separately. For the other things you can do with a select statement, see http://preciselyconcise.com/apis_and_installations/jsoup/j_selector.php

Here is a Scala version of the answer that uses Ammonite's syntax to specify the Maven coordinates for Jsoup:

import $ivy.`org.jsoup:jsoup:1.11.3`
val html = scala.io.Source.fromURL("https://scalacourses.com").mkString
val doc = org.jsoup.Jsoup.parse(html)
// HTML defines headings h1 through h6 only, so h7 is not included
doc.select("h1, h2, h3, h4, h5, h6").eachText()

