
Grabbing text from websites

I have this small chunk of code that grabs the HTML from a website. I'm interested in parsing only a certain section of it, though, and doing so several times. More specifically, I'm making a Pokédex and would like to parse certain descriptions from, say, a Bulbapedia page, for example http://bulbapedia.bulbagarden.net/wiki/Bulbasaur_(Pok%C3%A9mon). How would I make this parser take just the description of Bulbasaur? How would I create a boundary for where to start and stop?

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class WebCrawler {
    public static void main(String[] args) {
        try {
            URL url = new URL("http://pokemondb.net/pokedex/bulbasaur");
            URLConnection connection = url.openConnection();
            // try-with-resources closes the reader even if reading fails.
            // UTF-8 is assumed here; the page may declare another charset.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
                String inputLine;
                // Print the raw HTML line by line
                while ((inputLine = in.readLine()) != null) {
                    System.out.println(inputLine);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Try with Jsoup

Its selector syntax is similar to jQuery selectors.

With Jsoup, the following code gets the description paragraphs for Bulbasaur:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) throws IOException {
        // Fetch and parse the page
        Document doc = Jsoup
                .connect("http://bulbapedia.bulbagarden.net/wiki/Bulbasaur_(Pok%C3%A9mon)")
                .get();
        // Select every <p> inside the element with id "mw-content-text"
        Elements paragraphs = doc.select("#mw-content-text p");
        for (Element paragraph : paragraphs) {
            // text() returns the paragraph's text without the HTML tags
            System.out.println(paragraph.text());
        }
    }
}

Here, mw-content-text is the id of the page's main content div.
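
As for the "boundary to start and stop" part of the question: if you would rather not add a library, a rough sketch with plain string searching could look like the one below. The class name BoundaryExtractor, the method between, and the HTML fragment are all made up for illustration; the real page's markers would have to be read from its source. Matching raw strings in HTML is brittle (it breaks as soon as the markup changes), so treat this as a fallback rather than a replacement for a real parser.

```java
import java.util.Optional;

// Hypothetical helper, not part of any library.
class BoundaryExtractor {

    /**
     * Returns the text between the first occurrence of start and the next
     * occurrence of end after it, or empty if either marker is missing.
     */
    static Optional<String> between(String source, String start, String end) {
        int from = source.indexOf(start);
        if (from < 0) {
            return Optional.empty();
        }
        from += start.length();              // skip past the start marker itself
        int to = source.indexOf(end, from);  // search for the end marker after it
        if (to < 0) {
            return Optional.empty();
        }
        return Optional.of(source.substring(from, to).trim());
    }

    public static void main(String[] args) {
        // Made-up fragment standing in for the downloaded page.
        String html = "<div id=\"intro\"><p>Example description.</p></div>"
                + "<p>Another paragraph.</p>";
        System.out.println(
                between(html, "<div id=\"intro\"><p>", "</p>").orElse("(not found)"));
    }
}
```

To reuse it for several sections, call between once per start/end pair on the same downloaded string.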


 