I'm writing some Java code in order to realize NLP tasks upon texts using Wikipedia. How can I use JSoup to extract all the text of a Wikipedia article (for example all the text in http://en.wikipedia.org/wiki/Boston )?
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").get();
Element contentDiv = doc.select("div[id=content]").first();
contentDiv.toString(); // The result
You retrieve formatted content this way, of course. If you want "raw" content you can filter the result with Jsoup.clean
or use the call contentDiv.text()
.
Document doc = Jsoup.connect(url).get();
Elements paragraphs = doc.select(".mw-content-ltr p");
Element firstParagraph = paragraphs.first();
Element lastParagraph = paragraphs.last();
Element p;
int i=1;
p=firstParagraph;
System.out.println(p.text());
while (p!=lastParagraph){
p=paragraphs.get(i);
System.out.println(p.text());
i++;
}
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").timeout(5000);
Element iamcontaningIDofintendedTAG= doc.select("#iamID") ;
System.out.println(iamcontaningIDofintendedTAG.toString());
OR
Elements iamcontaningCLASSofintendedTAG= doc.select(".iamCLASS") ;
System.out.println(iamcontaningCLASSofintendedTAG.toString());
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.