In Java, how can I extract the text of an arbitrary HTML page?

Question

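I solved it this way:

import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

String url = "http://www.repubblica.it/economia/finanza/2011/10/27/news/la_fine_dell_incertezza_solleva_le_azioni_bancarie_in_borsa_alle_italiane_mancano_15_miliardi_di_capitale_met_di_unicredit-23967707/";
// Fetch and parse the page (2000 ms timeout), then take the text of <body> with all tags stripped
Document doc = Jsoup.parse(new URL(url), 2000);
Elements body = doc.select("body");
String s = body.text();
System.out.println(s);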

I still have one problem: I want only the main text, without the title. Can anyone help?

I need an algorithm that extracts the text from web pages. The text should be free of tags, classes, and other markup, and the algorithm should work on any web page.

For example, for this page

I need the main text:

MILANO - Il tanto atteso responso sui fabbisogni di patrimonio delle maggiori banche europee è arrivato. L'Eba (l'Autorità di controllo bancaria europea) ha stabilito la necessità, entro giugno 2012, di ricapitalizzare per ben 106,5 miliardi di euro per i 30 gruppi europei più importanti. Sui 70 gruppi considerati, invece, il deficit patrimoniale è di 160 miliard...............

For this page

I need the main text:

TORINO - Effetto Chrysler sui conti Fiat. Il Lingotto archivia il terzo trimestre con utili in crescita a 17,6 miliardi (8,4 nello stesso trimestre 2010). Più che triplicato l'utile della gestione ordinaria che passa da 256 a 851 milioni. Due terzi arrivano da Detroit che................

Thanks

Answer 1

Try the boilerpipe library.

Another option would be to explore Apache Tika, which detects document types and extracts their text and metadata.

Note that defining "main text" in general is largely impossible. If you know the site, you can study its template and make some assumptions about where the article body sits. Doing this across arbitrary sites is difficult, which is where something like boilerpipe or Tika comes into play.
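To make the idea of "making assumptions" concrete, here is a minimal sketch of one common heuristic: split the page into blocks, then keep only blocks that have enough words and are not mostly link text. This is not boilerpipe's or Tika's actual algorithm, only a simplified illustration. The class name MainTextSketch, the method names, and the thresholds (10 words, 0.3 link density) are all assumptions chosen for the example, and it uses only the standard library with regular expressions, which is fragile on real-world HTML; a proper parser such as jsoup would be a better base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a rough "main text" heuristic, not a library API.
public class MainTextSketch {
    // Elements whose content is never article text (this also drops <title>), plus comments.
    private static final Pattern NON_CONTENT = Pattern.compile(
        "(?is)<(script|style|head)\\b.*?</\\1\\s*>|<!--.*?-->");
    // Opening or closing block-level tags, used as block boundaries.
    private static final Pattern BLOCK_TAG = Pattern.compile(
        "(?i)</?(p|div|td|tr|table|li|ul|ol|h[1-6]|br|section|article|header|footer|nav|body|html)\\b[^>]*>");
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a\\s*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    // Removes remaining (inline) tags and collapses whitespace. Entities are not decoded.
    public static String stripTags(String fragment) {
        return ANY_TAG.matcher(fragment).replaceAll("").replaceAll("\\s+", " ").trim();
    }

    // Keeps blocks with at least minWords words whose share of link text is at most maxLinkDensity.
    public static String extractMainText(String html, int minWords, double maxLinkDensity) {
        String cleaned = NON_CONTENT.matcher(html).replaceAll(" ");
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_TAG.split(cleaned)) {
            String text = stripTags(block);
            if (text.isEmpty()) {
                continue;
            }
            int words = text.split(" ").length;
            // Count the characters of this block that sit inside <a>...</a>.
            int linkChars = 0;
            Matcher m = ANCHOR.matcher(block);
            while (m.find()) {
                linkChars += stripTags(m.group(1)).length();
            }
            double linkDensity = (double) linkChars / text.length();
            if (words >= minWords && linkDensity <= maxLinkDensity) {
                kept.add(text);
            }
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Page title</title></head><body>"
            + "<div><a href=\"/\">Home</a> <a href=\"/news\">News</a></div>"
            + "<h1>Short headline</h1>"
            + "<p>This is the <b>main</b> paragraph of the article, long enough to pass the word threshold.</p>"
            + "</body></html>";
        System.out.println(extractMainText(html, 10, 0.3));
    }
}
```

On the sample in main, the navigation links and the short headline fall below the word threshold, so only the paragraph survives. The thresholds would need tuning per site, which is exactly why a general-purpose solution is hard.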

Answer 2

I've just discovered Jsoup, and it looks perfect for what you want.

It seems that something along these lines will extract the text from the element whose id is "div_Id":

// Fetch and parse the page
Document doc = Jsoup.connect("http://www.repubblica.it/economia/finanza/2011/10/27/news/la_fine_dell_incertezza_solleva_le_azioni_bancarie_in_borsa_alle_italiane_mancano_15_miliardi_di_capitale_met_di_unicredit-23967707/").get();
// "div_Id" is a placeholder for the id of the element that holds the article text
String text = doc.getElementById("div_Id").text();

I'm no expert on this library, but it is indeed much easier than using Apache Commons HttpClient.
