How can I extract the text of an arbitrary HTML page in Java?

Question

I solved it this way:

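import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

String url = "http://www.repubblica.it/economia/finanza/2011/10/27/news/la_fine_dell_incertezza_solleva_le_azioni_bancarie_in_borsa_alle_italiane_mancano_15_miliardi_di_capitale_met_di_unicredit-23967707/";
Document doc = Jsoup.parse(new URL(url), 2000); // fetch and parse, 2000 ms timeout
Elements body = doc.select("body");
String s = body.text(); // text of <body> with all tags stripped
System.out.println(s);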

I still have another problem: I just want the main text, without the title. Can anyone help me?

I need an algorithm that extracts the text from websites. I want the text to be free of tags, classes, etc., and I want the algorithm to work on any web page.

For example, for this page

I need the main text:

MILANO - Il tanto atteso responso sui fabbisogni di patrimonio delle maggiori banche europee è arrivato. L'Eba (l'Autorità di controllo bancaria europea) ha stabilito la necessità, entro giugno 2012, di ricapitalizzare per ben 106,5 miliardi di euro per i 30 gruppi europei più importanti. Sui 70 gruppi considerati, invece, il deficit patrimoniale è di 160 miliard...

For this page

I need the main text:

TORINO - Effetto Chrysler sui conti Fiat. Il Lingotto archivia il terzo trimestre con utili in crescita a 17,6 miliardi (8,4 nello stesso trimestre 2010). Più che triplicato l'utile della gestione ordinaria che passa da 256 a 851 milioni. Due terzi arrivano da Detroit che...

Thanks

Answer 1

Try the boilerpipe library.

Another option would be to explore Apache Tika, which will index content in a meaningful way.

Note that defining "main text" in general is largely impossible. If you know the site, you can study its template and make some assumptions. Doing it across arbitrary sites is difficult, which is where something like boilerpipe or Tika comes into play.
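To give a feel for why this is a heuristic problem, here is a rough sketch of the kind of shallow text features such extractors tend to look at: block length and link density. This is not boilerpipe's or Tika's actual algorithm. The class name MainTextSketch and the two thresholds are made up for illustration, and the regex-based tag handling is only good enough for a demo (it does not decode entities or cope with malformed markup), so a real HTML parser should replace it in practice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative heuristic only: keep long text blocks that contain few links.
final class MainTextSketch {

    // Hypothetical thresholds, picked for the demo and not tuned on real pages.
    private static final int MIN_WORDS = 15;
    private static final double MAX_LINK_DENSITY = 0.3;

    // Scripts, styles and comments never contain article text.
    private static final Pattern NON_CONTENT = Pattern.compile(
            "(?is)<(script|style|noscript)\\b.*?</\\1\\s*>|<!--.*?-->");
    // Tags treated as boundaries between text blocks.
    private static final Pattern BLOCK_BREAK = Pattern.compile(
            "(?i)</?(p|div|h[1-6]|li|ul|ol|td|tr|table|br|section|article"
            + "|header|footer|nav|title|head|body)\\b[^>]*>");
    private static final Pattern ANCHOR =
            Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a\\s*>");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static String extractMainText(String html) {
        String cleaned = NON_CONTENT.matcher(html).replaceAll(" ");
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BREAK.split(cleaned)) {
            String text = plainText(block);
            int words = countWords(text);
            if (words < MIN_WORDS) {
                continue; // titles, headings and menu items are usually short
            }
            int linkWords = 0;
            Matcher m = ANCHOR.matcher(block);
            while (m.find()) {
                linkWords += countWords(plainText(m.group(1)));
            }
            // Navigation and "related articles" blocks are mostly link text.
            if ((double) linkWords / words <= MAX_LINK_DENSITY) {
                kept.add(text);
            }
        }
        return String.join("\n\n", kept);
    }

    // Remove remaining inline tags and collapse whitespace.
    private static String plainText(String fragment) {
        return TAG.matcher(fragment).replaceAll(" ")
                .replaceAll("\\s+", " ").trim();
    }

    private static int countWords(String text) {
        return text.isEmpty() ? 0 : text.split(" ").length;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Short title</title></head><body>"
                + "<h1>Short title</h1>"
                + "<p>This is a longer paragraph that stands in for the article "
                + "body, with enough words to pass the length threshold.</p>"
                + "</body></html>";
        System.out.println(extractMainText(html));
    }
}
```

Because the title and headings are short blocks, they fall below the word threshold and are dropped, which is roughly the "main text without the title" behaviour asked for. Expect it to misfire on pages with very short paragraphs or link-heavy articles, which is exactly why a dedicated library is the better choice.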

Answer 2

I've just discovered Jsoup, and it looks perfect for what you want.

Something along these lines should extract the text from the element with id "div_Id":

Document doc = Jsoup.connect("http://www.repubblica.it/economia/finanza/2011/10/27/news/la_fine_dell_incertezza_solleva_le_azioni_bancarie_in_borsa_alle_italiane_mancano_15_miliardi_di_capitale_met_di_unicredit-23967707/").get();
// "div_Id" is a placeholder for the id of the element holding the article; getElementById returns null if it is absent
String text = doc.getElementById("div_Id").text();

I'm not an expert on this library, but it is indeed much easier than using Commons HttpClient.

Answer 1
2 2011-10-27 18:12:07

Answer 2
2 2011-10-27 19:47:24
