JSOUP - Extract data directly in a specific format from a page
I am currently experimenting with jsoup, and my goal is to extract data from this retail website in the following form:
Title: blabl
Link: foba
Grösse: 9999
KP: FALSE
Miete: TRUE
Preis: 1923,23
So far I have written this test program:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class jsoup_test {
    public static void main(String[] args) throws IOException {
        String url = "http://derstandard.at/anzeiger/immoweb/Suchergebnis.aspx?Regionen=9&Bezirke=&Arten=&AngebotTyp=&timestamp=1363305908912";
        print("Fetching %s...", url);
        // Download and parse the search results page
        Document doc = Jsoup.connect(url).get();
        // Select the listing rows by their CSS class
        Elements price = doc.select("tr.topangebot");
        Elements price1 = doc.select("tr.white");
        // Printing an Elements object prints its outer HTML
        System.out.println("--------------------------------");
        System.out.println(price);
        System.out.println("--------------------------------");
        System.out.println(price1);
    }

    // Small printf-style helper
    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }
}
However, this program gives me my data like this:
<tr id="ctl00_Body_mc_cErgebnisListe1_ctl02_InseratInfoTR" class="topangebot">
<td class="BildTD" rowspan="2"> <a href="/anzeiger/immoweb/Detail.aspx?InseratID=6847212&FromTopAngebot=true"><img border="0" src="http://images.derstandard.at/t/22/upload/imagesanzeiger/immoupload/2013/02/27/277515f7-f935-4a13-83fb-dbe3af930e28.jpg" alt="" /></a> </td>
<td class="TitleTD" rowspan="2"> <span class="neu">TOP!</span> <strong><a href="/anzeiger/immoweb/Detail.aspx?InseratID=6847212&FromTopAngebot=true">Gehobene Qualität, Design und exquisite Ausführung: Dachausbau mit Weitblick und 100 m² Terrasse</a></strong><br /><a href="/anzeiger/immoweb/Detail.aspx?InseratID=6847212&FromTopAngebot=true">Wien 16.,Ottakring, Dachgeschoss</a><br /><span style="color: gray">Erstbezug, Küche, Parkettboden, Hauptmiete, Terrasse, Lift, Keller, Altbau, Kabel/Sat-TV, Barrierefrei</span> </td>
<td class="GroessenTD" rowspan="2"> <span class="strong">125 m²</span><br /><span class="strong">4 </span>Zimmer </td>
<td class="PreisTD" style="border:none;"> <span class="light">Miete</span> 2.190 <br /> </td>
</tr>
<tr id="ctl00_Body_mc_cErgebnisListe1_ctl02_MerklisteTR" class="topangebot">
<td class="merkliste"> </td>
</tr>
<tr id="ctl00_Body_mc_cErgebnisListe1_ctl03_InseratInfoTR" class="topangebot">
<td class="BildTD" rowspan="2"> <a href="/anzeiger/immoweb/Detail.aspx?InseratID=6871213&FromTopAngebot=true"><img border="0" src="http://images.derstandard.at/t/22/upload/imagesanzeiger/immoimporte/justimmo2/files.justimmo.at/public/pic/big/AEs_YegpKC.JPG" alt="" /></a> </td>
<td class="TitleTD" rowspan="2"> <span class="neu">TOP!</span> <strong><a href="/anzeiger/immoweb/Detail.aspx?InseratID=6871213&FromTopAngebot=true">HS-IMMO: 14. PREISSENSATION Eckzinshaus 1414m² Leerstand - Gesamtnutzfläche 1670m² + Rohdachboden ca. 700m² erzielbar ( Baubescheid ) € 1555.-/m² NFL</a></strong><br /><a href="/anzeiger/immoweb/Detail.aspx?InseratID=6871213&FromTopAngebot=true">Wien 14.,Penzing, Zinshaus</a><br /><span style="color: gray">Parkettboden, Altbau, Kabel/Sat-TV</span> </td>
<td class="GroessenTD" rowspan="2"> <span class="strong">1.670 m²</span><br /> </td>
<td class="PreisTD" style="border:none;"> <span class="light">KP</span> 2.590.000 <br /> </td>
</tr>...
Which is not in a human-readable format. Therefore my question is: how do I get jsoup to extract the data DIRECTLY in the format I want?
Thanks for your replies!
For example, to select the title, you would do the following:
String title = doc.select("tr.topangebot > td.TitleTD").first().text();
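Building on that selector, a minimal sketch of printing the first three fields of each listing in the requested layout might look like this. It parses a shortened copy of one row from the question instead of fetching the live page, so it only assumes the markup shown there. The class name ListingPrinter, the helper formatRow, and the base URI passed to Jsoup.parse are illustrative assumptions, not part of the original program.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ListingPrinter {

    // Shortened copy of one listing row from the question (assumption: the
    // live page still uses this markup). The escape in "m..." is a superscript two.
    static final String SAMPLE_HTML =
        "<table>"
        + "<tr class=\"topangebot\">"
        + "<td class=\"TitleTD\"><span class=\"neu\">TOP!</span> <strong>"
        + "<a href=\"/anzeiger/immoweb/Detail.aspx?InseratID=6847212\">Dachausbau mit Weitblick</a>"
        + "</strong></td>"
        + "<td class=\"GroessenTD\"><span class=\"strong\">125 m\u00b2</span></td>"
        + "</tr>"
        + "<tr class=\"topangebot\"><td class=\"merkliste\"> </td></tr>"
        + "</table>";

    // Builds the "Title / Link / Grösse" lines for one listing row.
    static String formatRow(Element row) {
        Element link = row.select("td.TitleTD strong a").first();
        Element size = row.select("td.GroessenTD span.strong").first();
        String title = link == null ? "" : link.text();
        // "abs:href" resolves the relative href against the document's base URI.
        String url = link == null ? "" : link.attr("abs:href");
        // Keep only the digits, so a value such as "1.670 m2" becomes "1670".
        String digits = size == null ? "" : size.text().replaceAll("[^0-9]", "");
        return "Title: " + title + "\n"
             + "Link: " + url + "\n"
             + "Grösse: " + digits;
    }

    public static void main(String[] args) {
        // The base URI is an assumption, used only to make the links absolute.
        Document doc = Jsoup.parse(SAMPLE_HTML, "http://derstandard.at/");
        // :has(td.TitleTD) skips the near-empty "merkliste" rows that share the class.
        for (Element row : doc.select("tr.topangebot:has(td.TitleTD)")) {
            System.out.println(formatRow(row));
            System.out.println("--------------------------------");
        }
    }
}
```

Selecting the inner link rather than the whole td.TitleTD cell keeps the "TOP!" badge and the location line out of the title. Against the live page, the same loop could presumably run over the document returned by Jsoup.connect(url).get(), with "tr.white" added to the selector, provided those rows follow the same structure.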
You can navigate the page using the DOM if you know the page structure:
http://jsoup.org/cookbook/extracting-data/dom-navigation
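As a hedged illustration of that DOM-style navigation, the sketch below reads a PreisTD cell like the ones in the question's output and turns it into the KP / Miete / Preis lines. PriceCell and describe are hypothetical names, and the code assumes the cell holds a label element (Miete or KP) followed by the amount as plain text, as in the markup shown above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PriceCell {

    // Turns one <td class="PreisTD"> cell into the three price-related lines.
    static String describe(Element cell) {
        // The label sits in the cell's first child element (<span class="light">).
        Element label = cell.children().first();
        String kind = label == null ? "" : label.text().trim();
        // ownText() returns only the cell's direct text, without the label span.
        // The amount is kept exactly as the page formats it (e.g. "2.190").
        String amount = cell.ownText().trim();
        return "KP: " + (kind.equals("KP") ? "TRUE" : "FALSE") + "\n"
             + "Miete: " + (kind.equals("Miete") ? "TRUE" : "FALSE") + "\n"
             + "Preis: " + amount;
    }

    public static void main(String[] args) {
        // Inline copy of a price cell from the question, so no network access is needed.
        String html = "<table><tr class=\"topangebot\">"
            + "<td class=\"PreisTD\"> <span class=\"light\">Miete</span> 2.190 <br /> </td>"
            + "</tr></table>";
        Document doc = Jsoup.parse(html);
        // getElementsByClass walks the DOM instead of using a CSS selector string.
        for (Element cell : doc.getElementsByClass("PreisTD")) {
            System.out.println(describe(cell));
        }
    }
}
```

The amount is left as a string on purpose: the page uses a dot as the thousands separator, so converting it to a number would need a locale-aware parser.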
This question has a bunch of good web scrapers.
I like to use Jsoup because its methods were literally built for DOM traversal. So, if you are good at HTML, CSS, and jQuery, this library was built for you. Yes, the Jsoup approach may be too fast. Yes, it may not suit your needs. But when it comes to gathering any type of information from any type of website, Jsoup is flexible enough to meet your needs.