
JSOUP - Extract data directly in a specific format from a page

I am currently experimenting with jsoup, and my goal is to extract data from this retail website in the following form:

 Title: blabl
 Link: foba
 Grösse: 9999
 KP: FALSE
 Miete: TRUE
 Preis: 1923,23

So far I have written this test program:

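import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class jsoup_test {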
    public static void main(String[] args) throws IOException {
        String url = "http://derstandard.at/anzeiger/immoweb/Suchergebnis.aspx?Regionen=9&Bezirke=&Arten=&AngebotTyp=&timestamp=1363305908912";
        print("Fetching %s...", url);

        Document doc = Jsoup.connect(url).get();
        Elements price = doc.select("tr.topangebot");
        Elements price1 = doc.select("tr.white");

        System.out.println("--------------------------------"); 
        System.out.println(price);  
        System.out.println("--------------------------------"); 
        System.out.println(price1); 

    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

}

However, this program gives me my data like this:

<tr id="ctl00_Body_mc_cErgebnisListe1_ctl02_InseratInfoTR" class="topangebot"> 
 <td class="BildTD" rowspan="2"> <a href="/anzeiger/immoweb/Detail.aspx?InseratID=6847212&amp;FromTopAngebot=true"><img border="0" src="http://images.derstandard.at/t/22/upload/imagesanzeiger/immoupload/2013/02/27/277515f7-f935-4a13-83fb-dbe3af930e28.jpg" alt="" /></a> </td> 
 <td class="TitleTD" rowspan="2"> <span class="neu">TOP!</span> <strong><a href="/anzeiger/immoweb/Detail.aspx?InseratID=6847212&amp;FromTopAngebot=true">Gehobene Qualit&auml;t, Design und exquisite Ausf&uuml;hrung: Dachausbau mit Weitblick und 100 m&sup2; Terrasse</a></strong><br /><a href="/anzeiger/immoweb/Detail.aspx?InseratID=6847212&amp;FromTopAngebot=true">Wien 16.,Ottakring, Dachgeschoss</a><br /><span style="color: gray">Erstbezug, K&uuml;che, Parkettboden, Hauptmiete, Terrasse, Lift, Keller, Altbau, Kabel/Sat-TV, Barrierefrei</span> </td> 
 <td class="GroessenTD" rowspan="2"> <span class="strong">125 m&sup2;</span><br /><span class="strong">4&nbsp;</span>Zimmer </td> 
 <td class="PreisTD" style="border:none;"> <span class="light">Miete</span>&nbsp;2.190&nbsp;<br /> </td> 
</tr>
<tr id="ctl00_Body_mc_cErgebnisListe1_ctl02_MerklisteTR" class="topangebot"> 
 <td class="merkliste"> </td> 
</tr>
<tr id="ctl00_Body_mc_cErgebnisListe1_ctl03_InseratInfoTR" class="topangebot"> 
 <td class="BildTD" rowspan="2"> <a href="/anzeiger/immoweb/Detail.aspx?InseratID=6871213&amp;FromTopAngebot=true"><img border="0" src="http://images.derstandard.at/t/22/upload/imagesanzeiger/immoimporte/justimmo2/files.justimmo.at/public/pic/big/AEs_YegpKC.JPG" alt="" /></a> </td> 
 <td class="TitleTD" rowspan="2"> <span class="neu">TOP!</span> <strong><a href="/anzeiger/immoweb/Detail.aspx?InseratID=6871213&amp;FromTopAngebot=true">HS-IMMO: 14. PREISSENSATION Eckzinshaus 1414m&sup2; Leerstand - Gesamtnutzfl&auml;che 1670m&sup2; + Rohdachboden ca. 700m&sup2; erzielbar ( Baubescheid ) € 1555.-/m&sup2; NFL</a></strong><br /><a href="/anzeiger/immoweb/Detail.aspx?InseratID=6871213&amp;FromTopAngebot=true">Wien 14.,Penzing, Zinshaus</a><br /><span style="color: gray">Parkettboden, Altbau, Kabel/Sat-TV</span> </td> 
 <td class="GroessenTD" rowspan="2"> <span class="strong">1.670 m&sup2;</span><br /> </td> 
 <td class="PreisTD" style="border:none;"> <span class="light">KP</span>&nbsp;2.590.000&nbsp;<br /> </td> 
</tr>...

This is not in a human-readable format. Therefore my question is: how do I get jsoup to extract the data directly in the format I want?

Thanks for your replies.

For example, to select the title you would do something like this:

String title = doc.select("tr.topangebot > td.TitleTD").first().text();
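The same selector approach can be extended to the other fields. The following is only a sketch, not a tested scraper for the live site: it assumes jsoup is on the classpath, that every listing row has the cell layout shown in the question's output, and that prices use German number formatting ("." for thousands, "," for decimals). The class name `ListingPrinter`, its helper methods, and the shortened sample row are made up for illustration.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ListingPrinter {

    // Assumed base URI (taken from the URL in the question), used to resolve relative links.
    static final String BASE_URI = "http://derstandard.at/anzeiger/immoweb/Suchergebnis.aspx";

    // Shortened copy of the first result row printed in the question (title abbreviated).
    static final String SAMPLE_HTML =
          "<table>"
        + "<tr id=\"ctl00_Body_mc_cErgebnisListe1_ctl02_InseratInfoTR\" class=\"topangebot\">"
        + "<td class=\"TitleTD\" rowspan=\"2\"><span class=\"neu\">TOP!</span> <strong>"
        + "<a href=\"/anzeiger/immoweb/Detail.aspx?InseratID=6847212&amp;FromTopAngebot=true\">"
        + "Dachausbau mit Weitblick</a></strong></td>"
        + "<td class=\"GroessenTD\" rowspan=\"2\"><span class=\"strong\">125 m&sup2;</span><br />"
        + "<span class=\"strong\">4&nbsp;</span>Zimmer</td>"
        + "<td class=\"PreisTD\"><span class=\"light\">Miete</span>&nbsp;2.190&nbsp;<br /></td>"
        + "</tr>"
        + "<tr id=\"ctl00_Body_mc_cErgebnisListe1_ctl02_MerklisteTR\" class=\"topangebot\">"
        + "<td class=\"merkliste\"> </td></tr>"
        + "</table>";

    /** Parses an HTML string and formats every listing row in it. */
    static String formatAll(String html) {
        return formatAll(Jsoup.parse(html, BASE_URI));
    }

    /** Formats every listing row of an already parsed document. */
    static String formatAll(Document doc) {
        StringBuilder out = new StringBuilder();
        // Only rows with a title cell are listings; the "MerklisteTR" rows are skipped.
        for (Element row : doc.select("tr:has(td.TitleTD)")) {
            out.append(formatRow(row)).append('\n');
        }
        return out.toString();
    }

    /** Builds the "Title / Link / Grösse / KP / Miete / Preis" block for one row. */
    static String formatRow(Element row) {
        Element link = row.select("td.TitleTD strong a").first();
        Element size = row.select("td.GroessenTD span.strong").first();
        Element priceCell = row.select("td.PreisTD").first();
        Element priceType = row.select("td.PreisTD span.light").first();

        String type = priceType == null ? "" : priceType.text().trim();
        return "Title: " + (link == null ? "" : link.text()) + "\n"
             + "Link: " + (link == null ? "" : link.absUrl("href")) + "\n"
             + "Grösse: " + (size == null ? "" : size.text().replaceAll("[^0-9]", "")) + "\n"
             + "KP: " + (type.equals("KP") ? "TRUE" : "FALSE") + "\n"
             + "Miete: " + (type.equals("Miete") ? "TRUE" : "FALSE") + "\n"
             + "Preis: " + (priceCell == null ? "" : formatPrice(priceCell.ownText())) + "\n";
    }

    /** Turns text such as "2.190" into "2190,00"; assumes German number formatting. */
    static String formatPrice(String raw) {
        String cleaned = raw.replaceAll("[^0-9.,]", ""); // drops non-breaking spaces as well
        if (cleaned.isEmpty()) {
            return "";
        }
        try {
            double value = NumberFormat.getInstance(Locale.GERMANY).parse(cleaned).doubleValue();
            return String.format(Locale.GERMANY, "%.2f", value);
        } catch (ParseException e) {
            return cleaned; // fall back to the raw digits if parsing fails
        }
    }

    public static void main(String[] args) {
        System.out.print(formatAll(SAMPLE_HTML));
    }
}
```

To run it against the real page, the document fetched with `Jsoup.connect(url).get()` in the question's program could be passed to `formatAll(Document)`. Whether the `tr.white` rows share this cell layout would need to be checked against the actual markup.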

You can navigate the page using the DOM if you know the page structure:

http://jsoup.org/cookbook/extracting-data/dom-navigation
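As a rough illustration of DOM-style navigation, the sketch below walks from a row id to its title cell and then to the first link, using jsoup's `getElementById`, `getElementsByTag`, and `hasClass` methods instead of CSS selectors. It assumes jsoup is on the classpath. The class name `DomNavigationDemo` and the shortened sample row are made up for illustration, and the row id is copied from the output in the question, so it may differ on the live page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DomNavigationDemo {

    // Row id as it appears in the question's output; the live page may use other ids.
    static final String ROW_ID = "ctl00_Body_mc_cErgebnisListe1_ctl02_InseratInfoTR";

    // Shortened copy of one result row from the question (title abbreviated).
    static final String SAMPLE_HTML =
          "<table><tr id=\"" + ROW_ID + "\" class=\"topangebot\">"
        + "<td class=\"TitleTD\" rowspan=\"2\"><span class=\"neu\">TOP!</span> <strong>"
        + "<a href=\"/anzeiger/immoweb/Detail.aspx?InseratID=6847212&amp;FromTopAngebot=true\">"
        + "Dachausbau mit Weitblick</a></strong></td>"
        + "<td class=\"PreisTD\"><span class=\"light\">Miete</span>&nbsp;2.190&nbsp;</td>"
        + "</tr></table>";

    /** Returns "link text -> href" for the title link of the sample row, or null if not found. */
    static String describeFirstLink(String html) {
        Document doc = Jsoup.parse(html);
        Element row = doc.getElementById(ROW_ID);
        if (row == null) {
            return null;
        }
        // Walk the row's cells and stop at the one carrying the title.
        for (Element cell : row.getElementsByTag("td")) {
            if (cell.hasClass("TitleTD")) {
                Element link = cell.getElementsByTag("a").first();
                return link == null ? null : link.text() + " -> " + link.attr("href");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(describeFirstLink(SAMPLE_HTML));
    }
}
```

Selectors are usually shorter for this kind of lookup. Explicit navigation like this can be easier to follow when the path through the tree depends on conditions that are awkward to express as a selector.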

This question lists a number of good web scrapers:

Web scraping with Java

I like to use Jsoup because its methods were built for DOM traversal. So, if you are comfortable with HTML, CSS, and jQuery, this library was built for you. Yes, the Jsoup approach may be too fast. Yes, it may not suit your needs. But when it comes to gathering information from a website's HTML, Jsoup is flexible enough for most tasks.
