How do I correctly parse data using JSoup (java)

I want to parse the data (CompanyName, Location, jobDescription, ...) out of this HTML using JSoup (Java). I get stuck when trying to iterate over the job listings.

The HTML extract below is one of many "joblisting" divs that I want to iterate over to extract the data. I just can't work out how to iterate over the specific div elements. Sorry for the newbie question, but maybe someone who already knows which function to use can help me. select?

<div class="between_listings"><!-- local.spacer --></div>

<div id="joblisting-2944914" class="joblisting listing-even listing-even company-98028 " itemscope itemtype="http://schema.org/JobPosting">


<div class="company_logo" itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">
     <a href="/stellenangebote-des-unternehmens--Delivery-Hero-Holding-GmbH--98028.html" title="Jobs Delivery Hero Holding GmbH" itemprop="url">
       <img src="/upload_de/logo/D/logoDelivery-Hero-Holding-GmbH-98028DE.gif" alt="Logo Delivery Hero Holding GmbH" itemprop="image" width="160" height="80" />
     </a>
</div>


<div class="job_info">


<div class="h3 job_title">
   <a id="jobtitle-2944914" href="/stellenangebote--Junior-Business-Intelligence-Analyst-CRM-m-f-Berlin-Delivery-Hero-Holding-GmbH--2944914-inline.html?ssaPOP=204&ssaPOR=203" title="Arbeiten bei Delivery Hero Holding GmbH" itemprop="url">
      <span itemprop="title">Junior Business Intelligence Analyst / CRM (m/f)</span>
   </a>
</div>

<div class="h3 company_name" itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">

    <span itemprop="name">Delivery Hero Holding GmbH</span>

</div>

</div>




<div class="job_location_date">

    <div class="job_location target-location">
         <div class="job_location_info" itemprop="jobLocation" itemscope itemtype="http://schema.org/Place">


            <div class="h3 locality" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
                  <span itemprop="addressLocality"> Berlin</span>
            </div>


            <span class="location_actions">
                <a href="javaScript:PopUp('http://www.stepstone.de/5/standort.html?OfferId=2944914&ssaPOP=203&ssaPOR=203','resultList',800,520,1)" class="action_showlistingonmap showlabel" title="Google Maps" itemprop="maps">
                   <span class="location-icon"><!-- --></span>
                   <span class="location-label">Google Maps</span>
                </a>
            </span>

          </div>
       </div>

       <div class="job_date_added" itemprop="datePosted"><time datetime="2014-07-04">04.07.14</time></div>
</div>


<div class="job_actions">


</div>

</div>
<div class="between_listings"><!-- local.spacer --></div>

Java code:

File input = new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt");
// Load file into extraction1       
Document ParseResult = Jsoup.parse(input, "UTF-8", "http://example.com/");                          
Elements jobListingElements = ParseResult.select(".joblisting");        
for (Element jobListingElement: jobListingElements) {         
    jobListingElement.select(".companyName span[itemprop=\"name\"]");         
    // other element properties         
    System.out.println(jobListingElements);
}

Thank you!

So you already have your Jsoup document, right? Then it is pretty easy, provided the CSS class joblisting does not appear anywhere else.

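import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Jsoup.parse(File, String) throws IOException, so call this from a method that handles it.
Document document = Jsoup.parse(new File("d:/bla.html"), "utf-8");
// One Element per <div class="joblisting ...">
Elements elements = document.select(".joblisting");
for (Element element : elements) {
    // Select relative to the current listing, not the whole document
    Elements jobTitleElement = element.select(".job_title span");
    Elements companyNameElement = element.select(".company_name span[itemprop=name]");
    String companyName = companyNameElement.text();
    String jobTitle = jobTitleElement.text();

    System.out.println(companyName);
    System.out.println(jobTitle);
}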

I don't know why the attribute selector [itemprop*=\"name\"] does not find the span (further reading: http://jsoup.org/cookbook/extracting-data/selector-syntax).

Got it: span[itemprop=name], without any quotes or escapes. Other attributes or values should also work if you need a more specific selection.
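To round this off, here is a minimal, self-contained sketch that applies the same per-listing select approach to the other fields in the question (location and date posted). The class name JobListingExample, the helper extract, and the constant SAMPLE_HTML are made up for illustration; SAMPLE_HTML is a trimmed copy of the markup from the question, written with single-quoted attributes to keep the Java string readable. It assumes jsoup is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JobListingExample {

    // Hypothetical sample: a trimmed version of one "joblisting" div from the question.
    public static final String SAMPLE_HTML =
          "<div class='between_listings'></div>"
        + "<div id='joblisting-2944914' class='joblisting listing-even company-98028'>"
        + "  <div class='job_info'>"
        + "    <div class='h3 job_title'>"
        + "      <a href='#'><span itemprop='title'>Junior Business Intelligence Analyst / CRM (m/f)</span></a>"
        + "    </div>"
        + "    <div class='h3 company_name'><span itemprop='name'>Delivery Hero Holding GmbH</span></div>"
        + "  </div>"
        + "  <div class='job_location_date'>"
        + "    <div class='h3 locality'><span itemprop='addressLocality'> Berlin</span></div>"
        + "    <div class='job_date_added'><time datetime='2014-07-04'>04.07.14</time></div>"
        + "  </div>"
        + "</div>";

    // Hypothetical helper: returns one {title, company, location, date} array per listing.
    public static List<String[]> extract(String html) {
        Document document = Jsoup.parse(html);
        List<String[]> rows = new ArrayList<>();
        for (Element listing : document.select("div.joblisting")) {
            // Each select runs relative to the current listing only.
            String title = listing.select(".job_title span[itemprop=title]").text();
            String company = listing.select(".company_name span[itemprop=name]").text();
            String location = listing.select("span[itemprop=addressLocality]").text();
            // The machine-readable date lives in the datetime attribute of <time>.
            String date = listing.select(".job_date_added time").attr("datetime");
            rows.add(new String[] {title, company, location, date});
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : extract(SAMPLE_HTML)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Note that text() and attr() on an empty Elements result return an empty string rather than throwing, so a listing that lacks one of the fields simply yields an empty value for it.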
