
Building Strings by scraping HTML with JSoup

I'm a novice Java programmer, and I'm just beginning to expand into the world of libraries, APIs, and the like. I have a relatively simple idea that can be my pet project when I'm not working on homework.

I'm interested in scraping HTML from a few different sites and building strings that look like Artist - "Track Name". I've got one site working the way I want, but I feel it could be done a lot more smoothly. Here's the rundown on what I do for Site A:

I have JSoup create Elements for everything with the class plrow, like so:

<p class="plrow"><b><a href="playlist.php?station=foo">Artist</a></b> “Title” (<span class="sn_ld"><a href="playlist.php?station=foo">Label</a></span>) <SMALL><b>N </b></SMALL></p></td></tr><tr class="ev"><td><a name="98069"></a><p class="pltime">Time</p>

From there, I create a String array of lines that are split after the last </p>, then use the following code to process the array:

for (int i = 0; i < tracks.length; i++) {
    // Strip the markup and keep only the visible text
    String text = Jsoup.parse(tracks[i]).text();
    // Keep everything before the closing curly quote, then put the quote back
    tracks[i] = text.split("”")[0] + "”";
}

Which is a pretty hackish way to get Artist "Title" the way I want, but the result is fine for me.

Site B is a little bit different.

I've determined that the artists and titles are all contained like this: <span class="artist" property="foaf:name">Artist Name</span> </a> </span> <span class="title" property="dc:title">Title</span>

along with more information, all inside of <li id="segmentevent-random" class="segment track" typeof="po:MusicSegment" about="/url"> song info </li>

I tried to grab all of the artists first, then the titles, and then merge the two, but I ran into trouble because the "dc:title" property used for the track title is also used for other non-music things, so I can't directly match up an artist with a track.

I have spent the lion's share of this weekend trying to get this working, viewing countless questions tagged with Jsoup and reading the Jsoup cookbook and API guide. I have a feeling that part of my trouble stems from my relatively limited knowledge of how web pages are coded, though it may mostly come down to my understanding of how to plug these bits of code into Jsoup.

I appreciate any help or guidance, and I've got to say, it's really nice to ask a non-homework question here (though I've found quite a few hints in what others have asked! ;) )

Common:

If you want to parse content from several different websites, it's a good idea to distinguish between them. For example, you can decide whether to parse Page A or Page B based on the URL.

Example:

if( urlToPage.contains("pagea.com") )
{
    // Call parsemethod for Page A or create parserclass
}
else if( urlToPage.contains("pageb.com") )
{
    // Call parsemethod for Page B or create parserclass
}
// ... 
else
{
    // Eg. throw Exception because there's no parser available
}
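One caveat with contains(): it also matches when the site name shows up somewhere else in the URL, such as in the query string. A minimal sketch of a stricter check that compares only the host, using java.net.URI from the standard library. The class name ParserDispatch, the Site enum, and the host names are placeholders made up for illustration:

```java
import java.net.URI;

class ParserDispatch {
    // Hypothetical identifiers for the two sites discussed here
    enum Site { PAGE_A, PAGE_B }

    // Decides which parser to use by looking only at the host part of the URL
    static Site siteFor(String urlToPage) {
        String host = URI.create(urlToPage).getHost();
        if (host == null) {
            // Happens e.g. when the protocol is missing from the URL
            throw new IllegalArgumentException("URL has no host: " + urlToPage);
        }
        host = host.toLowerCase();
        if (host.equals("pagea.com") || host.endsWith(".pagea.com")) {
            return Site.PAGE_A;
        }
        if (host.equals("pageb.com") || host.endsWith(".pageb.com")) {
            return Site.PAGE_B;
        }
        throw new IllegalArgumentException("No parser available for " + host);
    }

    public static void main(String[] args) {
        System.out.println(siteFor("http://www.pagea.com/playlist.php?station=foo"));
    }
}
```

You could then switch on the returned value and call the matching parse method.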

You can connect to each page and parse it into a Document with a single line of code:

// Note: the protocol (http) is required here
Document doc = Jsoup.connect("http://pagewhaterver.com").get(); 

Without knowing the HTML or the structure of each page, here are some basic approaches:

Page A:

for( Element element : doc.select("p.plrow") )
{
    String title = element.ownText();                           // Title - output: '“Title” ()' (you have to replace the " and () here)
    String artist = element.select("a").first().text();         // Artist
    String label = element.select("span.sn_ld").first().text(); // Label

    // etc.
}
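To turn the own text into a clean title, one option is to keep only what sits between the curly quotes and then build the final string. This is just a sketch under the assumption that the own text always looks like “Title” (); TrackFormatter and its methods are names I made up, and the quotes are written as Unicode escapes to avoid file encoding issues:

```java
class TrackFormatter {
    private static final char OPEN_QUOTE = '\u201C';  // left curly quote
    private static final char CLOSE_QUOTE = '\u201D'; // right curly quote

    // Returns the text between the first opening and the last closing curly quote.
    // Falls back to the trimmed input if the quotes are not found.
    static String extractTitle(String rawOwnText) {
        int open = rawOwnText.indexOf(OPEN_QUOTE);
        int close = rawOwnText.lastIndexOf(CLOSE_QUOTE);
        if (open >= 0 && close > open) {
            return rawOwnText.substring(open + 1, close).trim();
        }
        return rawOwnText.trim();
    }

    // Builds a string of the form: Artist - "Title"
    static String format(String artist, String rawOwnText) {
        return artist.trim() + " - \"" + extractTitle(rawOwnText) + "\"";
    }
}
```

Inside the loop above you would call TrackFormatter.format(artist, title). Taking the text between the quotes, rather than deleting every parenthesis, keeps titles that contain parentheses themselves intact.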

Page B:

Similar to Page A, the artist and title can be selected like this:

String artist = doc.select("span.artist").first().text();
String title = doc.select("span.title").first().text();
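Since the question says that dc:title is also used for non-music items, selecting from the whole document may pick up titles that don't belong to a track. A rough sketch of one way around that, assuming every song sits in its own li with the classes segment and track as shown in the question: loop over those elements and select the artist and title inside each one. PageBTracks is a made-up name, and selectFirst needs a reasonably recent Jsoup version (with older ones, select(...).first() does the same job):

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class PageBTracks {
    // Collects one 'Artist - "Title"' string per track segment
    static List<String> tracks(Document doc) {
        List<String> result = new ArrayList<>();
        for (Element segment : doc.select("li.segment.track")) {
            // Searching inside the segment keeps artist and title paired
            Element artist = segment.selectFirst("span.artist");
            Element title = segment.selectFirst("span.title");
            if (artist == null || title == null) {
                continue; // skip segments that lack one of the two
            }
            result.add(artist.text() + " - \"" + title.text() + "\"");
        }
        return result;
    }

    public static void main(String[] args) {
        // Small inline sample modelled on the markup from the question
        String html = "<ul>"
                + "<li class=\"segment track\">"
                + "<span class=\"artist\" property=\"foaf:name\">Artist Name</span>"
                + "<span class=\"title\" property=\"dc:title\">Title</span></li>"
                + "<li class=\"segment\">"
                + "<span class=\"title\" property=\"dc:title\">Not a song</span></li>"
                + "</ul>";
        for (String track : tracks(Jsoup.parse(html))) {
            System.out.println(track);
        }
    }
}
```

The second li in the sample has a dc:title but no track class, so it is ignored.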

Here's a good overview of the Jsoup Selector API: http://jsoup.org/cookbook/extracting-data/selector-syntax
