
Building Strings by scraping html with JSoup

I'm a novice Java programmer, and am just now beginning to expand into the world of libraries, APIs, and the like. I'm at the point where I have an idea that is relatively simple, and can be my pet project when I'm not working on homework.

I'm interested in scraping HTML from a few different sites and building strings that look like: Artist - "Track Name". I've got one site working the way I want, but I feel it could be done a lot more smoothly. Here's the rundown on what I do for Site A:

I have JSoup create Elements for everything that is of the class plrow like so:

<p class="plrow"><b><a href="playlist.php?station=foo">Artist</a></b> “Title” (<span class="sn_ld"><a href="playlist.php?station=foo">Label</a></span>) <SMALL><b>N </b></SMALL></p></td></tr><tr class="ev"><td><a name="98069"></a><p class="pltime">Time</p>

From there, I create a String array of lines that are split after the last </p>, then use the following code to process the array:

for (int i = 0; i < tracks.length; i++) {
    // Strip the markup and keep only the text
    tracks[i] = Jsoup.parse(tracks[i]).text();
    // Keep everything before the closing curly quote, then put the quote back
    tracks[i] = tracks[i].split("”")[0] + "”";
}

Which is a pretty hackish way to get Artist "Title" the way I want, but the result is fine for me.

Site B is a little bit different.

I've determined that the Artists and Titles are all contained like this: <span class="artist" property="foaf:name">Artist Name</span> </a> </span> <span class="title" property="dc:title">Title</span>

along with more information, all inside of <li id="segmentevent-random" class="segment track" typeof="po:MusicSegment" about="/url"> song info </li>

I was trying to go through and grab all of the artists first, then all of the titles, and then merge the two lists. That didn't work, because the "dc:title" property used for the track title is also used for other, non-music items, so the two lists don't line up and I can't directly match an artist with a track.

I have spent the lion's share of this weekend trying to get this working by viewing countless questions tagged with Jsoup, and spending a lot of time reading the Jsoup cookbook and API guide. I have a feeling that part of my trouble could also stem from my relatively limited knowledge of how web pages are coded, though that may mostly be my trouble with my understanding of how to plug these bits of code into Jsoup.

I appreciate any help or guidance, and I've got to say, it's really nice to ask a non-homework question here (though I find quite a few hints from what others have asked! ;) )

Common:

If you want to parse content from several different websites, it's a good idea to distinguish between them. For example, you can decide whether to parse Page A or Page B based on the URL.

Example:

if( urlToPage.contains("pagea.com") )
{
    // Call parsemethod for Page A or create parserclass
}
else if( urlToPage.contains("pageb.com") )
{
    // Call parsemethod for Page B or create parserclass
}
// ... 
else
{
    // Eg. throw Exception because there's no parser available
}
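One caveat with contains(): it also matches when the site name only shows up in the path or query string of some other URL. Here is a minimal sketch that compares the host instead, using java.net.URI from the standard library. The names ParserChooser, Site and siteFor are made up for this example, and pagea.com / pageb.com are the same placeholders as above.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper: picks a parser based on the URL's host name.
final class ParserChooser {

    // One constant per supported site (placeholders, as in the answer above).
    enum Site { PAGE_A, PAGE_B }

    static Site siteFor(String urlToPage) {
        // getHost() returns null when the URL has no scheme/authority part
        String host = URI.create(urlToPage).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in URL: " + urlToPage);
        }
        host = host.toLowerCase(Locale.ROOT);

        // Accept the bare domain and any subdomain such as "www."
        if (host.equals("pagea.com") || host.endsWith(".pagea.com")) {
            return Site.PAGE_A;
        }
        if (host.equals("pageb.com") || host.endsWith(".pageb.com")) {
            return Site.PAGE_B;
        }
        throw new IllegalArgumentException("No parser available for host: " + host);
    }

    public static void main(String[] args) {
        // prints: PAGE_A
        System.out.println(siteFor("http://www.pagea.com/playlist.php?station=foo"));
    }
}
```

You could then switch on the returned Site value to call the matching parse method.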

You can connect and parse each page into a document with a single line of code:

// Note: the protocol (http) is required here
Document doc = Jsoup.connect("http://pagewhaterver.com").get(); 

Without knowing the full HTML or the structure of each page, here are some basic approaches:

Page A:

for( Element element : doc.select("p.plrow") )
{
    String title = element.ownText();                           // Title - output: '“Title” ()' (you still have to strip the curly quotes and the empty "()" here)
    String artist = element.select("a").first().text();         // Artist
    String label = element.select("span.sn_ld").first().text(); // Label

    // etc.
}
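To get from those values to a string like Artist - "Title", the leftover quotes and the empty "()" still need to go. A rough sketch using only plain string handling follows; TrackFormatter, extractTitle and format are hypothetical names. It assumes the title is always wrapped in curly quotes, as in the sample row from the question.

```java
// Hypothetical helper: turns the artist text and the <p>'s own text
// (e.g. “Title” ()) into a string of the form: Artist - "Title"
final class TrackFormatter {

    private static final char OPEN_QUOTE = '\u201C';  // left curly quote
    private static final char CLOSE_QUOTE = '\u201D'; // right curly quote

    // Returns the text between the first opening and the last closing curly quote.
    // Falls back to the trimmed input when the quotes are missing.
    static String extractTitle(String ownText) {
        int open = ownText.indexOf(OPEN_QUOTE);
        int close = ownText.lastIndexOf(CLOSE_QUOTE);
        if (open >= 0 && close > open) {
            return ownText.substring(open + 1, close).trim();
        }
        return ownText.trim();
    }

    static String format(String artist, String ownText) {
        return artist.trim() + " - \"" + extractTitle(ownText) + "\"";
    }

    public static void main(String[] args) {
        // prints: Artist - "Title"
        System.out.println(format("Artist", "\u201CTitle\u201D ()"));
    }
}
```

Inside the loop you would pass in the artist and title values selected above. Cutting between the quotes rather than deleting every bracket keeps titles that contain parentheses of their own intact.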

Page B:

Similar to Page A, the artist and title can be selected like this:

String artist = doc.select("span.artist").first().text();
String title = doc.select("span.title").first().text();
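Note that these two lines only return the first artist and the first title in the whole document. To keep each artist paired with its own track, and to skip the non-music items that also use dc:title, you can select within each li.segment.track element instead. This is only a sketch based on the snippet in the question: SegmentTracks and extract are made-up names, and the HTML in main is a cut-down guess at the real markup.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper: builds one string per music segment.
final class SegmentTracks {

    static List<String> extract(Document doc) {
        List<String> tracks = new ArrayList<>();

        // Only <li> elements carrying both classes "segment" and "track" are visited
        for (Element segment : doc.select("li.segment.track")) {
            // Searching inside the segment keeps artist and title paired
            Element artist = segment.select("span.artist").first();
            Element title = segment.select("span.title").first();

            // first() returns null when nothing matched, so skip incomplete entries
            if (artist == null || title == null) {
                continue;
            }
            tracks.add(artist.text() + " - \"" + title.text() + "\"");
        }
        return tracks;
    }

    public static void main(String[] args) {
        // Simplified stand-in for the real page
        String html = "<ul>"
                + "<li class=\"segment track\">"
                + "<span class=\"artist\">Artist Name</span>"
                + "<span class=\"title\">Title</span></li>"
                + "<li class=\"segment\"><span class=\"title\">Not a song</span></li>"
                + "</ul>";

        for (String track : extract(Jsoup.parse(html))) {
            System.out.println(track);
        }
    }
}
```

If the real page nests things differently, the selector strings are the only part that should need adjusting.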

Here's a good overview of the Jsoup Selector API: http://jsoup.org/cookbook/extracting-data/selector-syntax
