
How to get a value from a webpage using java

From the following URL, http://www.manta.com/c/mx4s4sw/bowflex-academy, I want to get the SIC Code. Here is my code and the error:

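import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class GoogleScraperDemo {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("http://www.manta.com/c/mx4s4sw/bowflex-academy").ignoreHttpErrors(true).get();
            // The NullPointerException is thrown on the next line.
            String textContents = doc.select("itemprop").first().text();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}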

Exception in thread "main" java.lang.NullPointerException at com.inndata.connection.GoogleScraperDemo.main(GoogleScraperDemo.java:22)

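The selector "itemprop" is incorrect. Jsoup reads it as an element (tag) name rather than an attribute name, so select() matches nothing, first() returns null, and calling text() on that null throws the NullPointerException.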

The SIC code in the document is in a block of HTML that looks like this:

  <tr>
      <th class="text-left" style="width:30%;">SIC Code</th>
      <td rel="sicDetails"><span itemprop="isicV4">7991</span>, Physical Fitness Facilities</td>
  </tr>

The selector should be something like

"span[itemprop='isicV4']"

I have not tested this. Also, this will break whenever the website owners change the layout or the itemprop value on that line. You could get fancier by looking for the string "SIC Code" and then searching just below it, but any such scraping is likely to be brittle to website changes, and there's not much you can do except react after the fact.
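To illustrate the label-based fallback mentioned above, here is a minimal sketch. It uses java.util.regex rather than Jsoup only so that it runs without extra dependencies; the class name `SicLabelLookup`, the method `findSicCode`, and the pattern itself are assumptions for illustration, written against the HTML row quoted earlier.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SicLabelLookup {
    // Hypothetical pattern: the "SIC Code" header cell, then the next <td>,
    // then any tags opening inside it, then the first run of digits in the text.
    // Tags are skipped explicitly so the "4" in itemprop="isicV4" is not matched.
    private static final Pattern SIC_AFTER_LABEL = Pattern.compile(
            "<th[^>]*>\\s*SIC Code\\s*</th>\\s*<td[^>]*>(?:\\s*<[^>]*>)*\\s*(\\d+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the digits that follow the "SIC Code" label, if the row is present.
    static Optional<String> findSicCode(String html) {
        Matcher m = SIC_AFTER_LABEL.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Sample row copied from the page fragment quoted in this answer.
        String row = "<tr>\n"
                + "  <th class=\"text-left\" style=\"width:30%;\">SIC Code</th>\n"
                + "  <td rel=\"sicDetails\"><span itemprop=\"isicV4\">7991</span>, Physical Fitness Facilities</td>\n"
                + "</tr>";
        System.out.println(findSicCode(row).orElse("not found"));
    }
}
```

Returning an Optional makes the "row not found" case explicit, which is the situation that caused the NullPointerException in the question. This is just as brittle as the selector if the label text or table layout changes.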

The website you are trying to scrape doesn't allow scraping. If you use third-party tools like Jsoup or HtmlUnit, it will detect the request as coming from a bot.

So try fetching the page with Java's built-in "java.net" library instead, and you should be able to scrape it.

Here are the key steps:

  1. Create a URL object from the URL string:

    URL url = new URL(targetPageURLString);

  2. Open an HTTP connection through the URL:

    HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection();

  3. Get the input stream that carries the web response:

    InputStream urlStream = urlConnection.getInputStream();

  4. Read the response from the stream into a byte array, then convert that byte array to a String.

  5. Use a regex to extract the required content from the String.
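Put together, the five steps above might look like the following sketch. The class name `SicCodeFetcher`, the timeouts, the UTF-8 decoding, and the regex are assumptions for illustration; the pattern is written against the `<span itemprop="isicV4">` markup quoted earlier in this thread. Whether the site actually serves the page to this client is not something the sketch can guarantee.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SicCodeFetcher {
    // Assumed markup: <span itemprop="isicV4">7991</span>
    private static final Pattern SIC_SPAN = Pattern.compile(
            "<span[^>]*\\bitemprop\\s*=\\s*[\"']isicV4[\"'][^>]*>\\s*(\\d+)\\s*</span>");

    // Steps 1-4: build the URL, open the connection, read all bytes, decode to a String.
    static String fetch(String pageUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setConnectTimeout(10_000); // assumed values, in milliseconds
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            // readAllBytes needs Java 9+; UTF-8 is assumed rather than read from the headers.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    // Step 5: pull the digits out of the itemprop span, if it is present.
    static Optional<String> extractSicCode(String html) {
        Matcher m = SIC_SPAN.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // With a URL argument, fetch the live page; otherwise use the quoted sample fragment.
        String html = args.length > 0
                ? fetch(args[0])
                : "<td rel=\"sicDetails\"><span itemprop=\"isicV4\">7991</span>, Physical Fitness Facilities</td>";
        System.out.println(extractSicCode(html).orElse("SIC code not found"));
    }
}
```

Keeping the fetch and the extraction in separate methods lets the regex be checked against a saved copy of the page without touching the network. Note that getInputStream() throws an IOException for error responses, so a blocked request surfaces as an exception rather than an empty result.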


 