How to get a value from a webpage using Java

In the following URL, http://www.manta.com/c/mx4s4sw/bowflex-academy, I want to get the SIC Code. Here is my code and the error:

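import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class GoogleScraperDemo {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("http://www.manta.com/c/mx4s4sw/bowflex-academy").ignoreHttpErrors(true).get();
            // The NullPointerException is thrown on this line.
            String textContents = doc.select("itemprop").first().text();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}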

Exception in thread "main" java.lang.NullPointerException at com.inndata.connection.GoogleScraperDemo.main(GoogleScraperDemo.java:22)

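The selector "itemprop" is incorrect. It matches elements whose tag name is itemprop, not elements that carry an itemprop attribute, so select() finds nothing, first() returns null, and calling text() on that null throws the NullPointerException.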

The SIC code in the document is in a block of HTML that looks like this:

  <tr>
      <th class="text-left" style="width:30%;">SIC Code</th>
      <td rel="sicDetails"><span itemprop="isicV4">7991</span>, Physical Fitness Facilities</td>
  </tr>

The selector should be something like:

"span[itemprop='isicV4']"

I have not tested this. Also, it will break whenever the website owners change the layout or the itemprop value on that line. You could get fancier by looking for the string SIC Code and then searching just below it, but any such scraping is likely to be brittle to website changes, and there's not much you can do except react after the fact.
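The "look for the string SIC Code and search just below it" idea could be sketched roughly as follows. This is only an illustration run against the HTML fragment quoted above, using java.util.regex so it needs no extra libraries. The class name SicLabelFinder and the method findSicCode are made up for this example, and the pattern assumes the code is the first piece of text in the cell that follows the label.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SicLabelFinder {
    // Finds the "SIC Code" header cell, skips any tags that open the next <td>,
    // then captures the first run of digits. Assumes the layout quoted above.
    private static final Pattern SIC_AFTER_LABEL = Pattern.compile(
            "<th[^>]*>\\s*SIC Code\\s*</th>\\s*<td[^>]*>(?:\\s*<[^>]+>)*\\s*(\\d+)",
            Pattern.CASE_INSENSITIVE);

    public static Optional<String> findSicCode(String html) {
        Matcher m = SIC_AFTER_LABEL.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<tr>\n"
                + "  <th class=\"text-left\" style=\"width:30%;\">SIC Code</th>\n"
                + "  <td rel=\"sicDetails\"><span itemprop=\"isicV4\">7991</span>, Physical Fitness Facilities</td>\n"
                + "</tr>";
        System.out.println(findSicCode(html).orElse("SIC code not found"));
    }
}
```

Anchoring on the visible label instead of the itemprop value only moves the fragility elsewhere: it survives a renamed attribute but still breaks if the label text or the table layout changes.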

The website you are trying to scrape doesn't allow scraping. If you use third-party tools like Jsoup or HtmlUnit, the site will detect the request as a bot.

So try using Java's built-in java.net package to fetch the web page, and you should then be able to scrape it.

Here are the key steps:

  1. Create a URL object from the URL string:

    URL url = new URL(targetPageURLString);

  2. Open an HTTP connection through the URL:

    HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection();

  3. Read the web response from the input stream:

    InputStream urlStream = urlConnection.getInputStream();

  4. After reading the response from the stream byte by byte, convert the resulting byte array to a String.

  5. Use a regex to extract the required content from that String.
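Put together, the five steps might look roughly like the sketch below. It is not a tested scraper: the class name SicCodeFetcher, the helper method names, the timeouts, the UTF-8 charset and the regex are all assumptions made for illustration. The regex assumes the SIC code sits in a span with itemprop="isicV4", as in the HTML fragment quoted earlier on this page. Whether the site actually serves the page to this client has not been verified.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SicCodeFetcher {
    // Assumed markup: <span itemprop="isicV4">7991</span>
    private static final Pattern SIC_SPAN = Pattern.compile(
            "<span[^>]*itemprop=[\"']isicV4[\"'][^>]*>\\s*(\\d+)\\s*</span>");

    // Steps 1-3: build the URL, open the connection, obtain the response stream.
    public static String fetch(String targetPageURLString) throws IOException {
        URL url = new URL(targetPageURLString);
        HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection();
        urlConnection.setConnectTimeout(10_000); // timeouts are arbitrary choices
        urlConnection.setReadTimeout(10_000);
        try (InputStream urlStream = urlConnection.getInputStream()) {
            return readAll(urlStream);
        } finally {
            urlConnection.disconnect();
        }
    }

    // Step 4: drain the stream into a byte array, then decode it (UTF-8 assumed).
    public static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Step 5: extract the code with a regex; returns null when nothing matches.
    public static String extractSicCode(String html) {
        Matcher m = SIC_SPAN.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws IOException {
        String html = fetch("http://www.manta.com/c/mx4s4sw/bowflex-academy");
        String sic = extractSicCode(html);
        System.out.println(sic != null ? "SIC code: " + sic : "SIC code not found");
    }
}
```

Note that getInputStream() throws an IOException when the server answers with an error status, so a blocked request shows up as an exception rather than as an empty page.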
