Java Web Scraping using Jsoup

I'm trying to write a Java application that scrapes information from web sites. After some googling I managed to build a very simple scraper, but it is not enough: it does not pick up some of the information on this website, especially the part I actually want to scrape.

        // Attempt 1: every <a> element, raw href attribute value
        Elements links = htmlDocument.select("a");
        for (Element link : links) {
            this.links.add(link.attr("href"));
        }

        // Attempt 2: only <a> elements that have an href, resolved to absolute URLs
        Elements linksOnPage = htmlDocument.select("a[href]");
        System.out.println("Found (" + linksOnPage.size() + ") links");
        for (Element link : linksOnPage) {
            this.links.add(link.absUrl("href"));
        }

I've tried both snippets, but I can't find that link anywhere in the Elements object. I believe the information I want is the result of a search, so by the time my program connects to that URL, the information is gone. How can I solve this? I want a program that scrapes the result of that search whenever it is started.
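For reference, the two snippets above differ in what they select and in what they store: `select("a")` with `attr("href")` takes every anchor and its raw attribute value, while `select("a[href]")` with `absUrl("href")` takes only anchors that have an `href` and resolves it against the document's base URI. A minimal sketch on an inline HTML string follows; the markup and the `example.com` base URL are made up for illustration and are not taken from the site in question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class LinkExtractionDemo {

    // Hypothetical sample markup, not taken from the site in the question.
    static final String HTML =
            "<a name=\"top\">anchor without href</a>"
          + "<a href=\"detail.do?id=1\">relative link</a>"
          + "<a href=\"http://example.com/about\">absolute link</a>";

    // Hypothetical base URL used to resolve relative links.
    static final String BASE_URI = "http://example.com/jobs/list.do";

    /** Attempt 1: every <a> element, raw attribute value (may be empty or relative). */
    static List<String> rawHrefs(Document doc) {
        List<String> out = new ArrayList<>();
        for (Element link : doc.select("a")) {
            out.add(link.attr("href"));
        }
        return out;
    }

    /** Attempt 2: only <a> elements with an href, resolved against the base URI. */
    static List<String> absoluteHrefs(Document doc) {
        List<String> out = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            out.add(link.absUrl("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML, BASE_URI);
        System.out.println(rawHrefs(doc));
        System.out.println(absoluteHrefs(doc));
    }
}
```

Neither variant can return a link that is absent from the HTML the server sent, so if both lists lack the wanted link, the response itself is the place to look.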

Here is the link to the web site.

So my questions are:

1. How do I get that link into my code's Elements object? What am I doing wrong?

2. Is there any way to pick that link and follow only that link (not all hyperlinks)?
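On question 2, one possible approach is to narrow the CSS selector so that it matches only the wanted anchor, and then connect to that single URL. The sketch below assumes hypothetical markup in which result links sit inside an element with class `result` and have `detail.do` in their `href`; neither assumption is confirmed for the site in question, and the class and method names are illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;

class SingleLinkFollower {

    /**
     * Returns the absolute URL of the first link matching the selector,
     * or null when nothing matches.
     */
    static String findTargetUrl(Document doc, String cssSelector) {
        Element link = doc.selectFirst(cssSelector);
        return link == null ? null : link.absUrl("href");
    }

    /** Fetches only the matched link. This performs a network request. */
    static Document follow(Document doc, String cssSelector, String userAgent) throws IOException {
        String target = findTargetUrl(doc, cssSelector);
        if (target == null || target.isEmpty()) {
            throw new IOException("No link matched selector: " + cssSelector);
        }
        return Jsoup.connect(target).userAgent(userAgent).get();
    }

    public static void main(String[] args) {
        // Hypothetical markup: one navigation link and one search-result link.
        String html = "<div class=\"nav\"><a href=\"/home\">Home</a></div>"
                    + "<div class=\"result\"><a href=\"detail.do?id=42\">CAD designer</a></div>";
        Document doc = Jsoup.parse(html, "http://example.com/jobs/list.do");
        // Only the result link matches; the navigation link is ignored.
        System.out.println(findTargetUrl(doc, "div.result a[href*=detail.do]"));
    }
}
```

The selector does the filtering, so the loop over all hyperlinks is no longer needed. This only helps once the wanted link is actually present in the fetched document.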

    final Document doc = Jsoup.connect("http://www.work.go.kr/empInfo/empInfoSrch/list/dtlEmpSrchList.do?pageIndex=2&pageUnit=10&len=0&tot=0&relYn=N&totalEmpCount=0&jobsCount=0&mainSubYn=N&region=41000&lastIndex=1&siteClcd=all&firstIndex=1&pageSize=10&recordCountPerPage=10&rowNo=0&softMatchingPossibleYn=N&benefitSrchAndOr=O&keyword=CAD&charSet=EUC-KR&startPos=0&collectionName=tb_workinfo&softMatchingMinRate=+66&softMatchingMaxRate=100&empTpGbcd=1&onlyTitleSrchYn=N&onlyContentSrchYn=N&serialversionuid=3990642507954558837&resultCnt=10&sortOrderBy=DESC&sortField=DATE").userAgent(USER_AGENT).get();


    try
    {
        Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
        Document htmlDocument = connection.get();
        this.htmlDocument = htmlDocument;
        String qqq=htmlDocument.toString();
        System.out.println(qqq);
        if(connection.response().statusCode() == 200) // 200 is the HTTP OK status code
                                                      // indicating that everything is great.
        {
            System.out.println("\n**Visiting** Received web page at " + url);
        }
        if(!connection.response().contentType().contains("text/html"))
        {
            System.out.println("**Failure** Retrieved something other than HTML");
            return false;
        }
        
        Elements linksOnPage = htmlDocument.select("a[href]");
        System.out.println("Found (" + linksOnPage.size() + ") links");
        for(Element link : linksOnPage)
        {
            this.links.add(link.absUrl("href"));
            System.out.println(link.absUrl("href"));
        }
        return true;
    }
    catch(IOException ioe)
    {
        // We were not successful in our HTTP request
        return false;
    }

This is the entire code I use for scraping. I took it from this site.

I found the issue but couldn't resolve it. I was trying to scrape information from a web page that shows the results of a specific search. The issue was that the website somehow does not let me connect from my Java application using jsoup, probably to protect its content. That's why the elements I needed were missing: they are simply not in the response. The website offers an open API for a fee, so I decided to use other websites.
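One way to check a diagnosis like this is to look at what the server actually returns before extracting links: the status code, the content type, and whether an element that should hold the results is present. The following is a sketch only. The selector is a hypothetical placeholder supplied by the caller, and whether the site blocks the request, expects cookies, or fills in results with JavaScript is not established in this post.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

class ResponseDiagnostics {

    /** True when the parsed page contains at least one element matching the selector. */
    static boolean hasResults(Document doc, String resultSelector) {
        return !doc.select(resultSelector).isEmpty();
    }

    /** Fetches the page without throwing on HTTP error codes and summarizes the response. */
    static String diagnose(String url, String userAgent, String resultSelector) throws IOException {
        Connection.Response response = Jsoup.connect(url)
                .userAgent(userAgent)
                .ignoreHttpErrors(true) // keep 4xx/5xx responses instead of throwing
                .execute();
        Document doc = response.parse();
        return "status=" + response.statusCode()
                + ", contentType=" + response.contentType()
                + ", resultsPresent=" + hasResults(doc, resultSelector);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: ResponseDiagnostics <url> <css-selector-for-results>");
            return;
        }
        // "Mozilla/5.0" is a placeholder user agent string.
        System.out.println(diagnose(args[0], "Mozilla/5.0", args[1]));
    }
}
```

A 200 status with `resultsPresent=false` would suggest that the server sent a different page than the browser shows, which is consistent with the missing elements described above, though it does not say why.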
