Multiple connections on the controller service (Spring)

I have written a controller which takes a domain name as input, crawls the whole site, and returns the result in JSON format.

http://crawlmysite-tgugnani.rhcloud.com/getUrlCrawlData/www.google.com

This returns the data for Google.

http://crawlmysite-tgugnani.rhcloud.com/getUrlCrawlData/www.yahoo.com

This returns the data for Yahoo.

If I request these two URLs simultaneously, I get mixed data: the results of one crawl affect the other, even when I hit the URLs from different machines.

Here is my controller:

@RequestMapping("/getUrlCrawlData/{domain:.+}")
@ResponseBody
public String registerContact(@PathVariable("domain") String domain) throws HttpStatusException, SQLException, IOException {
    List<URLdata> urldata = null;
    Gson gson = new Gson();
    String json;
    urldata = crawlService.crawlURL("http://" + domain);
    json = gson.toJson(urldata);
    return json;
}

What do I need to modify to allow multiple independent connections?

Update

The following is my crawl service:

public List<URLdata> crawlURL(String domain) throws HttpStatusException, SQLException, IOException{
    testDomain = domain;
    urlList.clear();
    urlMap.clear();
    urldata.clear();
    urlList.add(testDomain);
    processPage(testDomain);
    //Get all pages
    for(int i = 1; i < urlList.size(); i++){
        if(urlList.size()>=500){
            break;
        }
        processPage(urlList.get(i));
        //System.out.println(urlList.get(i));
    }
    //Calculate Time
    for(int i = 0; i < urlList.size(); i++){
        getTitleAndMeta(urlList.get(i));
    }
    return urldata;
}

public static void processPage(String URL) throws SQLException, IOException, HttpStatusException{

    //get useful information
    try {
        Connection.Response response = Jsoup.connect(URL)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(10000)
                .execute();
        Document doc = response.parse();

        //get all links and queue the ones on the same domain in urlList
        Elements questions = doc.select("a[href]");
        for (Element link : questions) {
            String linkName = link.attr("abs:href");
            if (linkName.contains(testDomain.replaceAll("http://www.", ""))) {
                if (linkName.contains("#")) {
                    linkName = linkName.substring(0, linkName.indexOf("#"));
                }
                if (linkName.contains("?")) {
                    linkName = linkName.substring(0, linkName.indexOf("?"));
                }
                if (!urlList.contains(linkName) && urlList.size() <= 500) {
                    urlList.add(linkName);
                }
            }
        }
    }
    catch (HttpStatusException e) {
        System.out.println(e);
    }
    catch (SocketTimeoutException e) {
        System.out.println(e);
    }
    catch (UnsupportedMimeTypeException e) {
        System.out.println(e);
    }
    catch (UnknownHostException e) {
        System.out.println(e);
    }
    catch (MalformedURLException e) {
        System.out.println(e);
    }
}

Each of your requests (http://crawlmysite-tgugnani.rhcloud.com/getUrlCrawlData/www.google.com and http://crawlmysite-tgugnani.rhcloud.com/getUrlCrawlData/www.yahoo.com) is processed in a separate thread. You have two invocations of the crawlURL() method running simultaneously, but both use the same variables (testDomain, urlList, urlMap and urldata), so they corrupt each other's data in these variables.

One way to fix the problem is to declare these variables locally (inside the method). This way, new instances of these variables are created for each invocation of crawlURL(). Alternatively, you can create a new instance of your CrawlService class for each invocation of the crawlURL() method.
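As a rough illustration of the first option, here is a minimal sketch of a crawl method that keeps its working list in a local variable. The class names StatelessCrawler and LocalStateCrawlDemo, the linkFetcher function (a stand-in for the Jsoup call in processPage()), and the FAKE_WEB map are all assumptions made for this example; none of them come from the question's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical stand-in for CrawlService: it has no mutable fields,
// so concurrent calls have nothing to overwrite.
class StatelessCrawler {
    private static final int MAX_URLS = 500;

    // Assumed abstraction: given a page URL, return the links found on that page.
    private final Function<String, List<String>> linkFetcher;

    StatelessCrawler(Function<String, List<String>> linkFetcher) {
        this.linkFetcher = linkFetcher;
    }

    List<String> crawlURL(String domain) {
        // Local to this invocation: every request thread gets its own list.
        List<String> urlList = new ArrayList<>();
        urlList.add(domain);
        for (int i = 0; i < urlList.size() && urlList.size() < MAX_URLS; i++) {
            for (String link : linkFetcher.apply(urlList.get(i))) {
                boolean sameSite = link.startsWith(domain);
                if (sameSite && !urlList.contains(link) && urlList.size() < MAX_URLS) {
                    urlList.add(link);
                }
            }
        }
        return urlList;
    }
}

class LocalStateCrawlDemo {
    // Tiny in-memory "web" so the sketch runs without network access.
    static final Map<String, List<String>> FAKE_WEB = Map.of(
            "http://a.example", List.of("http://a.example/1", "http://b.example/x"),
            "http://a.example/1", List.of("http://a.example"),
            "http://b.example", List.of("http://b.example/x"),
            "http://b.example/x", List.of("http://a.example/1"));

    // Runs two crawls at the same time on one shared crawler instance.
    static List<List<String>> crawlConcurrently() throws Exception {
        StatelessCrawler crawler =
                new StatelessCrawler(url -> FAKE_WEB.getOrDefault(url, List.of()));
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<List<String>> a = pool.submit(() -> crawler.crawlURL("http://a.example"));
            Future<List<String>> b = pool.submit(() -> crawler.crawlURL("http://b.example"));
            return List.of(a.get(), b.get());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(crawlConcurrently());
    }
}
```

Because the list lives on each thread's stack frame rather than in the service object, the single crawler instance can safely stay a shared singleton.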

Synchronizing threads would be a bad idea here, because each request would have to wait for the previous one to complete before crawlURL() could process it.
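The cost of synchronization can be made visible with a small sketch. It assumes a hypothetical SynchronizedCrawlDemo class whose synchronized crawlURL() just sleeps to simulate slow network I/O, and it records how many calls are ever inside the method at once.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SynchronizedCrawlDemo {
    private final AtomicInteger active = new AtomicInteger();
    private final AtomicInteger maxActive = new AtomicInteger();

    // Stand-in for a synchronized crawlURL(); the sleep simulates a slow crawl.
    synchronized void crawlURL(String domain) throws InterruptedException {
        int now = active.incrementAndGet();
        maxActive.accumulateAndGet(now, Math::max);
        Thread.sleep(50);
        active.decrementAndGet();
    }

    // Fires the given number of simultaneous "requests" at one shared service
    // and reports the highest number of crawls that ran at the same time.
    static int maxConcurrentCrawls(int requests) throws InterruptedException {
        SynchronizedCrawlDemo service = new SynchronizedCrawlDemo();
        ExecutorService pool = Executors.newFixedThreadPool(requests);
        for (int i = 0; i < requests; i++) {
            String domain = "http://site" + i + ".example";
            pool.submit(() -> {
                try {
                    service.crawlURL(domain);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return service.maxActive.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("max concurrent crawls: " + maxConcurrentCrawls(4));
    }
}
```

With the synchronized keyword in place, only one crawl is ever in progress, so with N simultaneous requests the last caller waits for roughly N crawl durations.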

As far as Spring MVC is concerned, every request runs in a separate thread. So I think the problem is in crawlService, which, I suppose, is a shared (singleton-like) object that is not stateless. Try creating a new crawl service for every request and check whether your data is still mixed. If creating the crawl service is an expensive operation, you should rewrite it to work in a stateless way.

@RequestMapping("/getUrlCrawlData/{domain:.+}")
@ResponseBody
public String registerContact(@PathVariable("domain") String domain) throws HttpStatusException, SQLException, IOException {

    Gson gson = new Gson();
    // Create a fresh service for each request instead of using the shared one
    List<URLdata> urldata = new CrawlService().crawlURL("http://" + domain);
    return gson.toJson(urldata);
}
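One caveat: creating a new CrawlService per request only isolates state that lives in instance fields. The question's processPage() is declared static yet reads testDomain and urlList, which suggests (as an inference, since the field declarations are not shown) that those fields are static. Static fields are shared by every instance. The sketch below uses two hypothetical classes, StaticStateService and InstanceStateService, to show the difference.

```java
import java.util.ArrayList;
import java.util.List;

class StaticStateService {
    // static: one list for the whole JVM, regardless of how many instances exist
    static final List<String> urlList = new ArrayList<>();

    void add(String url) { urlList.add(url); }
    int size() { return urlList.size(); }
}

class InstanceStateService {
    // instance field: each object created with new gets its own list
    private final List<String> urlList = new ArrayList<>();

    void add(String url) { urlList.add(url); }
    int size() { return urlList.size(); }
}

class PerRequestInstanceDemo {
    // Returns {size seen by the first static-state instance,
    //          size seen by the first instance-state instance}
    // after each of two "requests" adds one URL to its own service object.
    static int[] sizesAfterOneAddEach() {
        StaticStateService.urlList.clear(); // reset so repeated runs behave the same

        StaticStateService s1 = new StaticStateService();
        StaticStateService s2 = new StaticStateService();
        s1.add("http://a.example");
        s2.add("http://b.example");

        InstanceStateService i1 = new InstanceStateService();
        InstanceStateService i2 = new InstanceStateService();
        i1.add("http://a.example");
        i2.add("http://b.example");

        return new int[] { s1.size(), i1.size() };
    }

    public static void main(String[] args) {
        int[] sizes = sizesAfterOneAddEach();
        System.out.println("static: " + sizes[0] + ", instance: " + sizes[1]);
    }
}
```

If the real service does use static fields, they would need to become instance fields or local variables (and processPage() non-static, or given the list as a parameter) before a per-request instance helps.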

I think

urldata = crawlService.crawlURL("http://"+domain);

this call to the crawl service is the one affected by multiple requests arriving simultaneously.

Check whether crawlService is thread-safe.

That is, check whether the crawlURL() method is synchronized; if not, make it synchronized.

Otherwise, synchronize the block that calls crawlService inside the controller.
