What happens when using Jsoup.connect()? Why is it so slow?

I'm using the following code to load a document:

Document doc = Jsoup.connect("http://www.some.site.with.lotsof.images/")
        .header("Accept-Encoding", "gzip, deflate")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
        .maxBodySize(0)
        .timeout(600000).get();  // So Slow (~10 Seconds)
Elements lyricList = doc.getElementsByClass("some-class");

I need only the src of the images, so I only need the plain HTML text to be loaded.

Is the line slow because it loads the images from the URL?

I mean, does Jsoup.connect() wait for the whole page to be loaded along with the images?

Instead of using Jsoup for both fetching and parsing, try combining OkHttp for fetching with Jsoup for parsing:

OkHttpClient okHttp = new OkHttpClient();
Request request = new Request.Builder().url("https://example.com").get().build();
Document doc = Jsoup.parse(okHttp.newCall(request).execute().body().string());

It made a great difference in my case. Here are the average results of a simple benchmark I ran:

OkHttp + Jsoup: 283 ms

Jsoup: 476 ms
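The answer does not show how the benchmark was run. As a rough sketch of how such averages could be collected, the helper below times a task over several runs and returns the mean. `FetchBenchmark` and `averageMillis` are hypothetical names, not part of Jsoup or OkHttp, and the `Thread.sleep` call is only a stand-in for the real fetch-and-parse call.

```java
import java.util.concurrent.Callable;

class FetchBenchmark {

    /** Runs the task `runs` times and returns the mean wall-clock time in milliseconds. */
    static double averageMillis(Callable<?> task, int runs) throws Exception {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.call();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / runs;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in workload (assumption): replace with the fetch-and-parse call under test.
        double ms = averageMillis(() -> { Thread.sleep(20); return null; }, 5);
        System.out.printf("average: %.1f ms%n", ms);
    }
}
```

Timings taken this way include network latency and server response time, so they vary from run to run; comparing both approaches against the same URL in the same session gives the fairest picture.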

A Jsoup connection might be slow because of:

  • your internet connection speed, or
  • CPU usage (some other program is eating up resources), or
  • the response speed of the web server you are accessing.

I've been scraping thousands of pages, and the three factors above (especially the third) have been the most likely causes of a slow Jsoup.connect(). In your case, I believe the web server you are connecting to is what slows things down, because Jsoup does not wait for images to load; it gives you whatever the initial HTML response from the server is.
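The point that fetching a page's HTML does not pull in its images can be checked locally. The sketch below is an illustration under stated assumptions: it uses the JDK's built-in `HttpServer` and `HttpURLConnection` as a stand-in for Jsoup's fetch, and `HtmlOnlyFetchDemo`, `requestedPaths`, and `/big.png` are hypothetical names. It serves a page containing an `<img>` tag, fetches it once, and records which paths the server was asked for.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class HtmlOnlyFetchDemo {

    /** Starts a throwaway local server, fetches "/" once, and returns every path requested. */
    static List<String> requestedPaths() throws IOException {
        List<String> seen = new CopyOnWriteArrayList<>();
        byte[] page = "<html><body><img src=\"/big.png\"></body></html>"
                .getBytes(StandardCharsets.UTF_8);

        // Port 0 asks the OS for any free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            seen.add(exchange.getRequestURI().getPath()); // log each request path
            exchange.sendResponseHeaders(200, page.length);
            exchange.getResponseBody().write(page);
            exchange.close();
        });
        server.start();
        try {
            URL url = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/").toURL();
            HttpURLConnection conn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
            try (InputStream in = conn.getInputStream()) {
                in.readAllBytes(); // read the HTML body only (Java 9+)
            }
        } finally {
            server.stop(0);
        }
        return seen;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(requestedPaths());
    }
}
```

The returned list is expected to contain only `/`, with no request for `/big.png`: a plain HTTP GET downloads the markup, and fetching the images is a separate step that only a browser (or code written to do so) performs.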
