
“no javascript” error when trying to scrape web page

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SearchWalm {
    public static void main(String[] args) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.walmart.ca/en/electronics/N-32+103/page-13?sortBy=newest&orderBy=DESC"))
                .GET()
                .build();

        HttpResponse<String> response = client.send(request,
                HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body());
    }
}

I'm trying to write a program that will sift through pages in the Walmart clearance section, then search for a keyword and tell me which page it found the keyword on.

I'm getting the errors “no javascript” and “your web browser is not running javascript”. Do I need to run this through a browser, or is there a Java-only way of doing this?

A headless browser can solve many scraping problems, but unfortunately this website loads its content on demand using JavaScript. To scrape data that is loaded on demand, you need an actual browser.

We use Jsoup and Selenium WebDriver to solve this problem. Selenium WebDriver allows an implicit wait (you set a timeout) or a fluent wait; using such a wait, we wait until the desired data has loaded completely. After receiving the content, you parse it with Jsoup and extract your desired result.
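As a rough sketch of the fluent-wait variant mentioned above (the 30-second timeout, the 500 ms polling interval, and the ".title" selector are assumptions for illustration, not values from the original answer), waiting for a specific element before handing the rendered page to Jsoup could look like this:

import java.time.Duration;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.By;
import org.openqa.selenium.NoSuchElementException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.FluentWait;
import org.openqa.selenium.support.ui.Wait;

public class FluentWaitSketch {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        driver.get("https://www.walmart.ca/en/electronics/N-32+103/page-13?sortBy=newest&orderBy=DESC");

        // Poll every 500 ms, give up after 30 s, and ignore "element not found yet"
        // while the JavaScript-rendered listing is still loading (values are assumptions).
        Wait<WebDriver> wait = new FluentWait<>(driver)
                .withTimeout(Duration.ofSeconds(30))
                .pollingEvery(Duration.ofMillis(500))
                .ignoring(NoSuchElementException.class);

        // ".title" is an assumed selector for the product tiles on this page.
        wait.until(ExpectedConditions.presenceOfElementLocated(By.cssSelector(".title")));

        // The page source now contains the JavaScript-rendered content.
        Document doc = Jsoup.parse(driver.getPageSource());
        System.out.println(doc.select(".title").size() + " matching elements found");

        driver.quit();
    }
}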

You also need the Chrome or Firefox browser installed on your machine, along with the matching ChromeDriver or FirefoxDriver (if the driver is not on your PATH, see the sketch after the list below):

  • Mac users with Homebrew installed: brew tap homebrew/cask && brew cask install chromedriver
  • Debian-based Linux distros: sudo apt-get install chromium-chromedriver
  • Windows users with Chocolatey installed: choco install chromedriver
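If the driver binary ends up somewhere that is not on your PATH, Selenium can be pointed at it explicitly before the driver is created. A minimal sketch, where the /usr/local/bin/chromedriver path is only an assumed example:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class DriverPathSketch {
    public static void main(String[] args) {
        // Assumed location; replace with wherever your chromedriver binary was installed.
        System.setProperty("webdriver.chrome.driver", "/usr/local/bin/chromedriver");

        WebDriver driver = new ChromeDriver();
        driver.get("https://www.walmart.ca/en/electronics/N-32+103/page-13?sortBy=newest&orderBy=DESC");
        System.out.println(driver.getTitle());
        driver.quit();
    }
}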

Now run the code below, which loads the page in a browser and prints the titles found in the search result.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import java.io.IOException;
import java.util.concurrent.TimeUnit;


public class WebScraperJsOnload {
    public static void main(String[] args) throws IOException {

        String queryString = "https://www.walmart.ca/en/electronics/N-32+103/page-13?sortBy=newest&orderBy=DESC";

        // Start a real Chrome instance and wait up to 20 seconds for elements to appear,
        // giving the JavaScript-rendered content time to load.
        WebDriver driver = new ChromeDriver();
        driver.manage().timeouts().implicitlyWait(20, TimeUnit.SECONDS);
        driver.get(queryString);

        // Hand the fully rendered page source to Jsoup for parsing.
        Document doc = Jsoup.parse(driver.getPageSource());

        // Print the HTML of every element matching the ".title" selector.
        Elements newsHeadlines = doc.select(".title");
        for (Element headline : newsHeadlines) {
            log("Log: %s", headline.html());
        }

        driver.quit();
    }

    private static void log(String msg, String... vals) {
        System.out.println(String.format(msg, vals));
    }
}

Maven dependencies for these imports:

<dependency>
    <!-- jsoup HTML parser library @ https://jsoup.org/ -->
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.141.59</version>
</dependency>

The output of this code looks like:

Log: <h2 class="thumb-header">Clearance Sale DR-E15 Fake Battery DC Coupler Battery Holder Mount Plate Power Supply Set Black</h2>
Log: <h2 class="thumb-header">Clearance Sale 2 in 1 4.7 inch Wireless U Disk Memory Expansion Phone Case for iPhone 6/6S/7 Red</h2>
Log: <h2 class="thumb-header">Clearance Sale USB Charging Power LED Selfie Ring Filling Light With Mobile Phone Clip Holder Black</h2>
Log: <h2 class="thumb-header">Clearance Sale Nillkin Protective Cover Plastic Hard Back Case Protect Mobile Phone Shell Red</h2>
Log: <h2 class="thumb-header">Clearance Sale Children'S Alarm Clock Creative Cute Cartoon Luminous Led Electronic Clock Pink</h2>
...
... 

For the complete project, download this GitHub repo.

It's kind of a Java thing: you send a different set of headers when making the request from Java. I tried that URL and it works fine when you attach an “Accept: */*” header.

You can't do it with your current implementation; reimplement it with HttpClient and add the missing header.
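A minimal sketch of that change, reusing the request from the question and assuming the “Accept: */*” header mentioned above is the only addition needed:

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SearchWalmWithHeader {
    public static void main(String[] args) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();

        // Same request as in the question, plus the "Accept: */*" header
        // that this answer reports as the missing piece.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.walmart.ca/en/electronics/N-32+103/page-13?sortBy=newest&orderBy=DESC"))
                .header("Accept", "*/*")
                .GET()
                .build();

        HttpResponse<String> response = client.send(request,
                HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body());
    }
}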
