
Html scraping Site Loads Wrong Jsoup Java

I'm trying to run a script to pull information from a site, but when I compare the actual website to the page my program prints, they are not the same.

Some examples of what is missing are the opening !doctype and the companies' info: http://www.manta.com/mb_43_E7_24/manufacturing/minnesota

I'm not sure if JavaScript is part of the issue. I tried turning it off and the site still worked, but I also noticed there is a lot of JavaScript in it. No login is required for the website. Maybe cookies? (I don't know much about cookies.)

String keyword = "http://www.manta.com/mb_43_E7_24/manufacturing/minnesota.php";
Document doc = Jsoup.connect(keyword)
        .referrer("http://www.google.com")
        .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
        .get();
System.out.println(doc.toString());

Above is the code I'm using.

Any ideas why it is failing to load the page the way my browser does? At one point I had it working, but I accidentally broke it.

And if this approach is not a reasonable way to pull information from a website, do you have any recommendations for an alternative?

I put some more work in and found that it works for http://www.manta.com/ but not if I add the suffix /mb_43_E7_24/manufacturing/minnesota.php

Is the suffix involved in any way?

Or might the site be temporarily banning me for making too many requests?

Jsoup does not execute or render JavaScript. HtmlUnit is a headless browser that can run the page's scripts and return the resulting content as a String. Selenium is useful as well; it has WebDrivers for Firefox, Chrome, IE, and HtmlUnit. I have used the code below to execute the JavaScript and return the HTML. I have found this useful for news sites I want to scrape that use JavaScript for their comments section.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

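public class Test {

  // Loads the URL in a headless HtmlUnit browser, lets its JavaScript run,
  // and hands the resulting HTML to Jsoup for parsing.
  public Document getDocument(String input) {
    WebDriver driver = new HtmlUnitDriver(true); // the param true turns on JavaScript
    try {
      driver.get(input);
      String output = driver.getPageSource();
      return Jsoup.parse(output);
    } finally {
      driver.quit(); // always release the browser, even if loading fails
    }
  }
}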

The above code should be enough to get started...
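Before switching to a headless browser, it may be worth confirming that JavaScript really is the cause. One rough way is to take the raw HTML your program received and check whether the text you expect appears anywhere outside of script blocks. The sketch below uses only the JDK. The class name StaticHtmlCheck, the method appearsInStaticMarkup, and the sample page are all made up for illustration, and the regex is a crude diagnostic, not a real HTML parser.

```java
import java.util.regex.Pattern;

public class StaticHtmlCheck {

    // Matches <script ...>...</script> blocks, case-insensitively and across line breaks.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script\\b[^>]*>.*?</script>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns true if needle occurs in the markup outside of any script block. */
    public static boolean appearsInStaticMarkup(String rawHtml, String needle) {
        String withoutScripts = SCRIPT_BLOCK.matcher(rawHtml).replaceAll("");
        return withoutScripts.contains(needle);
    }

    public static void main(String[] args) {
        // Hypothetical page: the company list is filled in by a script,
        // so it is not part of the markup a plain HTTP fetch returns.
        String rawHtml = "<html><body><h1>Manufacturing</h1>"
                + "<ul id=\"companies\"></ul>"
                + "<script>document.getElementById('companies').innerHTML"
                + " = '<li>Acme Corp</li>';</script>"
                + "</body></html>";

        System.out.println(appearsInStaticMarkup(rawHtml, "Manufacturing")); // static heading
        System.out.println(appearsInStaticMarkup(rawHtml, "Acme Corp"));     // script-injected text
    }
}
```

With the sample page this prints true and then false. In practice you would pass in the string your own fetch returned (for example the output of doc.toString() from the question). If a company name you can see in the browser is missing from it, script-generated content is a likely explanation, though a block or redirect by the site could look similar.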
