Scraping a website that may require login - jsoup

I am trying to scrape a website that may require authentication. When I run the following code, I get this error:

org.jsoup.UnsupportedMimeTypeException: Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml. Mimetype=application/json; charset=utf-8, URL=https://sso.mims.com/Account/Signin
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:547)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:493)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:205)
    at com.aiingov.proc.MedScraper.main(MedScraper.java:49)

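import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class MedScraper {

    public static void main(String[] args) throws IOException {

        String url = "https://sso.mims.com/Account/Signin";
        String userAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36";

        // Load the sign-in page first to pick up the session cookies.
        Connection.Response response = Jsoup.connect(url)
                .userAgent(userAgent)
                .method(Connection.Method.GET)
                .execute();

        // Submit the login form, sending back the cookies from the first request.
        response = Jsoup.connect(url)
                .cookies(response.cookies())
                .data("action", "login")
                .data("login", "xxxxx")
                .data("password", "xxxxx")
                .data("auto_login", "1")
                .userAgent(userAgent)
                .method(Connection.Method.POST)
                .followRedirects(true)
                .execute();

        // Request the target page with the cookies returned by the login request.
        Document document = Jsoup.connect("https://www.mims.com/india/drug/info/abacavir/abacavir?type=full&mtype=generic")
                .cookies(response.cookies())
                .userAgent(userAgent)
                .get();

        System.out.println(document);

        // Print the own text of every element in the body.
        Elements elements = document.body().select("*");

        for (Element element : elements) {
            System.out.println(element.ownText());
        }
    }
}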

Without the login code in place, I get the following output:

You will be redirected to your destination shortly.

How do I fix this?

Try using the ignoreContentType method.

    Jsoup.connect(url).ignoreContentType(true); // chain any other methods

Description of the method from the jsoup docs:

Ignore the document's Content-Type when parsing the response. By default this is false, an unrecognised content-type will cause an IOException to be thrown. (This is to prevent producing garbage by attempting to parse a JPEG binary image, for example.) Set to true to force a parse attempt regardless of content type.
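
To see why the sign-in URL trips this check, here is a small stand-alone sketch of the rule stated in the error message (text/*, application/xml, or application/xhtml+xml). It is only an approximation written for illustration and does not use jsoup. The class name ContentTypeCheck, the method parsedByDefault, and the regular expression are my own assumptions, not part of the library's API.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class ContentTypeCheck {

    // Assumption: XML-like types are matched with this pattern, which covers
    // application/xml and application/xhtml+xml from the error message.
    private static final Pattern XML_TYPE =
            Pattern.compile("(application|text)/\\w*\\+?xml.*");

    // Returns true if a response with this Content-Type would be parsed
    // without ignoreContentType(true), according to the rule in the error message.
    public static boolean parsedByDefault(String contentType) {
        if (contentType == null) {
            return true; // assumption: a missing header is not rejected
        }
        String type = contentType.trim().toLowerCase(Locale.ROOT);
        return type.startsWith("text/") || XML_TYPE.matcher(type).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "text/html; charset=utf-8",
            "application/xhtml+xml",
            "application/json; charset=utf-8",
            "image/jpeg"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + parsedByDefault(sample));
        }
    }
}
```

The sign-in endpoint in the question answers with application/json; charset=utf-8, which this rule rejects, so the request fails unless ignoreContentType(true) is set on that connection. Note that with the flag set, the response body is still JSON rather than HTML, so reading it via response.body() is probably more useful than parsing it into a Document.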
