
Logging into a PHP-based site and scraping data - problems

I am creating a third-party Java desktop application that needs to connect to a PHP-based site and log in to gather pertinent data. There is no accessible web service and no API, and every user will have their own secure login. The site uses dojo (if that matters), and I am using Java HttpClient to send the POST.

// Create a new HttpClient and the POST request
HttpClient httpclient = new DefaultHttpClient();
HttpPost httppost = new HttpPost("https://thewebsite.net/index/login"); // .php ?

// Initialize the response string
String nextpage = "";

try {
    // Add the name/value pairs for the form fields
    List<NameValuePair> nameValuePairs = new ArrayList<NameValuePair>(5);
    nameValuePairs.add(new BasicNameValuePair("", ""));
    nameValuePairs.add(new BasicNameValuePair("login", "USER"));
    nameValuePairs.add(new BasicNameValuePair("", ""));
    nameValuePairs.add(new BasicNameValuePair("pass", "PASSWORD"));
    nameValuePairs.add(new BasicNameValuePair("Submit", ""));

    httppost.setEntity(new UrlEncodedFormEntity(nameValuePairs));

    // Execute the POST and read the response body into nextpage
    HttpResponse response = httpclient.execute(httppost);
    nextpage = EntityUtils.toString(response.getEntity());

    System.out.println(nextpage);
    httppost.releaseConnection();
}
...

Now, the issue I'm having is that the response I get back is a JavaScript validation script for the user/pass fields, served through dojo.

<script type='text/javascript'> 
dojo.require("dojox.validate._base"); 

function validate_RepeatPassword(val, constraints)
{
    var isValid = false; 

    if(constraints)  { 
        var otherInput =  dijit.byId(constraints[0]); 
        if(otherInput) { 
            var otherValue = otherInput.value;
            isValid = (val == otherValue); 
        } 
    } 
    return isValid; 
}

</script>

I simply want to connect, parse an HTML response, and close the connection.

When I use Firebug, I see this for the POST request, but I can't seem to reproduce it:

Referer: https://thewebsite.net/index/login
Source: login=USER&pass=PASSWORD

When I instead use HttpPost with the parameters placed directly in the URL, without name/value pairs:

HttpPost httppost = new HttpPost("https://thewebsite.net/index/login?login=USER&pass=PASSWORD");

I get an error response stating that "the user and pass fields cannot be left blank."

My question is: is there a simpler, more direct way to log in that I'm missing, one that would let me get past the login successfully?

Thanks - I love the SO community; hope you can help.
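On the "fields cannot be left blank" error above: a likely explanation (an assumption, since the server code is not shown) is that the site reads `login` and `pass` from the POST body, which is where the Firebug capture shows them, and ignores anything in the query string. The sketch below uses only the JDK 11+ `java.net.http` API rather than the Apache HttpClient from the question, and only builds the request without sending it. The class and method names are made up for illustration.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the original question or any library.
class FormPostSketch {

    // Encode fields as application/x-www-form-urlencoded (name=value pairs joined by '&').
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Build (but do not send) a POST whose fields travel in the body, not in the URL.
    static HttpRequest buildLoginRequest(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("login", "USER");     // placeholder credentials from the question
        fields.put("pass", "PASSWORD");

        HttpRequest request = buildLoginRequest("https://thewebsite.net/index/login", fields);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(encodeForm(fields));
    }
}
```

The URL stays free of credentials and the encoded string is what ends up in the request body, which should match the "Source" line Firebug reports.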

I think the best library for doing this is jsoup.

Connection.Response res =
    Jsoup.connect("https://thewebsite.net/index/login?login=USER&pass=PASSWORD")
        .method(Method.POST)
        .execute();

After this you also need to handle verification. Read the cookies, request parameters, and header parameters, and this will work.

I didn't end up using your exact code (with the post parameters), but JSoup was the fix.

Here's what I used:

res = Jsoup.connect("https://thewebsite.net/index/login")
    .data("login", User).data("pass", Pass)
    .userAgent("Chrome").method(Method.POST).execute();

// Then I grabbed the cookie and sent the next POST for the data

Document t = res.parse(); // for later use
SessionID = res.cookie("UNIQUE_NAME");

// The JSON
Connection.Response driverx = Jsoup.connect("https://thewebsite.net/datarequest/data")
    .cookie("UNIQUE_NAME", SessionID)
    .userAgent("Chrome").method(Method.POST).execute();
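For anyone who would rather not add a dependency, the same "log in, keep the session cookie, reuse it" idea can be sketched with the JDK alone. This is not the jsoup code above but a swapped-in alternative using `java.net.CookieManager`, which an `HttpClient` can use to store `Set-Cookie` values and replay them on later requests. The class name, helper names, and the `abc123` cookie value are made up, and the login response is simulated rather than fetched from the real site.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.HttpCookie;
import java.net.URI;
import java.net.http.HttpClient;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of jsoup or the original post.
class SessionCookieSketch {

    // A client that stores cookies from responses and sends them on later requests.
    static HttpClient newClient(CookieManager cookies) {
        return HttpClient.newBuilder().cookieHandler(cookies).build();
    }

    // Feed one Set-Cookie header into the jar, as if it came from a response to `uri`.
    static void remember(CookieManager cookies, URI uri, String setCookie) {
        try {
            cookies.put(uri, Map.of("Set-Cookie", List.of(setCookie)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Look up a stored cookie value by name, or null if it was never set.
    static String cookieValue(CookieManager cookies, String name) {
        for (HttpCookie c : cookies.getCookieStore().getCookies()) {
            if (c.getName().equals(name)) {
                return c.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        CookieManager cookies = new CookieManager();

        // Simulated login response; a real run would call newClient(cookies).send(...) instead.
        URI login = URI.create("https://thewebsite.net/index/login");
        remember(cookies, login, "UNIQUE_NAME=abc123; Path=/");

        System.out.println(cookieValue(cookies, "UNIQUE_NAME"));
    }
}
```

With a client built by `newClient`, the data request to `/datarequest/data` would carry the stored cookie automatically, so there is no need to copy `UNIQUE_NAME` by hand as in the jsoup version.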
