
Problem sending a POST request with parameters from a Java app

There's a web page with a search engine:

http://www.nukat.edu.pl/cgi-bin/gw_48_1_12/chameleon?sessionid=2010010122520520752&skin=default&lng=pl&inst=consortium&search=KEYWORD&function=SEARCHSCR&SourceScreen=NOFUNC&elementcount=1&pos=1&submit=TabData

I want to use its search engine from a Java application.

Currently I'm trying to send a very simple request: only one field filled in, and no logical operators.

This is my code:

try {
    URL url = new URL( nukatSearchUrl );
    URLConnection urlConn = url.openConnection();
    urlConn.setDoInput( true );
    urlConn.setDoOutput( true );
    urlConn.setUseCaches( false );
    urlConn.setRequestProperty( "Content-Type", "application/x-www-form-urlencoded" );
    // Specify the charset explicitly so the writer matches the URL-encoded body
    BufferedWriter out = new BufferedWriter( new OutputStreamWriter( urlConn.getOutputStream(), "UTF-8" ) );
    String content = "t1=" + URLEncoder.encode( "Duma Key", "UTF-8" );
    out.write( content );
    out.flush();
    out.close();
    BufferedReader in = new BufferedReader( new InputStreamReader( urlConn.getInputStream(), "UTF-8" ) );

    String rcv = null;
    while ( ( rcv = in.readLine() ) != null ) {
        System.out.println( rcv );
    }
    in.close();
} catch ( Exception ex ) {
    throw new SearchEngineException( "NukatSearchEngine.search() : " + ex.getMessage() );
}

Unfortunately, what I keep getting is the main site, which looks like this:

<cant post the link to the main site :/>

Not the search results I'm expecting.

What could be wrong here?

The URL may be wrong, or your request is likely incomplete. You need to check the HTML source (right-click the page > View Source), use the same URL as defined in the <form action>, and gather all request parameters (including those from hidden input fields and the button which you intend to "press"!) for use in your query string.
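A minimal sketch of assembling such a body from every gathered form field. The field names here are illustrative placeholders; the real names must be copied from the site's <form> source, including its hidden inputs:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormBody {

    // Builds an application/x-www-form-urlencoded body from a parameter map.
    static String encode(Map<String, String> params) throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> params = new LinkedHashMap<>();
        // Hypothetical field names; read the actual ones (including hidden
        // inputs and the submit button) out of the page's HTML source.
        params.put("t1", "Duma Key");
        params.put("function", "SEARCHSCR");
        System.out.println(encode(params)); // t1=Duma+Key&function=SEARCHSCR
    }
}
```

The resulting string is what you write to the connection's output stream in place of the single `t1=...` pair.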

That said, doing so is in most cases a policy violation and may result in your IP getting blacklisted. Please check their robots.txt and their "Terms of use", if any; I don't understand Polish. Their robots.txt at least says that everyone is disallowed from accessing the entire website programmatically. Use it at your own risk. You've been warned. Better to contact them and ask whether they have a public webservice, and then use that instead.

You can always spoof the user-agent request header with a real-looking string extracted from a real web browser to minimize the risk of getting recognized as a bot, as pointed out by Bozho here, but you can still get caught based on visitor patterns/statistics.

I wouldn't go any further with this after reading BalusC's answer. Here are, however, a few pointers, if you don't worry about being blacklisted:

  • Set the User-Agent header to pretend to be a browser. For example:

     urlConn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB6"); 
  • You can simulate a human user in Firefox, using Selenium WebDriver.
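A real browser sends more identifying headers than just User-Agent, so a sketch like the following collects a browser-like set in one place before applying them to the connection. The header values are examples copied from a Firefox session of that era, not required values:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BrowserHeaders {

    // Returns request headers that mimic a desktop Firefox browser.
    static Map<String, String> firefoxHeaders() {
        Map<String, String> h = new LinkedHashMap<>();
        h.put("User-Agent",
              "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.6) "
              + "Gecko/20091201 Firefox/3.5.6 GTB6");
        h.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        h.put("Accept-Language", "en-gb,en;q=0.5");
        return h;
    }

    public static void main(String[] args) {
        Map<String, String> headers = firefoxHeaders();
        // Apply them all before connecting, e.g.:
        // headers.forEach(urlConn::setRequestProperty);
        System.out.println(headers.get("User-Agent"));
    }
}
```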

An easy way to see all the activity that you need to replicate is the Live HTTP Headers Firefox extension. To see all form elements on the page, Firebug is useful. Finally, I often use a fake server that I control to see what the browser is sending, and compare it to my application. I rolled my own: just a small Java server that prints out everything sent to it - inverse telnet, if you will.

Another note is that some sites deny access based on the User-Agent, i.e. you might need to get your application to pretend it's Firefox. This is very bad practice, and a little dishonest. As BalusC mentioned, check their usage policy and robots.txt! I would also recommend asking permission if you intend to distribute your application.

Finally, I happen to be working on something similar, and you might find the following code useful (it writes a mapping of key -> lists of values in the correct POST format):

        StringBuilder builder = new StringBuilder();
        try {
            boolean first = true; // must start true, or the body begins with a stray '&'
            for (Entry<String, List<String>> entry : data.entrySet()) {
                for (String value : entry.getValue()) {
                    if (first) {
                        first = false;
                    } else {
                        builder.append("&");
                    }
                    builder.append(URLEncoder.encode(entry.getKey(), "UTF-8")
                            + "=" + URLEncoder.encode(value, "UTF-8"));
                }
            }
        } catch (UnsupportedEncodingException e1) {
            return false;
        }
        conn.setDoOutput(true);
        try {
            // getOutputStream() implicitly connects; no separate connect() needed
            OutputStreamWriter wr = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
            wr.write(builder.toString());
            wr.flush();
        } catch (IOException e) {
            return false;
        }
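Since the snippet above depends on a surrounding method (`data`, `conn`, and the `return false` error path come from its enclosing class), here is a self-contained version of just the encoding step, with the multi-value case shown in the example:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiValueForm {

    // Encodes key -> list-of-values into application/x-www-form-urlencoded,
    // repeating the key once per value (as HTML forms do for multi-selects).
    static String encode(Map<String, List<String>> data) throws UnsupportedEncodingException {
        StringBuilder builder = new StringBuilder();
        boolean first = true;
        for (Map.Entry<String, List<String>> entry : data.entrySet()) {
            for (String value : entry.getValue()) {
                if (first) {
                    first = false;
                } else {
                    builder.append('&');
                }
                builder.append(URLEncoder.encode(entry.getKey(), "UTF-8"))
                       .append('=')
                       .append(URLEncoder.encode(value, "UTF-8"));
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, List<String>> data = new LinkedHashMap<>();
        data.put("t1", Arrays.asList("Duma Key"));
        data.put("tag", Arrays.asList("a", "b"));
        System.out.println(encode(data)); // t1=Duma+Key&tag=a&tag=b
    }
}
```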

As well as the user-agent, it could also be using cookies to check that the search is being sent from the search page.
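If that is the case, one stdlib-only option is to install a default cookie handler; a sketch under that assumption, where HttpURLConnection then stores cookies from responses and replays them on later requests, so a session established by first GETting the search page carries over to the POST:

```java
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;

public class CookieSetup {
    public static void main(String[] args) {
        // A JVM-wide cookie jar: subsequent HttpURLConnection requests
        // automatically record and resend cookies (e.g. a session id
        // handed out when the search page is first loaded).
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        // ... now GET the search page, then POST the form as before ...
        System.out.println(CookieHandler.getDefault() == manager);
    }
}
```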

HttpClient is good for automating form submission, including handling any cookies and pretending to be a browser.
