
How to follow a redirect with urllib?

I'm writing a script in Python 3 that accesses a page like:

example.com/daora/zz.asp?x=qqrzzt

using urllib.request.urlopen("example.com/daora/zz.asp?x=qqrzzt"), but this code just gives me the same page (example.com/daora/zz.asp?x=qqrzzt), whereas in the browser I get redirected to a page like:

example.com/egg.aspx

What can I do to retrieve

example.com/egg.aspx

and not the

example.com/daora/zz.asp?x=qqrzzt

I think this is the relevant code; it is the source of "example.com/daora/zz.asp?x=qqrzzt":

<head>

<script language="JavaScript">

<!--
    function Submit()

    {
        document.formzz.submit();
    }
-->
</script>

</head>

<body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" onLoad="javascript:Submit();">

<form name="formZZ" method="post" action="http://example.com/egg.aspx">

<input type="hidden" name="token" value="UFASGFJKASGDJFGAJS">

</form>

urllib.request follows redirects automatically; you don't need to do anything.
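As a minimal sketch of that behaviour: the response object returned by urlopen records the URL that was finally served, so comparing it with the requested URL shows whether an HTTP redirect was followed. The helper name resolve_final_url is made up for illustration. Note that urlopen needs a full URL including the scheme (for example "http://"), which the call in the question omits.

```python
from urllib.request import urlopen


def resolve_final_url(url):
    """Open url, following any HTTP redirects, and return the URL finally served."""
    # urlopen handles 301/302/303/307 responses itself; response.url is the
    # address of the last response in the chain, not the one requested.
    with urlopen(url) as response:
        return response.url
```

If the returned value equals the URL you passed in, the server sent no HTTP redirect, which is what happens with the page in the question.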

The problem here is that there is no redirect to follow. The web page uses JavaScript to submit a form as soon as it has loaded, which only looks like a redirect in the browser. urllib just fetches the page; it doesn't implement a browser DOM or run JavaScript code.

Depending on how general you need your script to be, the simplest solution may be something hacky. For example, if you're just trying to spider 500 pages that all have a similar structure but different details, just find the action of the first form and navigate to that.
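A rough sketch of that hacky approach, using only the standard library: parse the first form on the fetched page, collect its named input fields (such as the hidden token), and build the request the browser would have sent. The names FirstFormParser, build_form_request, and fetch_following_form are made up for illustration, and the sketch assumes the page is as simple as the one in the question (plain input elements only, no select or textarea fields, nothing computed by the script).

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin, urlsplit, urlunsplit
from urllib.request import Request, urlopen


class FirstFormParser(HTMLParser):
    """Record the action, method, and named <input> fields of the first <form>."""

    def __init__(self):
        super().__init__()
        self.found = False
        self.action = None
        self.method = "get"
        self.fields = {}
        self._inside = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self.found:
            self.found = True
            self._inside = True
            self.action = attrs.get("action")
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._inside and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._inside = False


def build_form_request(page_url, html):
    """Build the Request that submitting the first form in html would send."""
    parser = FirstFormParser()
    parser.feed(html)
    if not parser.found:
        raise ValueError("no <form> found in page")
    # A missing or relative action is resolved against the page's own URL.
    target = urljoin(page_url, parser.action or "")
    data = urlencode(parser.fields)
    if parser.method == "post":
        return Request(target, data=data.encode("ascii"), method="POST")
    # For GET forms the fields replace the query string of the action URL.
    return Request(urlunsplit(urlsplit(target)._replace(query=data)))


def fetch_following_form(url):
    """Fetch url, submit its first form, and return (final_url, body_bytes)."""
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, "replace")
        page_url = resp.url
    with urlopen(build_form_request(page_url, html)) as resp:
        return resp.url, resp.read()
```

For the page in the question, fetch_following_form("http://example.com/daora/zz.asp?x=qqrzzt") would POST the token to http://example.com/egg.aspx. Whether the server accepts that request without cookies or other browser headers is a separate question.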

Also, if fetching the pages and processing them are two distinct steps, you may want to write the fetcher as a very simple JavaScript/Greasemonkey script (running in the browser, so it already has a working DOM implementation, etc.) and a separate, fancier processing script in Python (which just operates on the finally-fetched/generated HTML pages).

If you need to be fully general, the simplest solution is probably to use the Selenium browser automation framework. (Or, maybe, PyWin32 or PyObjC to automate IE or WebKit directly.)

If you want the best possible solution, and have infinite resources… write your own implementation of the DOM and hook up your favorite JavaScript interpreter (probably SpiderMonkey or V8). That's only about two-thirds as much work as writing a new browser. (And you may be able to find pieces that get you 80% of the way there. For example, if you're willing to use Jython instead of CPython as your Python interpreter, HtmlUnit is pretty slick.)
