aiohttp: client.get() returns html tag rather than file

I'm trying to download bounding box files (stored as gzipped tar archives) from image-net.org. When I print the result of resp.read(), I get the HTML b'<meta http-equiv="refresh" content="0;url=/downloads/bbox/bbox/[wnid].tar.gz" />\n' rather than a stream of bytes representing the archive, where [wnid] is a particular WordNet ID. This leads to the error tarfile.ReadError: file could not be opened successfully. Any thoughts on what exactly the issue is and how to fix it? Code is below (images is a pandas DataFrame).

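import asyncio
import io
import tarfile
from concurrent.futures import ThreadPoolExecutor

import aiohttp
import pandas as pd

def get_boxes(images, nthreads=1000):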

    def parse_xml(xml):
        return 0

    def read_tar(data, wnid):
        bytes = io.BytesIO(data)
        tar = tarfile.open(fileobj=bytes)
        return 0

    async def fetch_boxes(wnid, client):
        url = ('http://www.image-net.org/api/download/imagenet.bbox.'
            'synset?wnid={}').format(wnid)
        async with client.get(url) as resp:
            res = await loop.run_in_executor(executor, read_tar,
                await resp.read(), wnid)
            return res

    async def main():
        async with aiohttp.ClientSession(loop=loop) as client:
            tasks = [asyncio.ensure_future(fetch_boxes(wnid, client))
                for wnid in images['wnid'].unique()]
            return await asyncio.gather(*tasks)

    loop = asyncio.get_event_loop()
    executor = ThreadPoolExecutor(nthreads)
    shapes, boxes = zip(*loop.run_until_complete(main()))
    return pd.concat(shapes, axis=0), pd.concat(boxes, axis=0)

EDIT: I understand now that this is a meta refresh used as a redirect. Would this be considered a "bug" in aiohttp?

No, this is expected behavior rather than a bug.

Some services redirect from a user-friendly web page to the archive file. Sometimes the redirect is implemented with an HTTP status code (301 or 302, see the example below), and sometimes with a page containing a meta refresh tag, as in your example.

HTTP/1.1 302 Found
Location: http://www.iana.org/domains/example/

aiohttp handles the first case automatically (allow_redirects=True is the default).
In the second case the library simply retrieves the HTML page as an ordinary response. It does not parse the body, so it cannot follow the meta refresh on its own.
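One way to cope with the second case is to inspect the body yourself and request the target of the meta refresh. The following is a minimal sketch, not part of aiohttp: meta_refresh_target and read_following_meta_refresh are hypothetical helper names, the regular expression only covers simple tags like the one in the question, and client is assumed to be an aiohttp.ClientSession (or anything with a compatible get()).

```python
import re
from typing import Optional
from urllib.parse import urljoin

# Matches simple tags such as
# <meta http-equiv="refresh" content="0;url=/some/path" />
# (assumption: http-equiv appears before content, as in the question).
_META_REFRESH = re.compile(
    rb'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
    rb'content=["\']?\s*\d+\s*;\s*url=([^"\'>\s]+)',
    re.IGNORECASE,
)


def meta_refresh_target(body: bytes, base_url: str) -> Optional[str]:
    """Return the absolute URL a meta refresh points to, or None."""
    match = _META_REFRESH.search(body)
    if match is None:
        return None
    # The target may be relative, so resolve it against the page URL.
    return urljoin(base_url, match.group(1).decode('ascii', 'replace'))


async def read_following_meta_refresh(client, url, max_hops=3):
    """Read url, following up to max_hops meta refresh redirects."""
    for _ in range(max_hops + 1):
        async with client.get(url) as resp:
            body = await resp.read()
        # Only look at the start of the body; archives are left untouched.
        target = meta_refresh_target(body[:2048], url)
        if target is None:
            return body
        url = target
    raise RuntimeError('too many meta refresh redirects')
```

In the question's fetch_boxes, the bytes returned by read_following_meta_refresh(client, url) could be passed to read_tar in place of the result of resp.read().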

I ran into the same problem when I tried to download with wget from the same URL you used: http://www.image-net.org/api/download/imagenet.bbox.synset?wnid=n01729322

But it works if you request the archive directly: www.image-net.org/downloads/bbox/bbox/n01729322.tar.gz

P.S. n01729322 is the wnid.
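If the direct path holds for other synsets too, the redirect page can be skipped by building the archive URL from the wnid. This is a small sketch under that assumption: bbox_url is a hypothetical helper, and the http scheme is a guess carried over from the API URL in the question.

```python
def bbox_url(wnid: str) -> str:
    """Build the direct archive URL for a wnid (path pattern assumed)."""
    return 'http://www.image-net.org/downloads/bbox/bbox/{}.tar.gz'.format(wnid)


print(bbox_url('n01729322'))
```

The returned string could replace the url built at the top of fetch_boxes.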
