
Can't stream files from Amazon s3 using requests

I'm trying to stream crawl data from Common Crawl, but Amazon S3 returns an error when I pass the stream=True parameter to requests.get. Here is an example:

import requests
resp = requests.get(url, stream=True)
print(resp.raw.read())

When I run this with a Common Crawl S3 HTTP URL, I get this response:

b'<?xml version="1.0" encoding="UTF-8"?>\n<Error><Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message>
<Key>crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz\n</Key>
<RequestId>3652F4DCFAE0F641</RequestId>
<HostId>Do0NlzMr6/wWKclt2G6qrGCmD5gZzdj5/GNTSGpHrAAu5+SIQeY15WC3VC6p/7/1g2q+t+7vllw=</HostId></Error>'

I am using warcio and need a streaming file object as input to the archive iterator, and I can't download the file all at once because of limited memory. What should I do?

PS. The url I request in the example is https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz
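For reference, the pattern I'm trying to end up with (adapted from the warcio README; it just prints each record's target URI) looks roughly like this:

import requests
from warcio.archiveiterator import ArchiveIterator

url = ('https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/'
       'segments/1516084886237.6/warc/'
       'CC-MAIN-20180116070444-20180116090444-00000.warc.gz')

# stream the response and hand the raw file-like object to warcio,
# so the .gz file is never loaded into memory all at once
resp = requests.get(url, stream=True)
for record in ArchiveIterator(resp.raw):
    print(record.rec_headers.get_header('WARC-Target-URI'))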

There is an error in your URL. Compare the key in the response you are getting:

<Key>crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz\n</Key>

to the one in the intended URL:

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz

For some reason you are adding unnecessary whitespace to the key, probably picked up while reading the URL from a file (readline(), or iterating over a file, gives you a trailing '\n' character on every line). Try calling .strip() on the line to remove the trailing newline.
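For example, if the key comes from a paths file, stripping each line before building the URL is enough (a rough sketch; 'warc.paths' is just an assumed local filename):

import requests

prefix = 'https://commoncrawl.s3.amazonaws.com/'

with open('warc.paths') as paths:        # assumed file with one key per line
    for line in paths:
        key = line.strip()               # strip() drops the trailing '\n' that broke the key
        resp = requests.get(prefix + key, stream=True)
        # resp.raw can now be passed to warcio's ArchiveIterator as before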


 