
Using python requests to mask as a browser and download a file

I'm trying to use the python requests library to download a file from this link: http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download

Clicking on this link gives you the file (nasdaq.csv) only when using a browser. I used the Firefox Network Monitor (Ctrl-Shift-Q) to retrieve all the headers that Firefox sends. With those headers I finally get a 200 response from the server, but still no file: the file that this script produces contains parts of the Nasdaq website, not the csv data. I looked at similar questions on this site, and nothing leads me to believe that this shouldn't be possible with the requests library.

Code:

import requests

url = "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"

# Fake Firefox headers
headers = {
    "Host": "www.nasdaq.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Cookie": "clientPrefs=||||lightg; userSymbolList=EOD+&DIT; userCookiePref=true; selectedsymbolindustry=EOD,; selectedsymboltype=EOD,EVERGREEN GLOBAL DIVIDEND OPPORTUNITY FUND COMMON SHARES OF BENEFICIAL INTEREST,NYSE; c_enabled$=true",
    "Connection": "keep-alive",
}

# Get the list
response = requests.get(url, headers, stream=True)
print(response.status_code)

# Write server response to file
with open("nasdaq.csv", 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk: # filter out keep-alive new chunks
            f.write(chunk)
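(A likely reason the fake headers above have no effect: the second positional parameter of `requests.get` is `params`, so `requests.get(url, headers, stream=True)` sends the dict as a query string, not as HTTP headers; the keyword form `headers=headers` is needed. A minimal offline sketch, using `requests.Request(...).prepare()` so nothing is actually fetched:)

```python
import requests

url = "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0"}

# Passed by keyword, the dict is attached as HTTP headers.
# prepare() builds the request without sending it, so this runs offline.
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.headers["User-Agent"])
```

In the live call this would be `requests.get(url, headers=headers, stream=True)`.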

You don't need to supply any headers:

import requests

url = "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"

response = requests.get(url, stream=True)
print(response.status_code)

# Write server response to file
with open("nasdaq.csv", 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk: # filter out keep-alive new chunks
            f.write(chunk)

You can also just write the content:

import requests

url = "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"

# Write server response to file
with open("nasdaq.csv", 'wb') as f:
    f.write(requests.get(url).content)

Or use urllib (note: urllib.urlretrieve is the Python 2 name; on Python 3 it is urllib.request.urlretrieve):

urllib.urlretrieve("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download","nasdaq.csv")

All methods give you the 3137-line csv file:

"Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear","Sector","Industry","Summary Quote",
"TFSC","1347 Capital Corp.","9.79","58230920","n/a","2014","Finance","Business Services","http://www.nasdaq.com/symbol/tfsc",
"TFSCR","1347 Capital Corp.","0.15","0","n/a","2014","Finance","Business Services","http://www.nasdaq.com/symbol/tfscr",
"TFSCU","1347 Capital Corp.","10","41800000","n/a","2014","Finance","Business Services","http://www.nasdaq.com/symbol/tfscu",
"TFSCW","1347 Capital Corp.","0.178","0","n/a","2014","Finance","Business Services","http://www.nasdaq.com/symbol/tfscw",
"PIH","1347 Property Insurance Holdings, Inc.","7.51","46441171.61","n/a","2014","Finance","Property-Casualty Insurers","http://www.nasdaq.com/symbol/pih",
"FLWS","1-800 FLOWERS.COM, Inc.","7.87","510463090.04","n/a","1999","Consumer Services","Other Specialty Stores","http://www.nasdaq.com/symbol/flws",
"FCTY","1st Century Bancshares, Inc","7.81","80612492.62","n/a","n/a","Finance","Major Banks","http://www.nasdaq.com/symbol/fcty",
"FCCY","1st Constitution Bancorp (NJ)","12.39","93508122.96","n/a","n/a","Finance","Savings Institutions","http://www.nasdaq.com/symbol/fccy",
"SRCE","1st Source Corporation","30.54","796548769.38","n/a","n/a","Finance","Major Banks","http://www.nasdaq.com/symbol/srce",
"VNET","21Vianet Group, Inc.","20.26","1035270865.78","51099253","2011","Technology","Computer Software: Programming, Data Processing","http://www.nasdaq.com/symbol/vnet",
...

If for some reason it does not work for you, then you might need to upgrade your version of requests.

You actually don't need those headers. You don't even need to save to a file.

import requests
import csv

url = "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"
response = requests.get(url)
# Use response.text (str) rather than response.content (bytes) so this
# works on Python 3; csv expects text
data = csv.DictReader(response.text.splitlines())
for row in data:
    print(row)

Sample output:

{'Sector': 'Technology', 'LastSale': '2.46', 'Name': 'Zynga Inc.', '': '', 'Summary Quote': 'http://www.nasdaq.com/symbol/znga', 'Symbol': 'ZNGA', 'Industry': 'EDP Services', 'MarketCap': '2295110123.7', 'IPOyear': '2011', 'ADR TSO': 'n/a'}

You can use csv.reader instead of DictReader if you like.
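(For illustration, a small offline sketch of the csv.reader variant, run against a trimmed two-line sample of the data shown above; the `sample` string here stands in for the real download.)

```python
import csv
import io

# A trimmed sample of the downloaded CSV, standing in for response.text
sample = '"Symbol","Name","LastSale"\n"TFSC","1347 Capital Corp.","9.79"\n'

# csv.reader yields each row as a plain list instead of a dict
rows = list(csv.reader(io.StringIO(sample)))
print(rows[1])  # → ['TFSC', '1347 Capital Corp.', '9.79']
```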

An alternative, and shorter, solution for this problem would be:

import urllib

downloadFile = urllib.URLopener()
downloadFile.retrieve("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download", "companylist.csv")

This code uses the URL library to create a URLopener object (downloadFile), then retrieves the data from the NASDAQ link and saves it as companylist.csv.

According to the Python documentation, if you want to send a custom User-Agent (such as the Firefox User-Agent), you can subclass URLopener and set its version attribute to the user agent you would like to use.
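(Such a subclass might look like the sketch below; FirefoxOpener is an illustrative name, and on Python 3 the class lives in urllib.request. Instantiating it raises a DeprecationWarning, per the note that follows.)

```python
import urllib.request

class FirefoxOpener(urllib.request.URLopener):
    # The `version` class attribute becomes the User-Agent header
    # sent with every request made through this opener
    version = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0"

opener = FirefoxOpener()
print(opener.version)
# opener.retrieve(url, "companylist.csv") would then download using this User-Agent
```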

Note: According to the Python documentation, urllib.URLopener() is deprecated as of Python v3.3. As such, it may eventually be removed from the standard library. However, as of Python v3.6 (dev), urllib.URLopener() is still supported as a legacy interface.
