
Using python requests to mask as a browser and download a file

I'm trying to use the python requests library to download a file from this link: http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download

Clicking on this link gives you the file (nasdaq.csv) only when using a browser. I used the Firefox Network Monitor (Ctrl-Shift-Q) to retrieve all the headers that Firefox sends. With those headers I finally get a 200 response from the server, but still no file: the file that this script produces contains parts of the Nasdaq website, not the csv data. I looked at similar questions on this site, and nothing leads me to believe that this shouldn't be possible with the requests library.

Code:

import requests

url = "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"

# Fake Firefox headers
headers = {
    "Host": "www.nasdaq.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Cookie": "clientPrefs=||||lightg; userSymbolList=EOD+&DIT; userCookiePref=true; selectedsymbolindustry=EOD,; selectedsymboltype=EOD,EVERGREEN GLOBAL DIVIDEND OPPORTUNITY FUND COMMON SHARES OF BENEFICIAL INTEREST,NYSE; c_enabled$=true",
    "Connection": "keep-alive",
}

# Get the list
response = requests.get(url, headers, stream=True)
print(response.status_code)

# Write server response to file
with open("nasdaq.csv", 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk: # filter out keep-alive new chunks
            f.write(chunk)
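(A likely reason the fake headers above have no effect: the second positional parameter of `requests.get` is `params`, so `requests.get(url, headers, stream=True)` sends the dict as a query string, not as HTTP headers; the keyword form `headers=headers` is needed. A minimal offline sketch, using `requests.Request(...).prepare()` so nothing is actually fetched:)

```python
import requests

url = "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0"}

# Passed by keyword, the dict is attached as HTTP headers.
# prepare() builds the request without sending it, so this runs offline.
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.headers["User-Agent"])
```

In the live call this would be `requests.get(url, headers=headers, stream=True)`.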

You don't need to supply any headers:

import requests

url = "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"

response = requests.get(url, stream=True)
print(response.status_code)

# Write server response to file
with open("nasdaq.csv", 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk: # filter out keep-alive new chunks
            f.write(chunk)

You can also just write the content:

import requests

url = "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"

# Write server response to file
with open("nasdaq.csv", 'wb') as f:
    f.write(requests.get(url).content)

Or use urllib (note: urllib.urlretrieve is the Python 2 name; on Python 3 it is urllib.request.urlretrieve):

urllib.urlretrieve("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download","nasdaq.csv")

All methods give you the 3137-line csv file:

"Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear","Sector","Industry","Summary Quote",
"TFSC","1347 Capital Corp.","9.79","58230920","n/a","2014","Finance","Business Services","http://www.nasdaq.com/symbol/tfsc",
"TFSCR","1347 Capital Corp.","0.15","0","n/a","2014","Finance","Business Services","http://www.nasdaq.com/symbol/tfscr",
"TFSCU","1347 Capital Corp.","10","41800000","n/a","2014","Finance","Business Services","http://www.nasdaq.com/symbol/tfscu",
"TFSCW","1347 Capital Corp.","0.178","0","n/a","2014","Finance","Business Services","http://www.nasdaq.com/symbol/tfscw",
"PIH","1347 Property Insurance Holdings, Inc.","7.51","46441171.61","n/a","2014","Finance","Property-Casualty Insurers","http://www.nasdaq.com/symbol/pih",
"FLWS","1-800 FLOWERS.COM, Inc.","7.87","510463090.04","n/a","1999","Consumer Services","Other Specialty Stores","http://www.nasdaq.com/symbol/flws",
"FCTY","1st Century Bancshares, Inc","7.81","80612492.62","n/a","n/a","Finance","Major Banks","http://www.nasdaq.com/symbol/fcty",
"FCCY","1st Constitution Bancorp (NJ)","12.39","93508122.96","n/a","n/a","Finance","Savings Institutions","http://www.nasdaq.com/symbol/fccy",
"SRCE","1st Source Corporation","30.54","796548769.38","n/a","n/a","Finance","Major Banks","http://www.nasdaq.com/symbol/srce",
"VNET","21Vianet Group, Inc.","20.26","1035270865.78","51099253","2011","Technology","Computer Software: Programming, Data Processing","http://www.nasdaq.com/symbol/vnet",
...

If for some reason it does not work for you, then you might need to upgrade your version of requests.

You actually don't need those headers. You don't even need to save to a file.

import requests
import csv

url = "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"
response = requests.get(url)
# Use response.text (str) rather than response.content (bytes) so this
# works on Python 3; csv expects text
data = csv.DictReader(response.text.splitlines())
for row in data:
    print(row)

Sample output:

{'Sector': 'Technology', 'LastSale': '2.46', 'Name': 'Zynga Inc.', '': '', 'Summary Quote': 'http://www.nasdaq.com/symbol/znga', 'Symbol': 'ZNGA', 'Industry': 'EDP Services', 'MarketCap': '2295110123.7', 'IPOyear': '2011', 'ADR TSO': 'n/a'}

You can use csv.reader instead of DictReader if you like.
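(For illustration, a small offline sketch of the csv.reader variant, run against a trimmed two-line sample of the data shown above; the `sample` string here stands in for the real download.)

```python
import csv
import io

# A trimmed sample of the downloaded CSV, standing in for response.text
sample = '"Symbol","Name","LastSale"\n"TFSC","1347 Capital Corp.","9.79"\n'

# csv.reader yields each row as a plain list instead of a dict
rows = list(csv.reader(io.StringIO(sample)))
print(rows[1])  # → ['TFSC', '1347 Capital Corp.', '9.79']
```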

An alternative, and shorter, solution for this problem would be:

import urllib

downloadFile = urllib.URLopener()
downloadFile.retrieve("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download", "companylist.csv")

This code uses the URL library to create a URLopener object (downloadFile), then retrieves the data from the NASDAQ link and saves it as companylist.csv.

According to the Python documentation, if you want to send a custom User-Agent (such as the Firefox User-Agent), you can subclass URLopener and set its version attribute to the user agent you would like to use.
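(Such a subclass might look like the sketch below; FirefoxOpener is an illustrative name, and on Python 3 the class lives in urllib.request. Instantiating it raises a DeprecationWarning, per the note that follows.)

```python
import urllib.request

class FirefoxOpener(urllib.request.URLopener):
    # The `version` class attribute becomes the User-Agent header
    # sent with every request made through this opener
    version = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0"

opener = FirefoxOpener()
print(opener.version)
# opener.retrieve(url, "companylist.csv") would then download using this User-Agent
```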

Note: According to the Python documentation, urllib.URLopener() is deprecated as of Python v3.3. As such, it may eventually be removed from the standard library. However, as of Python v3.6 (dev), urllib.URLopener() is still supported as a legacy interface.
