
Ignore missing file while downloading with Python ftplib

I am trying to download a certain file (named 010010-99999-year.gz) from an FTP server. The same file, but for different years, resides in different FTP directories. For instance:

ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite/2000/010010-99999-1973.gz
ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite/2001/010010-99999-1974.gz
and so on. The picture illustrates one of the directories:

[screenshot of a directory listing]

The file is not located in all the directories (i.e., not for all years). In such a case I want the script to ignore the missing file, print "not available", and continue with the next directory (i.e., the next year). I could do this with an NLST listing, by first generating a list of files in the current FTP directory and then checking whether my file is on that list, but that is slow, and NOAA (the organization running the server) does not like file listing (source). Therefore I came up with this code:

from ftplib import FTP

def FtpDownloader2(url="ftp.ncdc.noaa.gov"):
    ftp = FTP(url)
    ftp.login()
    for year in range(1901, 2015):
        ftp.cwd("/pub/data/noaa/isd-lite")
        ftp.cwd(str(year))
        fullStationId = "010010-99999-%s.gz" % year
        try:
            file = open(fullStationId, "wb")
            ftp.retrbinary('RETR %s' % fullStationId, file.write)
            print("File is available")
            file.close()
        except:
            print("File not available")
    ftp.close()

This downloads the existing files (years 1973-2014) correctly, but it also generates empty files for years 1901-1972, even though the file is not on the FTP server for those years. Am I doing anything wrong in the use of try and except, or is it some other issue?

I took your code and modified it a little:

from ftplib import FTP, error_perm
import os

def FtpDownloader2(url="ftp.ncdc.noaa.gov"):
    ftp = FTP(url)
    ftp.login()
    for year in range(1901, 2015):
        remote_file = '/pub/data/noaa/isd-lite/{0}/010010-99999-{0}.gz'.format(year)
        local_file = os.path.basename(remote_file)
        try:
            with open(local_file, "wb") as file_handle:
                ftp.retrbinary('RETR %s' % remote_file, file_handle.write)
            print('OK', local_file)
        except error_perm:
            print('ERR', local_file)
            os.unlink(local_file)
    ftp.close()

Notes

  • The most dangerous thing you can do is write an except clause without a specific exception class. Such a construct swallows all errors, making the code hard to troubleshoot. To fix this, I catch the specific exception error_perm.
  • Once the exception has occurred, I know for sure that the local file is closed, because the with statement guarantees it.
  • I remove the local file if an error_perm exception occurred, a sign that the file is not available on the server.
  • I removed the code that changes directories: for each year, you call cwd twice, which slows down the process.
  • range(1901, 2015) does not include 2015. If you want it, you have to specify range(1901, 2016).
  • I improved the print statements to include the file names, making it easier to track which ones are available and which are not.

Update

This update answers your question about not creating empty local files (and then having to delete them). There are a couple of different ways:

  1. Query for the remote file's existence before downloading, and create the local file only when the remote file exists. The problem with this approach is that querying a remote file takes longer than creating and deleting a local one.
  2. Download into an in-memory bytes buffer (io.BytesIO, since retrbinary delivers bytes), and create a local file only when that buffer is not empty. The problem with this approach is that you write the same data twice: once to the buffer, and once from the buffer to the file.
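A minimal sketch of approach 2, assuming the server and paths from the question (the function name is mine, not from the original answer):

```python
import io
from ftplib import error_perm

def download_if_present(ftp, remote_file, local_file):
    """Download remote_file into memory; write local_file only if data arrived.

    Returns True when a non-empty local file was written, False otherwise.
    """
    buf = io.BytesIO()
    try:
        # retrbinary delivers bytes, so a bytes buffer is needed here
        ftp.retrbinary('RETR %s' % remote_file, buf.write)
    except error_perm:
        return False          # server reports the file as unavailable
    data = buf.getvalue()
    if not data:              # nothing arrived: treat as missing
        return False
    with open(local_file, 'wb') as fh:
        fh.write(data)        # second write: the cost noted above
    return True
```

No local file is ever created for the missing years, at the price of holding each download in memory and writing it twice.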

I think the problem is within your try/except block, where you open a file handle for a new file before checking whether the remote file exists:

try:              
    file=open(fullStationId,"wb")
    ftp.retrbinary('RETR %s' % fullStationId, file.write)
    print("File is available")
    file.close()
except: 
    print("File not available")

Instead, add an additional statement in the except block to close the file handle, and another statement to remove the file if it is empty.
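A sketch of that fix, wrapping the question's snippet in a function and catching the specific error_perm (the function name and the use of error_perm are my additions, not part of the original snippet):

```python
import os
from ftplib import error_perm

def fetch(ftp, full_station_id):
    # Open the local file first, as in the question ...
    fh = open(full_station_id, "wb")
    try:
        ftp.retrbinary('RETR %s' % full_station_id, fh.write)
        print("File is available")
    except error_perm:
        print("File not available")
        # ... but on failure close the handle and delete the empty file
        fh.close()
        if os.path.getsize(full_station_id) == 0:
            os.remove(full_station_id)
    else:
        fh.close()
```

The empty files for 1901-1972 are now cleaned up as soon as the server rejects the RETR command.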

Another possibility is to open the file for writing locally only if the file exists and has a non-zero size on the server, using ftp.size.
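That idea can be sketched as a small helper (the helper name is mine; note that some FTP servers reject the SIZE command in ASCII mode, hence the explicit switch to binary mode first):

```python
from ftplib import error_perm

def remote_size(ftp, remote_file):
    """Return the remote file's size in bytes, or None if the server
    reports it as unavailable."""
    try:
        ftp.voidcmd('TYPE I')        # SIZE often requires binary mode
        return ftp.size(remote_file)
    except error_perm:
        return None
```

Open the local file only when `remote_size(ftp, remote_file)` returns a value greater than zero; this costs one extra round-trip per year but never touches the local disk for missing files.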
