zipfile.ZipFile: open a specific file within a Zip folder

I'm new to Python and I'm trying to build a program that downloads and extracts zip files from various websites. I've pasted the two programs I've written to do this. The first is a "child" program named "urls", which I import into the second program. I'm trying to iterate through each of the URLs, iterate through each data file within each URL's archive, check whether any entry of the "keywords" list is part of the file name, and, if so, download and extract that file. I'm stuck on the part where I need to loop through the list of "keywords" to check against the file names I want to download. Would you be able to help? I appreciate any suggestions or guidance. Thank you. Andy

**Program #1 called "urls":**

urls = [
    "https://www.dentoncad.com/content/data-extracts/1-appraisal-data-extracts/1-2019/1-preliminary/2019-preliminary" \
    "-protax-data.zip",
    "http://www.dallascad.org/ViewPDFs.aspx?type=3&id=//DCAD.ORG\WEB\WEBDATA\WEBFORMS\DATA%20PRODUCTS\DCAD2020_" \
    "CURRENT.ZIP"
]

keywords = [
    "APPRAISAL_ENTITY_INFO",
    "SalesExport",
    "account_info",
    "account_apprl_year",
    "res_detail",
    "applied_std_exempt",
    "land",
    "acct_exempt_value"
]

**Program #2 (primary program):**

import requests
import zipfile
import os
import urls


def main():
    print_header()
    dwnld_zfiles_from_web()


def print_header():
    print('---------------------------------------------------------------------')
    print('               DOWNLOAD ZIP FILES FROM THE WEB APP')
    print('---------------------------------------------------------------------')
    print()


def dwnld_zfiles_from_web():
    file_num = 0

    dest_folder = "C:/Users/agbpi/OneDrive/Desktop/test//"

    # loop through each url within the url list, assigning it a unique file number each iteration
    for url in urls.urls:
        file_num = file_num + 1
        url_resp = requests.get(url, allow_redirects=True, timeout=5)

        if url_resp.status_code == 200:
            saved_archive = os.path.basename(url)
            with open(saved_archive, 'wb') as f:
                f.write(url_resp.content)

                # for match in urls.keywords:

                print("Extracting...", url_resp.url)

                with zipfile.ZipFile('file{0}'.format(str(file_num)), "r") as z:
                    zip_files = z.namelist()
                    # print(zip_files)
                    for content in zip_files:
                        while urls.keywords in content:
                            z.extract(path=dest_folder, member=content)
                    # while urls.keywords in zip_files:
                    #     for content in zip_files:
                    #         z.extract(path=dest_folder, member=content)

                print("Finished!")


if __name__ == '__main__':
    main()

Okay, updated answer based on the updated question.

Your code is fine until this part:

                with zipfile.ZipFile('file{0}'.format(str(file_num)), "r") as z:
                    zip_files = z.namelist()
                    # print(zip_files)
                    for content in zip_files:
                        while urls.keywords in content:
                            z.extract(path=dest_folder, member=content)

Issue 1

You already have the zip file name in `saved_archive`, but you try to open something else as a zip file. Why `'file{0}'.format(str(file_num))`? No file with that name was ever written, so you should simply use `with zipfile.ZipFile(saved_archive, "r") as z:`
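A minimal sketch of that fix, under a few assumptions: `save_and_list` is a hypothetical helper (not part of the posted program), and `payload` stands in for `url_resp.content`. Note that the posted program opens the `ZipFile` inside the `with open(saved_archive, 'wb')` block, while the file is still open for writing; the sketch closes the file first and then reopens the same path.

```python
import io
import os
import tempfile
import zipfile


def save_and_list(payload, saved_archive):
    """Write downloaded bytes to saved_archive, then list the archive's members."""
    # Leave the 'with' block first so the file is flushed and closed.
    with open(saved_archive, "wb") as f:
        f.write(payload)

    # Reopen the path that was just written, not an unrelated name.
    with zipfile.ZipFile(saved_archive, "r") as z:
        return z.namelist()


if __name__ == "__main__":
    # Build a tiny zip in memory to stand in for the downloaded content.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as z:
        z.writestr("2019_SalesExport.txt", "demo")
    with tempfile.TemporaryDirectory() as tmp:
        print(save_and_list(buffer.getvalue(), os.path.join(tmp, "demo.zip")))
```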

Issue 2

`while` resembles an `if` statement, but it does not work as a filter (which seems to be what you wanted). `while` checks whether its condition is truthy and, if so, executes the indented code, then checks the condition again, repeating until the first falsy evaluation; at that point execution moves on past the loop. So if successive evaluations of the condition yielded `[True, False, True]`, the first would run the indented code, the second would exit the loop, and the third would never be evaluated. Since nothing in your loop body changes the condition, a truthy condition would also repeat forever. But the condition itself is invalid, which leads to:
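To make the difference concrete, here is a small sketch. The helper names `run_while` and `filter_with_if` are made up for illustration and do not appear in the original program.

```python
def run_while(flags):
    """Walk through flags with 'while': stops at the first falsy value."""
    visited = []
    i = 0
    while i < len(flags) and flags[i]:
        visited.append(i)
        i += 1
    return visited


def filter_with_if(names, keyword):
    """Visit every name with 'for' and keep those that contain keyword."""
    kept = []
    for name in names:
        if keyword in name:  # checked once per name, never ends the loop
            kept.append(name)
    return kept


if __name__ == "__main__":
    print(run_while([True, False, True]))  # [0]: the third flag is never reached
    print(filter_with_if(["a_land.txt", "b.txt", "c_land.txt"], "land"))
```

The `for` plus `if` combination visits every item, which is the filtering behaviour the question is after.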

Issue 3

`urls.keywords` is a `list` and `content` is a `str`. Testing a list for membership in a string never makes sense. It is like `['apple', 'banana'] in 'b'`: a string can only contain substrings, so Python raises a `TypeError`. You could reverse the logic, but keep in mind that `'b' in ['apple', 'banana']` is `False`, while `'banana' in ['apple', 'banana']` is `True`.

This means that, in your case, the condition `'_SalesExport.txt' in urls.keywords` will be `False`. Why? Because `urls.keywords` is:

[
    "APPRAISAL_ENTITY_INFO",
    "SalesExport",
    "account_info",
    "account_apprl_year",
    "res_detail",
    "applied_std_exempt",
    "land",
    "acct_exempt_value"
]

and `SalesExport` is not `_SalesExport.txt`.

To check for a partial match, you need to test each list item (a string) against the file name (also a string). `"SalesExport" in "_SalesExport.txt"` is `True`, but `"SalesExport" in ["_SalesExport.txt"]` is `False`, because `SalesExport` is not a member of the list.
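The three behaviours of `in` discussed so far can be compared side by side. This is only an illustrative sketch; `compare_checks` is a hypothetical name, and the two-item keyword list is a shortened stand-in for `urls.keywords`.

```python
def compare_checks(filename, keywords):
    """Return (exact_member, partial_match, list_in_str_error) for one file name."""
    # List membership: true only if filename equals a whole list entry.
    exact_member = filename in keywords

    # Substring test, one keyword at a time: this is the partial match.
    partial_match = any(kw in filename for kw in keywords)

    # A list on the left of a str is what the posted condition does.
    try:
        keywords in filename
        list_in_str_error = None
    except TypeError as exc:
        list_in_str_error = type(exc).__name__

    return exact_member, partial_match, list_in_str_error


if __name__ == "__main__":
    print(compare_checks("_SalesExport.txt", ["SalesExport", "land"]))
```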

There are three things you could do:

  1. update your `keywords` list to exact file names so that `content in urls.keywords` works (this means that if there is a directory structure in the zip file, you must include the directory path too)
                    for content in zip_files:
                        if content in urls.keywords:
                            z.extract(path=dest_folder, member=content)
  2. nest a `for` loop inside the `for` loop
                    for content in zip_files:
                        for kw in urls.keywords:
                            if kw in content:
                                z.extract(path=dest_folder, member=content)
                                break  # one matching keyword is enough; do not extract twice
  3. use a list comprehension with `any()`
                    matches = [x for x in zip_files if any(kw in x for kw in urls.keywords)]
                    for m in matches:
                        z.extract(path=dest_folder, member=m)
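Putting the partial-match approach together, the extraction step could be sketched roughly as follows. `extract_matching` is a hypothetical helper, the demo archive is built in memory rather than downloaded, and the sketch assumes case-sensitive substring matching is what is wanted.

```python
import io
import os
import tempfile
import zipfile


def extract_matching(archive, keywords, dest_folder):
    """Extract members whose name contains any keyword; return their names."""
    extracted = []
    with zipfile.ZipFile(archive, "r") as z:
        for member in z.namelist():
            # any() stops at the first keyword found, so each member is extracted once.
            if any(kw in member for kw in keywords):
                z.extract(member, path=dest_folder)
                extracted.append(member)
    return extracted


if __name__ == "__main__":
    # Stand-in archive; a real run would pass the saved archive's path instead.
    demo = io.BytesIO()
    with zipfile.ZipFile(demo, "w") as z:
        z.writestr("2019_SalesExport.txt", "sales")
        z.writestr("readme.txt", "ignore me")
        z.writestr("data/land.csv", "parcels")
    with tempfile.TemporaryDirectory() as dest:
        print(extract_matching(demo, ["SalesExport", "land"], dest))
```

Returning the list of extracted names makes it easy to log or verify what was pulled out of each archive.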


Finally, a recommendation:

Timeouts

Be careful with

url_resp = requests.get(url, allow_redirects=True, timeout=5)

A single `timeout` value controls two things: the connect timeout and the read timeout. Since the response may take longer than 5 seconds, you may want a longer read timeout. You can specify the timeout as a tuple: (connect timeout, read timeout). So a better parameter would be:

url_resp = requests.get(url, allow_redirects=True, timeout=(5, 120))
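One possible way to wrap this call is sketched below. `fetch_archive` and its `session` parameter are hypothetical, not part of the original program: `session` accepts anything with a requests-style `get` method (for example a `requests.Session`), and the `requests` module itself is used when none is given. The default timeout values simply mirror the ones suggested above.

```python
def fetch_archive(url, connect_timeout=5, read_timeout=120, session=None):
    """Download url and return the response body as bytes."""
    if session is None:
        # Third-party dependency, imported lazily so a session can be injected instead.
        import requests
        session = requests

    resp = session.get(
        url,
        allow_redirects=True,
        timeout=(connect_timeout, read_timeout),  # (connect timeout, read timeout)
    )
    # Raise an exception for 4xx/5xx answers instead of silently saving an error page.
    resp.raise_for_status()
    return resp.content
```

If either limit is exceeded, `requests` raises a `requests.exceptions.Timeout` subclass (`ConnectTimeout` or `ReadTimeout`), which the caller can catch to skip that URL and continue with the next one.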
