Loop to extract URLs from multiple text files

Question

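I am trying to use a for loop to extract a list of URLs from multiple files, but it only extracts the list of URLs from the first file, repeated 10 times. I am not sure what I am doing wrong. I am also an absolute beginner at this, so I assume there are better ways to achieve what I want, but this is what I have so far.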

type_urls = []
y = 0

for files in cwk_dir:
    while y < 10:
        open('./cwkfiles/cwkfile{}.crawler.idx'.format(y))
        lines = r.text.splitlines()
        header_loc = 7
        name_loc = lines[header_loc].find('Company Name')
        type_loc = lines[header_loc].find('Form Type')
        cik_loc = lines[header_loc].find('CIK')
        filedate_loc = lines[header_loc].find('Date Filed')
        url_loc = lines[header_loc].find('URL')
        firstdata_loc = 9
        for line in lines[firstdata_loc:]:
            company_name = line[:type_loc].strip()
            form_type = line[type_loc:cik_loc].strip()
            cik = line[cik_loc:filedate_loc].strip()
            file_date = line[filedate_loc:url_loc].strip()
            page_url = line[url_loc:].strip()
            typeandurl = (form_type, page_url)
            type_urls.append(typeandurl)
        y = y + 1

Answer 1

Here is a more Pythonic way to do it, using pathlib and Python 3:

from pathlib import Path

cwk_dir = Path('./cwkfiles')

type_urls = []
header_loc = 7
firstdata_loc = 9

for cwkfile in cwk_dir.glob('cwkfile*.crawler.idx'):
    with cwkfile.open() as f:
        lines = f.readlines()
        name_loc = lines[header_loc].find('Company Name')
        type_loc = lines[header_loc].find('Form Type')
        cik_loc = lines[header_loc].find('CIK')
        filedate_loc = lines[header_loc].find('Date Filed')
        url_loc = lines[header_loc].find('URL')
        for line in lines[firstdata_loc:]:
            company_name = line[:type_loc].strip()
            form_type = line[type_loc:cik_loc].strip()
            cik = line[cik_loc:filedate_loc].strip()
            file_date = line[filedate_loc:url_loc].strip()
            page_url = line[url_loc:].strip()
            type_urls.append((form_type, page_url))

If you want to test on a small batch of files, replace cwk_dir.glob('cwkfile*.crawler.idx') with cwk_dir.glob('cwkfile[0-9].crawler.idx'). If the files are numbered sequentially starting from 0, this gives you the first ten files.
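To see why the [0-9] pattern limits the match to ten files, here is a minimal sketch. It assumes the files are named cwkfile0 through cwkfile11 and creates empty stand-ins in a temporary directory, so the file count and contents are made up for illustration only.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory holding 12 empty, hypothetically named index files.
with tempfile.TemporaryDirectory() as tmp:
    cwk_dir = Path(tmp)
    for i in range(12):
        (cwk_dir / 'cwkfile{}.crawler.idx'.format(i)).touch()

    # [0-9] matches exactly one digit, so cwkfile10 and cwkfile11 are skipped.
    single_digit = sorted(p.name for p in cwk_dir.glob('cwkfile[0-9].crawler.idx'))
    # * matches any run of characters, so every file is included.
    everything = sorted(p.name for p in cwk_dir.glob('cwkfile*.crawler.idx'))

print(len(single_digit), len(everything))
```

Note that glob does not promise any particular order, which is why the results are sorted here.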

Here is a better approach that puts everything together in a more readable way:

from pathlib import Path


def get_offsets(header):
    return dict(
        company_name = header.find('Company Name'),
        form_type = header.find('Form Type'),
        cik = header.find('CIK'),
        file_date = header.find('Date Filed'),
        page_url = header.find('URL')
    )


def get_data(line, offsets):
    return dict(
        company_name = line[:offsets['form_type']].strip(),
        form_type = line[offsets['form_type']:offsets['cik']].strip(),
        cik = line[offsets['cik']:offsets['file_date']].strip(),
        file_date = line[offsets['file_date']:offsets['page_url']].strip(),
        page_url = line[offsets['page_url']:].strip()
    )


cwk_dir = Path('./cwkfiles')
types_and_urls = []
header_line = 7
first_data_line = 9

for cwkfile in cwk_dir.glob('cwkfile*.crawler.idx'):
    with cwkfile.open() as f:
        lines = f.readlines()
        offsets = get_offsets(lines[header_line])
        for line in lines[first_data_line:]:
            data = get_data(line, offsets)
            types_and_urls.append((data['form_type'], data['page_url']))
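The offset-based slicing used above can be checked on a single line without touching any files. The following is only a sketch: the header widths, the company name, and the URL are invented sample data, not the real layout of a crawler.idx file.

```python
# Synthetic fixed-width header and data row (column widths are assumptions).
header = ('Company Name'.ljust(20) + 'Form Type'.ljust(12) + 'CIK'.ljust(10)
          + 'Date Filed'.ljust(12) + 'URL')
row = ('Example Corp'.ljust(20) + '10-K'.ljust(12) + '1234567'.ljust(10)
       + '2019-12-17'.ljust(12) + 'https://example.com/filing')

# Each column starts where its title starts in the header line.
type_loc = header.find('Form Type')
cik_loc = header.find('CIK')
url_loc = header.find('URL')

# Slice the data row at the same positions and trim the padding.
form_type = row[type_loc:cik_loc].strip()
page_url = row[url_loc:].strip()

print((form_type, page_url))
```

This only works when every data row is padded to the same column positions as the header, which is the assumption both versions of the code above rely on.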

Answer 2

When you get to the second file, the while condition fails because y is already 10. Try setting y back to 0 before the while loop:

for files in cwk_dir:
    y = 0
    while y < 10:
        ...

Since you open the file on the first line inside the while loop, you probably need to close it when you exit the loop.
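One way to get the closing for free is a with block, which closes each file as soon as its body finishes. The sketch below assumes files numbered from 0 and uses a hypothetical helper name, read_numbered_files, with tiny made-up files in a temporary directory. It also reads the lines from the opened file object; in the question's code the result of open() is never used and the lines come from r.text, which may be why the same data shows up every time.

```python
import tempfile
from pathlib import Path


def read_numbered_files(directory, count):
    """Return the lines of cwkfile0 .. cwkfile<count-1>, one list per file."""
    all_lines = []
    for y in range(count):
        path = Path(directory) / 'cwkfile{}.crawler.idx'.format(y)
        # The with block closes the file automatically, even if reading fails.
        with path.open() as f:
            all_lines.append(f.read().splitlines())
    return all_lines


# Demo on three small made-up files.
with tempfile.TemporaryDirectory() as tmp:
    for y in range(3):
        sample = Path(tmp) / 'cwkfile{}.crawler.idx'.format(y)
        sample.write_text('line from file {}\n'.format(y))
    result = read_numbered_files(tmp, 3)

print(result)
```

Using range(count) in place of a manual counter also removes the need to reset y by hand.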
