Order of iteration through dict.items()

TLDR: If I build a dictionary at two separate times from the same data processed in the same way, should the order of dictionary.items() be the same each time?

Hello,

I have a dictionary linked_strain_acc which has about 2000 keys (strain names), and each key has another dictionary as a value (data).

linked_strain_acc = {'strain1' : {'gcf' : ['gcf1', 'gcf'..],
                                  'key2' : val2,
                                  .........},
                    'strain2' :  {.........},
                    ..........
                    'strain2000' :  {.........}}
          

I am iterating over a key ('gcf') in each data dictionary, which contains a list of gcf ids. I'm using the gcf ids to build a url for scraping, after checking that it hasn't already been scraped.

import os
import time
import requests

directory = r'C:\Users\u03132tk\.spyder-py3\scrape_dsmz\zip_files'
count = 0
start = time.time()
# allows you to stop and start: skip files already downloaded in a previous run
current_files = os.listdir(directory)
for strain, data in linked_strain_acc.items():
    for gcf in data['gcf']:
        count += 1
        filename = f'{strain}__{gcf}.zip'
        if filename not in current_files:
            download_url = f'https://antismash-db.secondarymetabolites.org/output/{gcf}/{gcf}.zip'
            response = requests.get(download_url)
            with open(fr'{directory}\{filename}', "wb") as infile:
                infile.write(response.content)
            print(f'downloaded {strain}, {gcf}')
        else:
            print(f'{strain}, {gcf} already scraped')
        if count % 50 == 0:
            print(f'downloaded {count} jsons - script has been running for {round((time.time() - start)/60, 1)} minutes')
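
(Side note, separate from the ordering question: `filename not in current_files` scans a list on every iteration; a set makes the membership test constant-time. A minimal sketch of that variant, reusing the directory and linked_strain_acc objects from the snippet above:)

import os

# same resume logic as above, but with a set for O(1) membership tests
current_files = set(os.listdir(directory))

for strain, data in linked_strain_acc.items():
    for gcf in data['gcf']:
        filename = f'{strain}__{gcf}.zip'
        if filename in current_files:
            continue  # already downloaded on a previous run
        # ... download as in the main loop, then remember the new file
        current_files.add(filename)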

Question

I have already scraped about 1500 of the gcf urls and downloaded the files (out of the 2000ish total). When I ran it again this morning, instead of printing '{strain}, {gcf} already scraped' for the first 1500 print statements, it's alternating between a couple of '{strain}, {gcf} already scraped' messages and 'downloaded {strain}, {gcf}' messages. This implies that the order of the linked_strain_acc dictionary has changed.

I made this dictionary from a CSV file which was processed in exactly the same way each time to make linked_strain_acc. Why would the order of the dict change, or am I missing something? I know that dict key/val order isn't sorted by e.g. alphabet or size, but I thought it would be maintained when the dict is built from exactly the same data.
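
A minimal check of that assumption, with dummy rows standing in for the real CSV data (in CPython 3.7+ a dict preserves insertion order, so two dicts fed the same key/value pairs in the same order should iterate identically):

rows = [('strain1', {'gcf': ['gcf1']}), ('strain2', {'gcf': ['gcf2']})]

# build the "same" dictionary twice from the same rows, in the same order
d1 = {strain: data for strain, data in rows}
d2 = {strain: data for strain, data in rows}

# both items() views should come out in the same order
print(list(d1.items()) == list(d2.items()))   # expected: True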

Thanks!

In older versions, Python was using string pools to efficiently store longer strings by pooling shorter common segments. Every time you create a string, it may change the pool, and hence the order. The strings you dynamically create in

download_url = f'https://antismash-db.secondarymetabolites.org/output/{gcf}/{gcf}.zip'

may change the pool depending on your starting point. For reference: https://en.wikipedia.org/wiki/String_interning
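
A minimal illustration of interning itself, assuming CPython (the strings here are made up for the example; whether interning affects dict ordering in any given Python version is a separate matter):

import sys

# two equal strings built at runtime are usually distinct objects...
a = 'gcf_' + str(900000)
b = 'gcf_' + str(900000)
print(a == b, a is b)                    # typically: True False

# ...but explicit interning pools them into a single shared object
print(sys.intern(a) is sys.intern(b))    # True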
