
Python iteration order on a set

I am parsing two big files (on the order of gigabytes each), each containing keys and corresponding values. Some keys are shared between the two files, but with different corresponding values. For each of the files, I want to write to a new file the keys* and corresponding values, where keys* are the keys present in both file1 and file2. I don't care about the key order in the output, but the keys must absolutely be in the same order in the two output files.

File 1:

key1
value1-1
key2
value1-2
key3
value1-3

File 2:

key1
value2-1
key5
value2-5
key2
value2-2

A valid output would be:

Parsed File 1:

key1
value1-1
key2
value1-2

Parsed File 2:

key1
value2-1
key2
value2-2

Another valid output:

Parsed File 1:

key2
value1-2
key1
value1-1

Parsed File 2:

key2
value2-2
key1
value2-1

An invalid output (keys in a different order in file 1 and file 2):

Parsed File 1:

key2
value1-2
key1
value1-1

Parsed File 2:

key1
value2-1
key2
value2-2

One last detail: the values are far larger than the keys.

What I am thinking of doing is:

  • For each input file, parse it and return a dict (let's call it file_index) whose keys are the keys in the file and whose values are the offsets at which each key was found in the input file.

  • Compute the intersection:

     good_keys = file1_index.viewkeys() & file2_index.viewkeys() 
  • Do something like (pseudo-code):

     for each file:
         for good_key in good_keys:
             offset = file_index[good_key]
             go to offset in input_file
             get corresponding value
             write (key, value) to output file
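The plan above could be sketched roughly as follows. This is only an illustration under some assumptions: each file alternates one key line and one value line, the helper names `build_index` and `write_common` are made up for this example, and the code targets Python 3, where `dict.keys()` supports `&` the way `viewkeys()` does in Python 2. Unlike the description above, the index here stores the offset of the value line rather than of the key line, which saves one read per lookup.

```python
def build_index(path):
    """Map each key to the byte offset of its value line (hypothetical helper)."""
    index = {}
    # Binary mode keeps tell()/seek() offsets exact and cheap.
    with open(path, "rb") as f:
        while True:
            key_line = f.readline()
            if not key_line:
                break  # end of file
            index[key_line.rstrip(b"\r\n")] = f.tell()  # value starts here
            f.readline()  # skip the (large) value without storing it
    return index


def write_common(src_path, dst_path, index, ordered_keys):
    """Copy the (key, value) pairs for ordered_keys, in that order."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for key in ordered_keys:
            src.seek(index[key])
            value = src.readline().rstrip(b"\r\n")
            dst.write(key + b"\n" + value + b"\n")
```

A possible usage would be `good_keys = list(index1.keys() & index2.keys())`, followed by one `write_common` call per input file with that same list. Passing one list to both calls makes the shared order explicit instead of relying on set iteration.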

Does iterating over the same set twice guarantee exactly the same order (provided that it is the same set: I won't modify it between the two iterations), or should I convert the set to a list first and iterate over the list?

Python's dicts and sets are stable, that is, if you iterate over them without changing them they are guaranteed to give you the same order. This is from the documentation on dicts:

Keys and values are iterated over in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary's history of insertions and deletions. If keys, values and items views are iterated over with no intervening modifications to the dictionary, the order of items will directly correspond.
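A small check of the correspondence the documentation describes (the dictionary contents are made up for the example):

```python
d = {"banana": 3, "apple": 4, "pear": 1}

# With no modification in between, the keys view and the values view
# line up element by element with the items view.
pairs = list(zip(d.keys(), d.values()))
matches_items = pairs == list(d.items())
print(matches_items)
```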

Iteration over an unmodified set will always give you the same order. The order is determined by the current values and their insertion history.
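As a quick illustration using the keys from the question, two passes over the same unmodified set agree with each other. This holds within one interpreter run; on Python 3, string hashing is randomized per process by default, so the order may differ between separate runs. If the two output files were ever written by different runs, an explicitly sorted list would be the safer choice.

```python
file1_keys = {"key1", "key2", "key3"}
file2_keys = {"key1", "key5", "key2"}
good_keys = file1_keys & file2_keys

# Two passes over the same, unmodified set yield the same sequence.
first_pass = list(good_keys)
second_pass = list(good_keys)
print(first_pass == second_pass)

# An explicit, reproducible order that does not depend on hashing.
ordered_keys = sorted(good_keys)
```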

See "Why is the order in dictionaries and sets arbitrary?" if you are interested in why that is.

Note that if you want to modify your files in place, that will only work if your entries have a fixed size. A file cannot be updated somewhere in the middle when the update consists of fewer or more characters than the text it replaces.

Data in a file is like a magnetic tape: to replace data in the middle with something longer or shorter you would have to splice in a piece, and you can't do that with a file. You have to rewrite everything following the replaced key-value pair to make the rest fit.
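To make the point concrete, here is a minimal sketch with two hypothetical helpers (the names are mine, not from any library). The first overwrites bytes at an offset, which is safe only when the new data has exactly the same length as the old. The second rewrites everything after the replaced span. It reads the whole tail into memory, so it only illustrates the idea; for gigabyte-sized files, writing to a new file is the more practical route.

```python
def overwrite_in_place(path, offset, new_bytes):
    """Write new_bytes at offset. Longer data clobbers what follows."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(new_bytes)


def replace_and_rewrite_tail(path, offset, old_len, new_bytes):
    """Replace old_len bytes at offset, then rewrite the rest of the file."""
    with open(path, "r+b") as f:
        f.seek(offset + old_len)
        tail = f.read()          # everything after the replaced span
        f.seek(offset)
        f.write(new_bytes + tail)
        f.truncate()             # drop leftover bytes if the file shrank
```

For example, on a file containing `key1\nAAA\nkey2\nBBB\n`, overwriting the 4-byte value at offset 5 with the 7-byte `LONGER\n` through the first helper eats into the `key2` line, while the second helper keeps it intact.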

As already stated, dicts and sets are stable and provide the same order as long as you don't change them. If you want a specific order, you can use OrderedDict.

From the collections library docs:

>>> from collections import OrderedDict

>>> # regular unsorted dictionary
>>> d = {'banana': 3, 'apple':4, 'pear': 1, 'orange': 2}

>>> # dictionary sorted by key -- OrderedDict(sorted(d.items())) also works
>>> OrderedDict(sorted(d.items(), key=lambda t: t[0]))
OrderedDict([('apple', 4), ('banana', 3), ('orange', 2), ('pear', 1)])

>>> # dictionary sorted by value
>>> OrderedDict(sorted(d.items(), key=lambda t: t[1]))
OrderedDict([('pear', 1), ('orange', 2), ('banana', 3), ('apple', 4)])

>>> # dictionary sorted by length of the key string
>>> OrderedDict(sorted(d.items(), key=lambda t: len(t[0])))
OrderedDict([('pear', 1), ('apple', 4), ('orange', 2), ('banana', 3)])
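Applied to the question, one possible variation (my own sketch, not part of the quoted docs) is to keep the first file's index in an OrderedDict and filter it by membership in the second index. The shared keys then come out in file 1's order, as a list that can be reused for both output files. The offsets below are illustrative values only.

```python
from collections import OrderedDict

# Hypothetical indexes: key -> byte offset of the value in each input file.
file1_index = OrderedDict([("key1", 5), ("key2", 19), ("key3", 33)])
file2_index = {"key1": 5, "key5": 19, "key2": 33}

# Keys present in both files, in the order they appear in file 1.
good_keys = [key for key in file1_index if key in file2_index]
print(good_keys)
```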
