
Merge sort in Python

Basically, I have a bunch of files containing domains. I've sorted each individual file by its TLD using .sort(key=func_that_returns_tld).

Now that I've done that, I want to merge all the files and end up with one massive sorted file. I assume I need something like this:

open all files
read one line from each file into a list
sort list with .sort(key=func_that_returns_tld)
output that list to file
loop by reading next line

Am I thinking about this right? Any advice on how to accomplish this would be appreciated.

If your files are not very large, then simply read them all into memory (as S. Lott suggests). That would definitely be simplest.

However, you mention that the merge creates one "massive" file. If it's too massive to fit in memory, then perhaps use heapq.merge. It may be a little harder to set up, but it has the advantage of not requiring that all the iterables be pulled into memory at once.

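import heapq
import contextlib

class Domain:
    def __init__(self, domain):
        self.domain = domain

    @property
    def tld(self):
        # Placeholder key: this returns the first label of the domain.
        # Put your function for calculating TLD here.
        return self.domain.split('.', 1)[0]

    def __lt__(self, other):
        # heapq.merge only needs a strict "less than" comparison
        return self.tld < other.tld

    def __str__(self):
        return self.domain

def domains(fh):
    # Lazily wrap each line of an open file in a Domain
    for line in fh:
        yield Domain(line.strip())

filenames = ('data1.txt', 'data2.txt')
# ExitStack closes every file on exit (Python 3 replacement for contextlib.nested)
with contextlib.ExitStack() as stack:
    fhs = [stack.enter_context(open(filename)) for filename in filenames]
    for elt in heapq.merge(*(domains(fh) for fh in fhs)):
        print(elt)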

With data1.txt:

google.com
stackoverflow.com
yahoo.com

and data2.txt:

standards.freedesktop.org
www.imagemagick.org

yields:

google.com
stackoverflow.com
standards.freedesktop.org
www.imagemagick.org
yahoo.com
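On Python 3.5 and later, heapq.merge also accepts a key argument, so a wrapper class is not strictly needed. The following is a minimal sketch under that assumption. The names tld_key and merge_sorted_files are hypothetical, and tld_key simply treats the last dot-separated label as the TLD, which may not match your own func_that_returns_tld. Every input file must already be sorted by the same key.

```python
import heapq
from contextlib import ExitStack

def tld_key(line):
    # Hypothetical key: treat the last dot-separated label as the TLD.
    return line.strip().rsplit('.', 1)[-1]

def merge_sorted_files(filenames, out_path, key=tld_key):
    """Merge files that are each already sorted by `key` into out_path."""
    with ExitStack() as stack:
        # ExitStack closes every opened file when the block exits
        sources = [stack.enter_context(open(name)) for name in filenames]
        out = stack.enter_context(open(out_path, 'w'))
        # heapq.merge reads lazily, holding only one pending line per file
        for line in heapq.merge(*sources, key=key):
            # Guard against a final line that lacks a trailing newline
            out.write(line if line.endswith('\n') else line + '\n')
```

Calling merge_sorted_files(['data1.txt', 'data2.txt'], 'merged.txt') would write the merged result to merged.txt (a file name chosen here only for illustration) without loading the inputs fully into memory.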

Unless your file is incomprehensibly huge, it will fit into memory.

Your pseudo-code is hard to read. Please indent your pseudo-code correctly. The final "loop by reading next line" makes no sense.

Basically, it's this.

all_data= []
for f in list_of_files:
    with open(f,'r') as source:
        all_data.extend( source.readlines() )
all_data.sort(key=func_that_returns_tld)  # the key function from the question

You're done. You can write all_data to a file, process it further, or do whatever else you want with it.
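As a self-contained sketch of the same in-memory approach, including the write-out step: the names tld_key and sort_files_in_memory are hypothetical, and tld_key assumes the TLD is the last dot-separated label, so substitute your own key function if it differs.

```python
def tld_key(line):
    # Hypothetical key: treat the last dot-separated label as the TLD.
    return line.strip().rsplit('.', 1)[-1]

def sort_files_in_memory(list_of_files, out_path, key=tld_key):
    """Read every file into one list, sort it by `key`, write it out."""
    all_data = []
    for f in list_of_files:
        with open(f) as source:
            # Strip newlines and skip blank lines so that a missing
            # final newline cannot glue two entries together on output
            all_data.extend(line.strip() for line in source if line.strip())
    all_data.sort(key=key)
    with open(out_path, 'w') as out:
        out.writelines(line + '\n' for line in all_data)
    return all_data
```

Because list.sort is stable, entries that share a TLD keep the order in which they were read.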

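Another option (again, only if all of your data won't fit into memory) is to create an SQLite3 database, do the sorting there, and then write the result to a file.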
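A minimal sketch of that idea using the standard sqlite3 module follows. The table layout, the function name sort_with_sqlite, and the tld_key helper (which assumes the TLD is the last dot-separated label) are all assumptions made for illustration. Pass a path to a new database file so the data lives on disk; ':memory:' works too but keeps everything in RAM.

```python
import sqlite3

def tld_key(domain):
    # Hypothetical key: treat the last dot-separated label as the TLD.
    return domain.rsplit('.', 1)[-1]

def sort_with_sqlite(filenames, out_path, db_path):
    """Load domains into SQLite, then write them out ordered by TLD."""
    conn = sqlite3.connect(db_path)
    try:
        # Assumes db_path does not already contain a 'domains' table
        conn.execute('CREATE TABLE domains (tld TEXT, domain TEXT)')
        for name in filenames:
            with open(name) as source:
                stripped = (line.strip() for line in source)
                rows = ((tld_key(d), d) for d in stripped if d)
                # executemany consumes the generator row by row
                conn.executemany('INSERT INTO domains VALUES (?, ?)', rows)
        conn.commit()
        with open(out_path, 'w') as out:
            query = 'SELECT domain FROM domains ORDER BY tld, domain'
            for (domain,) in conn.execute(query):
                out.write(domain + '\n')
    finally:
        conn.close()
```

With this approach the input files do not need to be sorted beforehand, since the ORDER BY clause does all of the sorting.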
