How to sort a text file line-by-line

I need to sort a text file in ascending order. Each line of the text file starts with an index, as seen below:

2       0       4         0d 07:00:38.0400009155273
3       0       4         0d 07:00:38.0400009155273
1       0       4         0d 07:00:38.0400009155273   

The ideal result would be as follows:

1       0       4         0d 07:00:38.0400009155273
2       0       4         0d 07:00:38.0400009155273
3       0       4         0d 07:00:38.0400009155273 

Please note that this text file has more than 3 million rows, and each element is read as a string.

I've been messing around with this for some time now without any luck, so I figured it was time to consult the experts. Thank you for your time!

EDIT:

I'm using Windows with Python 3.7 in the Spyder IDE. The file is not a CSV; it is a tab-delimited text file. Some indices may be missing. Forgive the noob-ness, I don't have a lot of coding experience.

fn = 'filename.txt'
sorted_fn = 'sorted_filename.txt'

with open(fn,'r') as first_file:
    rows = first_file.readlines()
    sorted_rows = sorted(rows, key=lambda x: int(x.split()[0]), reverse=False)
    with open(sorted_fn,'w') as second_file:
        for row in sorted_rows:
            second_file.write(row)

This should work for a text file of 3+ million rows. Using int(x.split()[0]) as the key sorts the rows by the first item in each row, treated as an integer.

Edited to remove close() statements.
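
To see why the integer conversion matters, here is a small sketch with made-up rows (the sample values are assumptions for illustration only). Sorting the raw strings compares them character by character, so 10 lands before 2, while the int key gives numeric order:

```python
# Hypothetical sample rows, tab-delimited like the file in the question
rows = ["10\t0\t4\n", "2\t0\t4\n", "1\t0\t4\n"]

# Plain string comparison is lexicographic: "10" sorts before "2"
as_strings = sorted(rows)

# Converting the first field to int gives numeric order
as_integers = sorted(rows, key=lambda x: int(x.split()[0]))

print([r.split()[0] for r in as_strings])   # ['1', '10', '2']
print([r.split()[0] for r in as_integers])  # ['1', '2', '10']
```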

I would go about this by reading the file into lines, splitting them on whitespace and then sorting them according to a custom key; i.e., if your file were called "foo.txt":

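with open("foo.txt") as file:
    lines = file.readlines()

# sorted() returns a new list and leaves its argument untouched,
# so sort in place to make `lines` itself sorted
lines.sort(key=lambda line: int(line.split()[0]))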

After that, lines should contain all lines sorted by the first column.

However, I don't know how well this would work given your file size. You might have to split the file's contents into chunks, sort each chunk on its own, and then merge the sorted chunks.
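
If the file ever did outgrow memory, the chunking idea could be sketched roughly as below. This is not from the original answer: the function name sort_large_file, the helper first_int and the default chunk_size are all hypothetical, and the sketch assumes every line starts with an integer (a blank line would raise ValueError). Each chunk is sorted and written to a temporary file, and heapq.merge then merges the sorted chunk files lazily.

```python
import heapq
import os
import tempfile
from itertools import islice


def first_int(line):
    # Sort key: the first whitespace-separated field, as an integer
    return int(line.split()[0])


def sort_large_file(in_path, out_path, chunk_size=500_000):
    """Sort in_path by its first column without holding every line in memory."""
    chunk_paths = []
    try:
        with open(in_path, "r") as src:
            while True:
                # Read at most chunk_size lines at a time
                chunk = list(islice(src, chunk_size))
                if not chunk:
                    break
                # Ensure the final line ends with a newline so merged lines stay separate
                if not chunk[-1].endswith("\n"):
                    chunk[-1] += "\n"
                chunk.sort(key=first_int)
                # Write the sorted chunk to its own temporary file
                fd, path = tempfile.mkstemp(suffix=".chunk")
                with os.fdopen(fd, "w") as tmp:
                    tmp.writelines(chunk)
                chunk_paths.append(path)

        # Merge the already-sorted chunk files line by line
        files = [open(p, "r") for p in chunk_paths]
        try:
            with open(out_path, "w") as dst:
                dst.writelines(heapq.merge(*files, key=first_int))
        finally:
            for f in files:
                f.close()
    finally:
        # Remove the temporary chunk files
        for p in chunk_paths:
            os.remove(p)
```

For 3 million short rows, a plain in-memory sort is likely fine on most machines, so this is only worth the extra complexity if memory actually becomes a problem.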

I would use a simple split() to format the data into a dictionary that looks like:

my_data = {
 2: ['0', '4', '0d', '07:00:38.0400009155273'],
 3: ['0', '4', '0d', '07:00:38.0400009155273'],
 1: ['0', '4', '0d', '07:00:38.0400009155273']
}

You could then iterate through it (assuming all keys exist) like this:

for i in range(1, max(list(my_data.keys())) + 1):
    pass # do some computation

Additionally, you could single out a specific value, like my_data[1].
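
Since the question mentions that some indices may be missing, a loop over the full range would raise KeyError at a gap. A minimal sketch of two ways around that, using made-up data in which index 2 is absent (the names present and missing are just for illustration):

```python
# Hypothetical data with index 2 missing
my_data = {
    3: ['0', '4', '0d', '07:00:38.0400009155273'],
    1: ['0', '4', '0d', '07:00:38.0400009155273'],
}

# Option 1: iterate only over the keys that exist, in ascending order
present = sorted(my_data)

# Option 2: walk the full range and use .get() to detect the gaps
missing = [i for i in range(1, max(my_data) + 1) if my_data.get(i) is None]

print(present)  # [1, 3]
print(missing)  # [2]
```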

To put your data in this form, I would use this script:

with open("foo.txt", "r") as file:
    in_data = file.readlines()

my_data = {}
for data in in_data:
    # split() with no argument handles tabs, repeated spaces and the trailing newline
    split_info = data.split()
    if not split_info:
        continue  # skip blank lines
    # Use an integer key so that 10 sorts after 9 rather than after 1
    my_data[int(split_info[0])] = split_info[1:]

for key in sorted(my_data):
    print("{}: {}".format(key, my_data[key]))

Which prints:

1: ['0', '4', '0d', '07:00:38.0400009155273']
2: ['0', '4', '0d', '07:00:38.0400009155273']
3: ['0', '4', '0d', '07:00:38.0400009155273']

Use pandas; it will help you immensely. Assuming the file is a CSV, do the following:

import pandas as pd
# read_csv takes index_col (not index) to choose the index column
df = pd.read_csv('to/file', sep='\t', index_col='Name of column with index')  # Guessing that your file is tab separated
df.sort_index(inplace=True)

Now you have a dataframe with all of the information you need, sorted. I'd suggest digging into pandas since it will really help you out. Here is a link to get started: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

Here's an edited version of a perfectly good answer you already have. The edits might be useful as you learn more about coding. The key points:

  • When writing a program, it's often best to do your coding with a small sample of the input data (for example, a file with 30 rows rather than 3 million): your program will run quicker, debugging output will be smaller and more readable, and there are other benefits as well. Thus, rather than hard-coding the path to the input file (or other files), take those file paths as command-line parameters, using sys.argv.

     import sys
     in_path = sys.argv[1]
     out_path = sys.argv[2]

  • If you are holding a lot of data in memory (enough to make you think you are close to your machine's limits), don't create unneeded copies of the data. For example, to ignore the first few lines, don't store the original lines in rows and then get the desired values using rows[2:]: that creates a new list. Instead, add the conditional logic to your initial creation of rows (the example uses a list comprehension, but you can do the same thing in a regular for loop). And if you need to sort that data, don't use sorted(), which creates a new list; instead, sort the list in place with rows.sort().

     with open(in_path, 'r') as fh:
         rows = [line for i, line in enumerate(fh) if i > 1]
     rows.sort(key = lambda x: int(x.split(None, 1)[0]))

  • There's no reason to nest the writing with-block inside the reading with-block. If you don't have a good reason to connect two different tasks within a program, explicitly separate them. This is among the most important keys to writing better software.

     with open(out_path, 'w') as fh:
         for r in rows:
             fh.write(r)
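
Put together, the three points above might look like the following sketch. The function name sort_file, its skip parameter (how many leading lines to ignore, defaulting to none) and the script name in the usage message are assumptions added for illustration; they are not part of the original answer.

```python
import sys


def sort_file(in_path, out_path, skip=0):
    """Sort the lines of in_path by their leading integer and write them to out_path."""
    # Read: drop the first `skip` lines while building the list, so no extra copy is made
    with open(in_path, 'r') as fh:
        rows = [line for i, line in enumerate(fh) if i >= skip]

    # Sort in place by the first whitespace-separated field
    rows.sort(key=lambda x: int(x.split(None, 1)[0]))

    # Write: a separate with-block, independent of the reading step
    with open(out_path, 'w') as fh:
        fh.writelines(rows)


if __name__ == '__main__':
    if len(sys.argv) == 3:
        sort_file(sys.argv[1], sys.argv[2])
    else:
        # Hypothetical script name, shown only as a usage hint
        print('usage: python sort_file.py INPUT OUTPUT')
```

Wrapping the logic in a function also makes it easy to try on a 30-row sample file before running it on the full 3 million rows.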

A one-stop solution would be to do the reading, sorting and writing all with one file handle, thanks to 'r+' mode:

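with open('your_file.txt', 'r+') as f:
    # split() with no argument also handles the tab-delimited file from the question
    sorted_contents = ''.join(sorted(f.readlines(), key=lambda x: int(x.split()[0])))
    f.seek(0)       # go back to the start of the file
    f.truncate()    # discard the old contents
    f.write(sorted_contents)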
