Sorting lines of text file numerically when the file has strange formatting

I'm having trouble working out how to do this. I have a txt file that tracks birthdays, like this:

**January birthdays:**
**17** - !@Mark
**4** - !@Jan
**15** - !@Ralph

**February birthdays:**
**27** - !@Steve
**19** - !@Bill
**29** - !@Bob

The list continues for every month, and each month is separated by a blank line. How on Earth do you sort the days sequentially with formatting like this?

For example, January should be:

**January birthdays:**
**4** - !@Jan
**15** - !@Ralph 
**17** - !@Mark

What I've brainstormed:

I thought maybe I could use readlines() from specific indexes, save each line to a list, check the integer somehow, and then rewrite the file properly. But this seems tedious and, frankly, like the wrong idea entirely.

I also considered using partial() to read until a stop condition, such as the line for the next month, and then sort somehow based on that.

Does Python offer any easier way to do something like this?

You can do it as follows.

Code

import re

def order_month(month_of_entries):
    '''
        Order lines for a Month of entries
    '''
    # Sort key based upon number in line
    # First line in Month does not have a number, 
    # so key function returns 0 for it so it stays first
    month_of_entries.sort(key=lambda x: int(p.group(0)) if (p := re.search(r'\d+', x)) else 0)
            
# Process input file
with open('input.txt', 'r') as file:
    results = []
    months_data = []
    for line in file:
        line = line.rstrip()
        if line:
            months_data.append(line)
        else:
            # blank line
            # Order lines for this month
            order_month(months_data)
            results.append(months_data)
            
            # Setup for next month
            months_data = []
    else:
        # Reached end of file
        # Order lines for last month
        if months_data:
            order_month(months_data)
            results.append(months_data)
               
# Write to output file
with open('output.txt', 'w') as file:
    for i, months_data in enumerate(results):
        # Looping over each month
        for line in months_data:
            file.write(line + '\n')
        # Add blank line if not last month
        if i < len(results) - 1:
            file.write('\n')           
         

Output

**January birthdays:**
**4** - !@Jan
**15** - !@Ralph
**17** - !@Mark

**February birthdays:**
**19** - !@Bill
**27** - !@Steve
**29** - !@Bob
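
To see why the header line stays first within its month, the sort key can be tried on its own. This is only a minimal sketch: the function name `day_key` and the in-memory list `january` are made up for illustration and stand in for the lambda and the contents of input.txt.

```python
import re

def day_key(line):
    """Return the first integer found in the line, or 0 if there is none."""
    match = re.search(r'\d+', line)
    return int(match.group(0)) if match else 0

# Hypothetical in-memory sample mirroring the January block from the question
january = [
    '**January birthdays:**',
    '**17** - !@Mark',
    '**4** - !@Jan',
    '**15** - !@Ralph',
]

# The header has no digits, so its key is 0 and it sorts ahead of every day
for line in sorted(january, key=day_key):
    print(line)
```

Because the key is an int, 4 sorts before 15 and 17, which a plain string sort would not guarantee.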

Alternative that also sorts the months if necessary

import re
from itertools import accumulate
from datetime import date

# Month names must exist before find_month is defined, because its default
# pattern argument is built from them at definition time
months_of_year = [date(2021, i, 1).strftime('%B') for i in range(1, 13)]

def find_day(s, pattern=re.compile(r'\d+')):
    ' Day number in line: 99 for a blank line, 0 for a line with no number (month header)'
    return 99 if not s.strip() else int(p.group(0)) if (p:=pattern.search(s)) else 0

def find_month(previous, s, pattern=re.compile(fr"^\*\*({'|'.join(months_of_year)})")):
    ' Index of Month in year (i.e. 0-11), carried over from the previous line if none is named'
    return months_of_year.index(p.group(1)) if (p:=pattern.search(s)) else previous

with open('test.txt') as infile:
    lines = infile.readlines()

# initial=-1 keeps every month value an int, so the tuples below stay comparable
months = list(accumulate(lines, func=find_month, initial=-1))[1:]   # Month index for each line
days = (find_day(line) for line in lines)                            # Day for each line

# Sort lines based upon their month and day
result = (x[-1] for x in sorted(zip(months, days, lines), key=lambda x: x[:2]))

with open('output.txt', 'w') as outfile:
    outfile.writelines(result)
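
The less obvious step in this alternative is how `accumulate` assigns a month to every line. The sketch below isolates that step under simplified assumptions: `MONTHS` is shortened to two names, `carry_month` is a hypothetical helper that uses `startswith` instead of a regular expression, and the lines are held in memory. The `initial` argument needs Python 3.8 or later.

```python
from itertools import accumulate

MONTHS = ['January', 'February']  # shortened list, for illustration only

def carry_month(previous, line):
    """Return the month index named in a header line, else the previous index."""
    for index, name in enumerate(MONTHS):
        if line.startswith('**' + name):
            return index
    return previous

lines = [
    '**January birthdays:**',
    '**17** - !@Mark',
    '',
    '**February birthdays:**',
    '**27** - !@Steve',
]

# initial=-1 seeds the running value; [1:] drops that seed from the result
month_per_line = list(accumulate(lines, carry_month, initial=-1))[1:]
print(month_per_line)  # [0, 0, 0, 1, 1]
```

Each non-header line inherits the index of the most recent header, so sorting on (month, day) keeps every line with its own month.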
    

This program runs under Windows or Linux, both of which have a sort program. It works by reading each line of the input file and prepending 4 characters to it: a 2-digit month number and a 2-digit day number (for the blank line between months it uses '99' as the day number so that it follows all the birthdays for the month). It then pipes these modified lines to the sort program, processes the piped output to remove the first 4 characters, and rewrites the file in place, which means you might want to make a backup of the file before running this in case the computer goes down midway through the processing. It shouldn't be too difficult to modify the code to write the output to a separate file.

This technique is used because no assumption is made about the size of the file: there could be millions of birthdays for a given month. As long as the sort program can handle the input, this program can.

from subprocess import Popen, PIPE
import sys
import re

p = Popen('sort', stdin=PIPE, stdout=PIPE, shell=True, text=True)
month_no = 0
with open('test.txt', 'r+') as f:
    for line in f:
        if " birthdays:**" in line:
            month_no += 1
            p.stdin.write("%02d00" % month_no)
        else:
            m = re.match(r'\*\*(\d+)\*\*', line)
            if m:
                p.stdin.write("%02d%02d" % (month_no, int(m[1])))
            else:
                # blank line?
                p.stdin.write("%02d99" % month_no)
        p.stdin.write(line)
    p.stdin.close()
    f.seek(0, 0) # reposition back to beginning
    for line in p.stdout:
        f.write(line[4:]) # skip over the 4-character sort prefix
    f.truncate() # this really shouldn't be necessary
p.wait()
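
The prefix idea can also be tried without the external sort program. The sketch below swaps in Python's built-in `sorted` for the piped `sort`, so it holds every line in memory and gives up the large-file advantage described above; it is meant only to show the prefixing and stripping. The function name `sort_with_prefix` and the sample lines are assumptions made for illustration.

```python
import re

def sort_with_prefix(lines):
    """Sort birthday lines by a 4-character MMDD prefix, then strip the prefix."""
    month_no = 0
    decorated = []
    for line in lines:
        if ' birthdays:**' in line:
            month_no += 1
            prefix = '%02d00' % month_no      # header sorts first in its month
        else:
            m = re.match(r'\*\*(\d+)\*\*', line)
            if m:
                prefix = '%02d%02d' % (month_no, int(m[1]))
            else:
                prefix = '%02d99' % month_no  # blank line sorts last in its month
        decorated.append(prefix + line)
    # Plain string comparison works because the prefix is zero-padded
    return [entry[4:] for entry in sorted(decorated)]

sample = [
    '**January birthdays:**\n',
    '**17** - !@Mark\n',
    '**4** - !@Jan\n',
    '\n',
    '**February birthdays:**\n',
    '**27** - !@Steve\n',
    '**19** - !@Bill\n',
]
print(''.join(sort_with_prefix(sample)), end='')
```

Passing the returned list to `writelines` on a second file opened for writing would leave the original file untouched.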

Using collections.defaultdict is very handy here because you don't need to do any checking; you can just add data. You read the file while keeping the current month in a variable. If a line starts a new month, update the variable; if it is a date, get the day and append the string. (This allows multiple people to have the same birthday.)

from collections import defaultdict

data = defaultdict(lambda: defaultdict(list))

with open('filename.txt') as infile:
    month = next(infile).strip()
    for line in infile:
        if not line.strip(): continue
        if line[2].isalpha():
            month = line.strip()
        else:
            data[month][int(line.split('**')[1])].append(line.strip())

This gets your data neatly into a dict, as shown below for your example:

{'**January birthdays:**': {17: ['**17** - !@Mark'], 4: ['**4** - !@Jan'], 15: ['**15** - !@Ralph']},
 '**February birthdays:**': {27: ['**27** - !@Steve'], 19: ['**19** - !@Bill'], 29: ['**29** - !@Bob']}}

From here you loop back through the data, sorting the days as you go, and write to the file.

with open('filename.txt', 'w') as outfile:
    for month, days in data.items():
        outfile.write(month + '\n')
        for day in sorted(days):
            for day_text in days[day]:
                outfile.write(day_text + '\n')
        outfile.write('\n')
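
The point about shared birthdays can be checked with a small in-memory run. This is a sketch under assumptions: `group_birthdays` is a hypothetical wrapper around the same nested defaultdict, and the entry for `!@Ann` is invented sample data that is not in the original file.

```python
from collections import defaultdict

def group_birthdays(lines):
    """Map each month header to {day: [lines]} using nested defaultdicts."""
    data = defaultdict(lambda: defaultdict(list))
    month = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # skip blank separator lines
        if line[2:3].isalpha():           # header such as **January birthdays:**
            month = line
        else:
            day = int(line.split('**')[1])
            data[month][day].append(line)
    return data

header = '**January birthdays:**'
# '!@Ann' is made-up data that shares a day with '!@Mark'
sample = [header, '**17** - !@Mark', '**4** - !@Jan', '**17** - !@Ann']

grouped = group_birthdays(sample)
for day in sorted(grouped[header]):
    print(day, grouped[header][day])
```

Both entries for day 17 end up in the same list, in the order they appeared in the input.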


 