
Python reading file and analysing lines with substring

In Python, I'm reading a large file with many lines. Each line contains a number in square brackets followed by a string, such as:

[37273738] Hello world!
[83847273747] Hey my name is James!

And so on...

After I read the txt file and put it into a list, how can I extract the number and then sort the whole line based on that number?

file = open("info.txt","r")
myList = []

for line in file:
    line = line.split()
    myList.append(line)

What I would like to do:

Since the number in the first message falls between 37273700 and 38000000, I'd put that line (along with all other lines that follow that rule) into a separate list.

This does exactly what you need (for the sorting part):

my_sorted_list = sorted(myList, key=lambda line: int(line[0][1:-1]))
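
A minimal sketch of the same idea on the question's two sample lines, held in memory instead of read from info.txt. The name sample_lines is made up for illustration. It assumes each line has already been split into tokens, so the first token is the bracketed number and the slice [1:-1] removes just the two brackets.

```python
# Sample data from the question, deliberately in reverse numeric order.
sample_lines = [
    "[83847273747] Hey my name is James!",
    "[37273738] Hello world!",
]

# Split each line into tokens, as the loop in the question does.
my_list = [line.split() for line in sample_lines]

# line[0] is a token such as "[37273738]"; [1:-1] drops the brackets,
# and int() makes the comparison numeric.
my_sorted_list = sorted(my_list, key=lambda line: int(line[0][1:-1]))

for tokens in my_sorted_list:
    print(" ".join(tokens))
```

Note that split() followed by join() collapses runs of whitespace, so keep the original line around if the exact spacing matters.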

Use a tuple as the key value:

for line in file:
    line = line.split()
    # Convert the bracketed token to int so the sort is numeric, not lexicographic
    keyval = (int(line[0].strip('[]')), line[1:])
    print(keyval)
    myList.append(keyval)

Sort:

my_sorted_list = sorted(myList, key=lambda line: line[0])
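
The tuple approach only sorts numerically if the first element is an int. The sketch below shows the difference on a few in-memory lines; the helper to_keyval and the third sample line are hypothetical, and it assumes every line has the form "[digits] text".

```python
sample_lines = [
    "[83847273747] Hey my name is James!",
    "[37273738] Hello world!",
    "[9] Short number",  # extra illustrative line, not from the question
]

def to_keyval(line):
    """Return (number, rest_of_line) for a line like '[123] some text'."""
    number_part, _, text = line.partition("] ")
    return int(number_part.lstrip("[")), text

pairs = [to_keyval(line) for line in sample_lines]

# Tuples compare element by element, so a plain sort orders by the number.
pairs.sort()

# For contrast: the same numbers sorted as strings end up in a different order.
as_strings = sorted(str(number) for number, _ in pairs)

print([number for number, _ in pairs])
print(as_strings)
```

Sorted as integers, 9 comes first; sorted as strings, "9" comes last because comparison goes character by character.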

How about:

# ---
# Function which gets a number from a line like so:
#  - searches for the pattern: start_of_line, [, sequence of digits
#  - if that's not found (e.g. empty line) return 0
#  - if it is found, try to convert it to a number type
#  - return the number, or 0 if that conversion fails

def extract_number(line):
    import re
    search_result = re.findall(r'^\[(\d+)\]', line)
    if not search_result:
        num = 0
    else:
        try:
            num = int(search_result[0])
        except ValueError:
            num = 0

    return num

# ---

# Read all the lines into a list
with open("info.txt") as f:
    lines = f.readlines()

# Sort them using the number function above, and print them
lines = sorted(lines, key=extract_number)
print(''.join(lines))

It's more resilient to lines without numbers, and it's easier to adjust if the numbers might appear in different places (e.g. after spaces at the start of the line).
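
As a sketch of that adjustability, the variant below also accepts whitespace before the opening bracket. The names NUMBER_AT_START and extract_number_lenient and the sample lines are hypothetical; lines without a number still get the key 0, so they sort to the front.

```python
import re

# \s* allows optional whitespace before the opening bracket.
NUMBER_AT_START = re.compile(r'^\s*\[(\d+)\]')

def extract_number_lenient(line):
    """Return the leading [number] of a line, or 0 if there is none."""
    match = NUMBER_AT_START.match(line)
    return int(match.group(1)) if match else 0

samples = [
    "   [42] indented line\n",
    "no number here\n",
    "[7] plain line\n",
]

ordered = sorted(samples, key=extract_number_lenient)
print(''.join(ordered), end='')
```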

(Obligatory suggestion not to use file as a variable name: in Python 2 it is already the name of a builtin, which is confusing.)


Now that there's an extract_number() function, it's easier to filter:

lines2 = [L for L in lines if 37273700 < extract_number(L) < 38000000]
print(''.join(lines2))
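
Since the question mentions a large file, one possible variation is to sort lines into separate lists while iterating over the file, rather than calling readlines() first. This is only a sketch: io.StringIO stands in for open("info.txt") so it runs on its own, the third sample line and the names in_range and others are made up, and extract_number is redefined in a compact form to keep the block self-contained.

```python
import io
import re

def extract_number(line):
    """Return the leading [number] of a line, or 0 if there is none."""
    match = re.match(r'^\[(\d+)\]', line)
    return int(match.group(1)) if match else 0

# Stand-in for the real file; replace with open("info.txt") in practice.
fake_file = io.StringIO(
    "[83847273747] Hey my name is James!\n"
    "[37273738] Hello world!\n"
    "[37999999] Another message\n"
)

in_range, others = [], []
for line in fake_file:  # a file object yields one line at a time
    if 37273700 < extract_number(line) < 38000000:
        in_range.append(line)
    else:
        others.append(line)

# Sort only the bucket of interest.
in_range.sort(key=extract_number)
print(''.join(in_range), end='')
```

Both bounds are exclusive here, matching the chained comparison above; use <= if the endpoints should be included.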
