
Python reading file and analysing lines with substring

In Python, I'm reading a large file with many lines. Each line contains a number in brackets followed by a string, such as:

[37273738] Hello world!
[83847273747] Hey my name is James!

And so on...

After I read the txt file and put it into a list, how can I extract the number from each line and then sort the whole lines based on that number?

file = open("info.txt","r")
myList = []

for line in file:
    line = line.split()
    myList.append(line)

What I would like to do:

Since the number in the first message falls between 37273700 and 38000000, I'd put that line (along with all other lines that satisfy that rule) into a separate list.

This does exactly what you need (for the sorting part):

my_sorted_list = sorted(myList, key=lambda line: int(line[0][1:-1]))
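As a quick check of that key, here is a small sketch on the two sample lines from the question, split on whitespace the same way as in the asker's loop. The names `raw` and `first_tokens` are only for illustration. Note that the slice has to be `[1:-1]` to strip exactly the two brackets; `[1:-2]` would also drop the last digit of the number.

```python
# Sample lines from the question, deliberately in the "wrong" order.
raw = ["[83847273747] Hey my name is James!", "[37273738] Hello world!"]

# Same shape as the asker's list: each entry is a list of tokens.
my_list = [text.split() for text in raw]

# line[0] is a token such as "[37273738]"; [1:-1] removes both brackets,
# and int() makes the comparison numeric.
my_sorted_list = sorted(my_list, key=lambda line: int(line[0][1:-1]))

first_tokens = [line[0] for line in my_sorted_list]
print(first_tokens)
```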

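Store each line as a tuple of (number, remaining words):

for line in file:
    line = line.split()
    # Strip the brackets and convert to int so the sort is numeric, not lexicographic
    keyval = (int(line[0].replace('[','').replace(']','')), line[1:])
    print(keyval)
    myList.append(keyval)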

Then sort on the first element of each tuple:

my_sorted_list = sorted(myList, key=lambda line: line[0])
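One thing to watch with this approach: if the first element of each tuple is kept as a string, the sort compares it character by character rather than by value, so the order only comes out right when the key is an `int`. A minimal sketch of the difference, using made-up keys:

```python
# Hypothetical keys, still as strings (brackets already stripped).
keys = ["83847273747", "9", "37273738"]

as_text = sorted(keys)              # lexicographic: "3..." < "8..." < "9"
as_numbers = sorted(keys, key=int)  # numeric: 9 < 37273738 < 83847273747

print(as_text)
print(as_numbers)
```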

How about:

# ---
# Function which gets a number from a line like so:
#  - searches for the pattern: start_of_line, [, sequence of digits
#  - if that's not found (e.g. empty line) return 0
#  - if it is found, try to convert it to a number type
#  - return the number, or 0 if that conversion fails

import re

def extract_number(line):
    # Raw string, so the backslashes reach the regex engine unchanged
    search_result = re.findall(r'^\[(\d+)\]', line)
    if not search_result:
        num = 0
    else:
        try:
            num = int(search_result[0])
        except ValueError:
            num = 0

    return num

# ---

# Read all the lines into a list
with open("info.txt") as f:
    lines = f.readlines()

# Sort them using the number function above, and print them
lines = sorted(lines, key=extract_number)
print(''.join(lines))

This is more resilient to lines without numbers, and easier to adjust if the numbers can appear in different places (e.g. after spaces at the start of the line).
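For instance, tolerating leading whitespace should only need a small change to the pattern. The sketch below is one possible variant, not part of the code above; the name `extract_number_lenient` and the added `\s*` are assumptions for illustration.

```python
import re

# Assumed variant of the pattern: \s* allows spaces or tabs before "[".
NUMBER_AT_START = re.compile(r'^\s*\[(\d+)\]')

def extract_number_lenient(line):
    """Return the bracketed number at the start of the line, or 0 if absent."""
    match = NUMBER_AT_START.match(line)
    return int(match.group(1)) if match else 0

print(extract_number_lenient("   [37273738] Hello world!"))
print(extract_number_lenient("no number here"))
```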

(Obligatory suggestion not to use file as a variable name: in Python 2 it is already the name of a builtin, which is confusing.)


Now that there's an extract_number() function, filtering is easy:

lines2 = [L for L in lines if 37273700 < extract_number(L) < 38000000]
print(''.join(lines2))
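Since the question mentions a large file, the same filter could also be written as a generator, so that only the matching lines are kept in memory. This is a rough sketch under that assumption; `lines_in_range` is a hypothetical helper name, and `extract_number` is repeated in compact form so the block runs on its own.

```python
import re

def extract_number(line):
    """Return the bracketed number at the start of the line, or 0 if absent."""
    found = re.findall(r'^\[(\d+)\]', line)
    return int(found[0]) if found else 0

def lines_in_range(lines, low, high):
    """Yield the lines whose leading number n satisfies low < n < high."""
    for line in lines:
        if low < extract_number(line) < high:
            yield line

# Any iterable of lines works here; a list stands in for the file.
sample = [
    "[83847273747] Hey my name is James!\n",
    "[37273738] Hello world!\n",
    "\n",
]
selected = sorted(lines_in_range(sample, 37273700, 38000000), key=extract_number)
print(''.join(selected))
```

An open file object can be passed in place of `sample`, because iterating over a file yields one line at a time.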
