
Python 2: Using regex to pull out whole lines from text file with substring from another

I have a noob question. I am using Python 2.7.6 on a Linux system.

What I am trying to achieve is to use specific numbers in a list, which correspond to the last number in a database text file, to pull out the whole line in the database text file and print it (going to write the line to another text file later).

Code I am currently trying to use:

reg = re.compile(r'(\d+)$')

for line in "text file database":
    if list_line in reg.findall(line):
        print line

What I have found is that I can input a string like

list_line = "9"

and it will output the whole line of the corresponding database entry just fine. But trying to use the list_line to input strings one by one in a loop doesn't work.

Can anyone please help me out or direct me to a relevant source?

Appendix:

The database text file contains data similar to this:

gnl Acep_1.0 ACEP10001-PA 1
gnl Acep_1.0 ACEP10002-PA 2
gnl Acep_1.0 ACEP10003-PA 3
gnl Acep_1.0 ACEP10004-PA 4
gnl Acep_1.0 ACEP10005-PA 5
gnl Acep_1.0 ACEP10006-PA 7
gnl Acep_1.0 ACEP10007-PA 6
gnl Acep_1.0 ACEP10008-PA 8
gnl Acep_1.0 ACEP10009-PA 9
gnl Acep_1.0 ACEP10010-PA 10

The search text file list_line looks similar to this:

2
5
4
6

Updated original code:

    #import extensions
    import linecache

    import re

    #set re.compiler parameters
    reg = re.compile(r'(\d+)$')

    #Designate and open list file
    in_list = raw_input("list input: ")

    open_list = open(in_list, "r")

    #Count lines in list file
    total_lines = sum(1 for line in open_list)

    print total_lines

    #Open out file in write mode
    outfile = raw_input("output: ")

    open_outfile = open(outfile, "w")

    #Designate db string
    db = raw_input("db input: ")

    open_db = open(db, "r")

    read_db = open_db.read() 

    split_db = read_db.splitlines()

    print split_db      

    #Set line_number value to 0
    line_number = 0

    #Count through line numbers and print line
    while line_number < total_lines:
        line_number = line_number + 1
        print line_number

        list_line = linecache.getline(in_list, line_number)
        print list_line

        for line in split_db:
            if list_line in reg.findall(line) :
                print line 

    #close files
    open_list.close()

    open_outfile.close()

    open_db.close() 

Short version: your for loop goes through the "database" file once, looking for the corresponding text, and then stops. So if you have multiple lines you want to pull out, like in your list_line file, you'll only end up pulling out a single line.

Also, the way you're looking for the line number isn't a great idea. What happens if you're looking for line 5, but the second line just happens to have the digit 5 somewhere in its data? E.g., if the second line looks like:

gnl Acep_1.0 ACEP15202-PA 2

Then a plain substring search for "5" will return that line instead of the one you intended. Instead, since you know the line number is going to be the last number on the line, you should take advantage of Python's str.split() function (which splits a string on whitespace and returns a list of the pieces) and the fact that you can use -1 as a list index to get the last item of a list, like so:

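def get_one_line(line_number_string):
    with open("database_file.txt", "r") as datafile:  # Open file for reading
        for line in datafile:  # This is how you get one line at a time in Python
            items = line.rstrip().split()
            # Skip blank lines; compare only the last field on the line
            if items and items[-1] == line_number_string:
                return line.rstrip()  # Drop the newline so print doesn't double it
    return None  # No matching line was found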
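
To see why comparing the last field is safer than a substring search, here is a small sketch using the example line from above. It is written to run on both Python 2 and Python 3, and the variable names are only for illustration:

```python
# Example line whose ID happens to contain the digit 5
line = "gnl Acep_1.0 ACEP15202-PA 2\n"

# split() with no arguments splits on any run of whitespace,
# so the trailing newline does not end up in the last item
items = line.split()
last_field = items[-1]  # -1 indexes the last item of the list

substring_hit = "5" in line        # True: "5" appears inside the ID
field_hit = last_field == "5"      # False: the last field is "2"

print(items)
print(substring_hit)
print(field_hit)
```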

One thing I haven't talked about is the rstrip() function. When you iterate over a file in Python, you get each line as-is, with its newline character still intact. When you print it later, you'll probably be using print -- but print also adds a newline character at the end of what you give it. So unless you use rstrip() you'll end up with two newline characters instead of one, resulting in an extra blank line between every line of your output.
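
A quick sketch of that newline behaviour, using in-memory strings instead of real files. The trailing newline is also probably why the loop in the question never matched: a string read from the list file is "2\n", and that is not equal to the "2" that the regex finds. The sample values are taken from the question; the variable names raw_match and stripped_match are made up for this example:

```python
import re

reg = re.compile(r'(\d+)$')

list_line = "2\n"  # what reading one line from a file gives you
db_line = "gnl Acep_1.0 ACEP10002-PA 2"

# findall() returns ['2'], so the unstripped "2\n" is not in the list
raw_match = list_line in reg.findall(db_line)

# after rstrip() the string is "2", which is in the list
stripped_match = list_line.rstrip() in reg.findall(db_line)

print(raw_match)
print(stripped_match)
```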

The other thing you're probably not familiar with there is the with statement. Without going into too much detail, that ensures that your database file will be closed when the return line statement is executed. The details of how with works are interesting reading for someone who knows a lot about Python, but as a Python newbie you probably won't want to dive into that just yet. Just remember that when you open a file, try to use with open("filename") as some_variable: and Python will Do The Right Thing™.
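
If you want to convince yourself that with really does close the file on an early return, here is a minimal sketch. It writes a throwaway temporary file so that it is self-contained; the first_line() helper and the seen list are hypothetical names used only for this demonstration:

```python
import os
import tempfile

# Create a small temporary file to read from
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write("gnl Acep_1.0 ACEP10001-PA 1\n")

seen = []  # remembers the file object so we can inspect it afterwards

def first_line(filename):
    with open(filename, "r") as datafile:
        seen.append(datafile)
        for line in datafile:
            return line.rstrip()  # return from inside the with block

result = first_line(path)
print(result)
print(seen[0].closed)  # the file object reports whether it has been closed

os.remove(path)  # clean up the temporary file
```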

Okay. So now that you have that get_one_line() function, you can use it like this:

with open("list_line.txt", "r") as line_number_file:
    for line in line_number_file:
        line_number_string = line.rstrip() # Don't want the newline character
        database_line = get_one_line(line_number_string)
        print database_line # Or do whatever you need to with it

NOTE: If you're using Python 3, replace print database_line with print(database_line): in Python 3, the print statement became a function.

There's more that you could do with this code (for example, opening the database file every single time you look for a line is kind of inefficient -- reading the whole thing into memory once and then looking for your data afterwards would be better). But this is good enough to get started with, and if your database file is small, the time you'd lose worrying about efficiency would be far more than the time you'd lose just doing it the simple-but-slower way.
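
For reference, the read-it-once approach could look roughly like the sketch below. It assumes the last field of each database line is unique, and it uses in-memory lists copied from the question's sample data in place of the real files; index_by_last_field() and select_lines() are made-up helper names:

```python
def index_by_last_field(db_lines):
    """Map the last whitespace-separated field of each line to the full line."""
    index = {}
    for line in db_lines:
        items = line.split()
        if items:  # skip blank lines
            index[items[-1]] = line.rstrip()
    return index

def select_lines(index, wanted):
    """Return the database lines for the wanted numbers, in the order requested."""
    keys = (w.strip() for w in wanted)
    return [index[key] for key in keys if key in index]

# Stand-ins for the database file and the list file
db_lines = [
    "gnl Acep_1.0 ACEP10002-PA 2\n",
    "gnl Acep_1.0 ACEP10004-PA 4\n",
    "gnl Acep_1.0 ACEP10005-PA 5\n",
    "gnl Acep_1.0 ACEP10006-PA 7\n",
    "gnl Acep_1.0 ACEP10007-PA 6\n",
]
wanted = ["2\n", "5\n", "4\n", "6\n"]

index = index_by_last_field(db_lines)
selected = select_lines(index, wanted)
for line in selected:
    print(line)
```

Because an open file yields its lines when you iterate over it, you should be able to pass the file objects from two with open(...) blocks in place of the lists.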

So see if this helps you, then come back and ask more questions if there's something you don't understand or that isn't working.

You can build your regex pattern from the content of the list_line file:

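import re

with open('list_line.txt') as list_line:
    # split() drops blank lines and the trailing newline,
    # so no empty alternative sneaks into the pattern
    numbers = list_line.read().split()

# \b keeps "2" from matching the end of "12"; $ anchors the match to the end of the line
regex = re.compile(r'\b(?:' + '|'.join(re.escape(n) for n in numbers) + r')$')

print('pattern = ' + regex.pattern)

with open('database.txt') as database:
    for line in database:
        if regex.search(line):
            print(line.rstrip())  # strip the newline to avoid blank lines in the output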
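
A caveat when building an alternation straight from file contents: if the text ends with a newline, replacing every '\n' with '|' leaves an empty alternative at the end, and an empty alternative followed by $ matches every line. The sketch below compares that naive pattern with one joined from split() values and guarded by \b. The sample strings come from the question, and the names naive and safer are only for illustration:

```python
import re

list_text = "2\n5\n4\n6\n"  # list file contents, ending with a newline

# Naive pattern: becomes (2|5|4|6|)$ with an empty last alternative
naive = re.compile('(' + list_text.replace('\n', '|') + ')$')

# Safer pattern: no empty alternative, and \b stops "2" matching the end of "12"
safer = re.compile(r'\b(?:' + '|'.join(re.escape(n) for n in list_text.split()) + r')$')

wanted_line = "gnl Acep_1.0 ACEP10002-PA 2"
unwanted_line = "gnl Acep_1.0 ACEP10009-PA 9"

naive_hit = bool(naive.search(unwanted_line))
safer_hit = bool(safer.search(unwanted_line))

print(naive.pattern)
print(naive_hit)
print(safer_hit)
print(bool(safer.search(wanted_line)))
```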
