Python 2: Using regex to pull whole lines out of a text file, with substrings taken from another file

Question

I have a noob question. I am using Python 2.7.6 on a Linux system.

What I am trying to achieve is to take specific numbers from a list, each of which corresponds to the last number on a line of a database text file, pull the whole matching line out of the database file, and print it (I will write the lines to another text file later).

Code I am currently trying to use:

reg = re.compile(r'(\d+)$')

for line in "text file database":
    if list_line in reg.findall(line):
        print line

What I have found is that I can input a string like

list_line = "9"

and it will output the whole line of the corresponding database entry just fine. But using list_line to feed in strings one by one in a loop doesn't work.

Can anyone please help me out or direct me to a relevant source?

Appendix:

The database text file contains data similar to this:

gnl Acep_1.0 ACEP10001-PA 1
gnl Acep_1.0 ACEP10002-PA 2
gnl Acep_1.0 ACEP10003-PA 3
gnl Acep_1.0 ACEP10004-PA 4
gnl Acep_1.0 ACEP10005-PA 5
gnl Acep_1.0 ACEP10006-PA 7
gnl Acep_1.0 ACEP10007-PA 6
gnl Acep_1.0 ACEP10008-PA 8
gnl Acep_1.0 ACEP10009-PA 9
gnl Acep_1.0 ACEP10010-PA 10

The search text file list_line looks similar to this:

Updated original code:

    #import extensions
    import linecache

    import re

    #set re.compiler parameters
    reg = re.compile(r'(\d+)$')

    #Designate and open list file
    in_list = raw_input("list input: ")

    open_list = open(in_list, "r")

    #Count lines in list file
    total_lines = sum(1 for line in open_list)

    print total_lines

    #Open out file in write mode
    outfile = raw_input("output: ")

    open_outfile = open(outfile, "w")

    #Designate db string
    db = raw_input("db input: ")

    open_db = open(db, "r")

    read_db = open_db.read() 

    split_db = read_db.splitlines()

    print split_db      

    #Set line_number value to 0
    line_number = 0

    #Count through line numbers and print line
    while line_number < total_lines:
        line_number = line_number + 1
        print line_number

        list_line = linecache.getline(in_list, line_number)
        print list_line

        for line in split_db:
            if list_line in reg.findall(line) :
                print line 

    #close files
    open_list.close()

    open_outfile.close()

    open_db.close()

Answer 1

Short version: your for loop goes through the "database" file once, looks for the corresponding text, and stops. So if you want to pull out multiple lines, as with your list_line file, you'll only end up pulling out a single line.

Also, the way you're looking for the line number isn't a great idea. What happens if you're looking for line 5, but the second line just happens to have the digit 5 somewhere in its data? E.g., if the second line looks like:

gnl Acep_1.0 ACEP15202-PA 2

Then searching the line for "5" will return that line instead of the one you intended. Instead, since you know the line number is going to be the last number on the line, you should take advantage of Python's str.split() function (which splits a string on whitespace and returns a list of the pieces) and the fact that you can use -1 as a list index to get the last item of a list, like so:

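def get_one_line(line_number_string):
    with open("database_file.txt", "r") as datafile:  # Open file for reading
        for line in datafile:  # Iterating over a file yields one line at a time
            items = line.rstrip().split()
            # Compare only the last whitespace-separated field; skip blank lines
            if items and items[-1] == line_number_string:
                return line.rstrip()  # Drop the newline so print doesn't double it
    return None  # No line ended with that number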
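To make the difference concrete, here is a small sketch that runs both tests on the example line above. The variable names are only illustrative, and the code should run unchanged on Python 2.7 and Python 3.

```python
line = "gnl Acep_1.0 ACEP15202-PA 2\n"

# A plain substring test finds the "5" inside the ID field
substring_hit = "5" in line

# Splitting on whitespace and taking index -1 looks only at the last field
items = line.rstrip().split()
last_field = items[-1]
field_hit = (last_field == "5")

print(items)
print(substring_hit)
print(field_hit)
```

The substring test reports a match for this line, while the last-field comparison does not, because the last field is "2".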

One thing I haven't talked about is the rstrip() function. When you iterate over a file in Python, you get each line as-is, with its newline characters still intact. When you print it later, you'll probably be using print, but print also prints a newline character at the end of what you give it. So unless you use rstrip() you'll end up with two newline characters instead of one, resulting in an extra blank line between every line of your output.
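The trailing newline is also a likely reason the comparison in the question's loop never succeeds. A line read from the list file (linecache.getline() keeps the terminating newline) looks like "9\n", while the regex captures only the digits. A minimal sketch, with a hard-coded string standing in for the line read from the list file:

```python
import re

reg = re.compile(r'(\d+)$')
db_line = "gnl Acep_1.0 ACEP10009-PA 9"

# Stand-in for a line read from the list file: the newline is still attached
list_line = "9\n"

matches = reg.findall(db_line)                    # the trailing digits only
found_raw = list_line in matches                  # compares "9\n" against "9"
found_stripped = list_line.rstrip() in matches    # compares "9" against "9"

print(matches)
print(found_raw)
print(found_stripped)
```

Stripping the list entry before comparing is what lets the membership test succeed.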

The other thing you're probably not familiar with there is the with statement. Without going into too much detail, it ensures that your database file will be closed when the return statement is executed. The details of how with works are interesting reading for someone who knows a lot about Python, but as a Python newbie you probably won't want to dive into that just yet. Just remember that when you open a file, try to use with open("filename") as some_variable: and Python will Do The Right Thing™.

Okay. So now that you have that get_one_line() function, you can use it like this:
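If you want to convince yourself that with really closes the file even when you return from inside the block, a small experiment like the following may help. The temporary file and the first_line() helper are made up for this illustration; they are not part of the question's code.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

def first_line(filename):
    with open(filename, "r") as datafile:
        for line in datafile:
            # Returning from inside the with block still closes the file
            return datafile, line.rstrip()

handle, text = first_line(path)
closed_after_return = handle.closed

print(text)
print(closed_after_return)
os.remove(path)
```

The file object's closed attribute is True after the function returns, even though close() was never called explicitly.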

with open("list_line.txt", "r") as line_number_file:
    for line in line_number_file:
        line_number_string = line.rstrip() # Don't want the newline character
        database_line = get_one_line(line_number_string)
        print database_line # Or do whatever you need to with it

NOTE: If you're using Python 3, replace print line with print(line): in Python 3, the print statement became a function.

There's more that you could do with this code. For example, opening the database file every single time you look for a line is kind of inefficient; reading the whole thing into memory once and then looking for your data afterwards would be better. But this is good enough to get started with, and if your database file is small, the time you'd lose worrying about efficiency would be far more than the time you'd lose just doing it the simple-but-slower way.

So see if this helps you, then come back and ask more questions if there's something you don't understand or that isn't working.
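For reference, the read-once approach could look roughly like the sketch below. It assumes the last field is unique across the database, and build_index() and look_up() are hypothetical helper names, not anything from the question. The sample list stands in for the database file.

```python
def build_index(database_lines):
    """Map the last whitespace-separated field of each line to that line."""
    index = {}
    for line in database_lines:
        items = line.split()
        if items:  # skip blank lines
            index[items[-1]] = line.rstrip()
    return index

def look_up(index, wanted):
    """Return the stored lines for the wanted numbers, in the order asked."""
    return [index[key] for key in wanted if key in index]

sample_db = [
    "gnl Acep_1.0 ACEP10001-PA 1\n",
    "gnl Acep_1.0 ACEP10002-PA 2\n",
    "gnl Acep_1.0 ACEP10009-PA 9\n",
]

index = build_index(sample_db)
result = look_up(index, ["9", "2", "42"])
for entry in result:
    print(entry)
```

With real files you would pass the open database file object to build_index(), since iterating over a file yields its lines, and strip each entry from the list file before looking it up.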

Answer 2

You can build your regex pattern from the content of the list_line file:

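import re

with open('list_line.txt') as list_line:
    # One number per line; skip blanks so the pattern gets no empty alternative
    numbers = [n.strip() for n in list_line if n.strip()]

# Require whitespace (or line start) before the number so "9" doesn't match "19"
regex = re.compile(r'(?:^|\s)(?:' + '|'.join(map(re.escape, numbers)) + r')$')

print('pattern = ' + regex.pattern)

with open('database.txt') as database:
    for line in database:
        line = line.rstrip('\n')  # avoid printing a doubled newline
        if regex.search(line):
            print(line)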
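One thing to watch for when joining alternatives like this: if the list file ends with a newline, a plain replace('\n', '|') leaves a trailing "|", which adds an empty alternative that matches at the end of every line. The sketch below compares that naive pattern with a stricter one on made-up sample data; the inline string stands in for the list file's contents.

```python
import re

lines = [
    "gnl Acep_1.0 ACEP10001-PA 1",
    "gnl Acep_1.0 ACEP10009-PA 9",
    "gnl Acep_1.0 ACEP10019-PA 19",
]
list_text = "9\n10\n"  # stand-in for the contents of the list file

# Naive pattern: becomes '(9|10|)$', whose empty alternative matches any line
naive = re.compile('(' + list_text.replace('\n', '|') + ')$')
naive_hits = [l for l in lines if naive.search(l)]

# Stricter pattern: no empty alternative, and the number must be a whole field
numbers = list_text.split()
strict = re.compile(
    r'(?:^|\s)(?:' + '|'.join(re.escape(n) for n in numbers) + r')$')
strict_hits = [l for l in lines if strict.search(l)]

print(len(naive_hits))
print(strict_hits)
```

The naive pattern selects all three sample lines, while the stricter one selects only the line whose last field is exactly 9.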

Answer 1: score 1, accepted, 2015-10-03 02:57:22

Answer 2: score 0, 2015-10-03 03:13:14
