Python 2: Using a regex to pull whole lines out of a text file, with the substring taken from another file

Question

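I have a newbie question. I am using Python 2.7.6 on a Linux system.

What I am trying to achieve is to take specific numbers from a list, each of which corresponds to the last number on a line of a database text file, pull the whole matching line out of the database text file, and print it (later I will write the line to another text file).

The code I am currently trying to use: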

reg = re.compile(r'(\d+)$')

for line in "text file database":
    if list_line in reg.findall(line):
        print line

I have found that I can enter something like

list_line = "9"

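and it will output the whole line of the corresponding database entry. But trying to feed in the strings one after another in a loop, using list_line, does not work.

Can anyone help me or point me to a relevant source?

Appendix:

The database text file contains data similar to the following: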

gnl Acep_1.0 ACEP10001-PA 1
gnl Acep_1.0 ACEP10002-PA 2
gnl Acep_1.0 ACEP10003-PA 3
gnl Acep_1.0 ACEP10004-PA 4
gnl Acep_1.0 ACEP10005-PA 5
gnl Acep_1.0 ACEP10006-PA 7
gnl Acep_1.0 ACEP10007-PA 6
gnl Acep_1.0 ACEP10008-PA 8
gnl Acep_1.0 ACEP10009-PA 9
gnl Acep_1.0 ACEP10010-PA 10

The search text file list_line looks similar to the following:

Updated original code:

    #import extensions
    import linecache

    import re

    #set re.compiler parameters
    reg = re.compile(r'(\d+)$')

    #Designate and open list file
    in_list = raw_input("list input: ")

    open_list = open(in_list, "r")

    #Count lines in list file
    total_lines = sum(1 for line in open_list)

    print total_lines

    #Open out file in write mode
    outfile = raw_input("output: ")

    open_outfile = open(outfile, "w")

    #Designate db string
    db = raw_input("db input: ")

    open_db = open(db, "r")

    read_db = open_db.read() 

    split_db = read_db.splitlines()

    print split_db      

    #Set line_number value to 0
    line_number = 0

    #Count through line numbers and print line
    while line_number < total_lines:
        line_number = line_number + 1
        print line_number

        list_line = linecache.getline(in_list, line_number)
        print list_line

        for line in split_db:
            if list_line in reg.findall(line) :
                print line 

    #close files
    open_list.close()

    open_outfile.close()

    open_db.close()

Answer 1

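Short version: your for loop goes through the "database" file once, looks for the corresponding text, and stops. So if you have several lines you want to pull out, as in your list_line file, you will end up pulling out only one line.

Also, the way you are looking for line numbers is not a good idea. What happens if you are looking for line 5, but the second line happens to have the digit 5 somewhere in its data? For example, if the second line looks like: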

gnl Acep_1.0 ACEP15202-PA 2

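then a plain substring search for "5" would return that line instead of the one you wanted. Instead, since you know the line number will be the last number on the line, you should take advantage of Python's str.split() function (which splits a string on whitespace and returns a list of the items) and of the fact that you can use -1 as a list index to get the last item of a list, like so: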

def get_one_line(line_number_string):
    with open("database_file.txt", "r") as datafile: # Open file for reading
        for line in datafile:  # This is how you get one line at a time in Python
            items = line.rstrip().split()
            if items[-1] == line_number_string:
                return line

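One thing I have not talked about is the rstrip() function. When you iterate over a file in Python, you get each line as-is, with its newline character still attached. When you print it later, you will probably use print, but print also adds a newline at the end of what it outputs. So unless you use rstrip(), you end up with two newlines instead of one, which puts an extra blank line between every line of output.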
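The retained newline also affects string comparisons, not only printing. The following is a minimal sketch using in-memory strings in place of real files (the sample values are made up for illustration). It applies the regular expression from the question to one database line and shows that a list entry only matches the captured number once its newline has been stripped.

```python
import re

reg = re.compile(r'(\d+)$')

db_line = "gnl Acep_1.0 ACEP10009-PA 9\n"   # a database line as read from a file
list_line = "9\n"                            # a list entry as read from a file

# '$' also matches just before a final newline, so the last number is captured
captured = reg.findall(db_line)

raw_match = list_line in captured                 # compares "9\n" with "9"
stripped_match = list_line.rstrip() in captured   # compares "9" with "9"

print(captured)
print(raw_match)
print(stripped_match)
```

Only the stripped comparison succeeds, which is one reason to call rstrip() on every line read from the list file before using it.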

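Another thing you may not be familiar with is the with statement. Without going into too much detail, it ensures that the database file is closed when the return line statement executes. The details of how with works are interesting reading for someone who knows a lot about Python, but as a Python newbie you probably do not want to dive into that just yet. Just remember that when you open a file, try to use with open("filename") as some_variable: and Python will do the right thing.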
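As a rough illustration of that behaviour, the sketch below writes a one-line temporary file and then returns from inside a with block. The helper name first_line and the temporary file are assumptions made for this example only; returning the file object is done purely so that its closed attribute can be inspected afterwards.

```python
import os
import tempfile


def first_line(path):
    # Hypothetical helper: return the first line plus the file object,
    # so the caller can check whether the file was closed.
    with open(path, "r") as handle:
        for line in handle:
            return line.rstrip(), handle   # leaving the with block closes the file
    return None, handle                    # empty file: still closed by the with block


# Create a small temporary file to read from
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write("gnl Acep_1.0 ACEP10001-PA 1\n")

line, handle = first_line(path)
print(line)
print(handle.closed)

os.remove(path)
```

The file is closed even though the function returned from the middle of the loop, with no explicit close() call.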

OK. So now that you have that get_one_line() function, you can use it like this:

with open("list_line.txt", "r") as line_number_file:
    for line in line_number_file:
        line_number_string = line.rstrip() # Don't want the newline character
        database_line = get_one_line(line_number_string)
        print database_line # Or do whatever you need to with it

Note: if you are using Python 3, replace print line with print(line): in Python 3, the print statement became a function.

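There is more you could do with this code (for example, opening the database file every time you look up a line is inefficient; reading the whole thing into memory once and then looking up your data would be better). But this is enough to get started, and if your database file is small, the time you would spend worrying about efficiency would far exceed the time you lose by doing it the simple but slow way.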
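For reference, the read-once idea could be sketched roughly as follows. This is not part of the code above: the helper name build_index is hypothetical, the sample data is held in lists rather than files, and the sketch assumes that the last field of every database line is unique.

```python
def build_index(database_lines):
    """Map the last whitespace-separated field of each line to the full line."""
    index = {}
    for line in database_lines:
        items = line.rstrip().split()
        if items:                          # skip blank lines
            index[items[-1]] = line.rstrip()
    return index


# Stand-ins for the lines of the database file and the list file
database_lines = [
    "gnl Acep_1.0 ACEP10005-PA 5\n",
    "gnl Acep_1.0 ACEP10006-PA 7\n",
    "gnl Acep_1.0 ACEP10009-PA 9\n",
]
wanted = ["9\n", "5\n", "42\n"]

index = build_index(database_lines)

# Look up each requested number; numbers with no database entry are skipped
found = [index[n.strip()] for n in wanted if n.strip() in index]
for line in found:
    print(line)
```

The database is scanned once to build the dictionary, and each lookup afterwards is a single dictionary access instead of another pass over the file.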

So see if this helps you, then come back and ask more questions if there is something you do not understand or that does not work.

Answer 2

You can build a regular expression pattern from the contents of the list_line file:

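import re

with open('list_line.txt') as list_line:
    # split() drops the trailing newline and blank lines, so the pattern
    # never ends in '|' (an empty alternative would match every line)
    numbers = list_line.read().split()
    pattern = '|'.join(re.escape(number) for number in numbers)
    # \b keeps "9" from matching the end of "19"
    regex = re.compile(r'\b(' + pattern + ')$')

print('pattern = ' + regex.pattern)

with open('database.txt') as database:
    for line in database:
        if regex.search(line):
            print(line.rstrip())  # strip the newline; print adds its own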
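Two details are worth checking when an alternation is built from file contents. The sketch below uses made-up in-memory data and assumes the list file ends with a newline. It compares a pattern built by replacing each newline with '|' against one built from the split entries with a leading word boundary; the names naive and safe are illustrative only.

```python
import re

file_contents = "5\n9\n"   # assumed contents of the list file, ending with a newline

# Replacing every newline leaves a trailing '|', i.e. an empty alternative
naive = re.compile('(' + file_contents.replace('\n', '|') + ')$')

# Joining the split entries avoids the empty alternative; \b stops "9" matching "19"
entries = file_contents.split()
safe = re.compile(r'\b(' + '|'.join(re.escape(n) for n in entries) + ')$')

lines = [
    "gnl Acep_1.0 ACEP10003-PA 3",
    "gnl Acep_1.0 ACEP10005-PA 5",
    "gnl Acep_1.0 ACEP10009-PA 9",
    "gnl Acep_1.0 ACEP10019-PA 19",
]

naive_hits = [ln for ln in lines if naive.search(ln)]
safe_hits = [ln for ln in lines if safe.search(ln)]

print(len(naive_hits))
print(safe_hits)
```

With this sample data the first pattern matches every line, because the empty alternative succeeds at the end of any string, while the second matches only the lines whose last field is exactly 5 or 9.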

Solution 1
1 Accepted 2015-10-03 02:57:22

Solution 2
0 2015-10-03 03:13:14
