Extracting data between two lines from a text file

Question

Suppose I have hundreds of text files like this example:

NAME
John Doe

DATE OF BIRTH

1992-02-16

BIO 

THIS is
 a PRETTY
 long sentence

 without ANY structure 

HOBBIES 
//..etc..

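NAME, DATE OF BIRTH, BIO and HOBBIES (and others) are always present, but the text content and the number of lines between them sometimes vary.

I want to iterate over the files and store the string found between each pair of keys. For example, a variable called Name should contain the value stored between "NAME" and "DATE OF BIRTH".

This is what I came up with: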

lines = f.readlines()
for line_number, line in enumerate(lines):
    if "NAME" in line:     
        name = lines[line_number + 1]  # In all files, Name is one line long.
    elif "DATE OF BIRTH" in line:
        date = lines[line_number + 2] # Date is also always two lines after
    elif "BIO" in line:
        for x in range(line_number + 1, line_number + 20): # Length of other data can be randomly bigger
            if "HOBBIES" not in lines[x]:
                bio += lines[x]
            else:
                break
    elif "HOBBIES" in line:
        #...

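This works fine, but I feel there must be a smarter, simpler way to do it than using many nested loops.

I'm looking for a general solution in which NAME stores everything up to DATE OF BIRTH, BIO stores everything up to HOBBIES, and so on. The plan is to clean up and strip the extra whitespace later.

Is it possible?

Edit: Reading the answers, I realized I forgot a very important detail: the keys sometimes repeat (in the same order).

That is, one text file can contain several people. A list of people should be created. The NAME key marks the start of a new person.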

Answer 1

I store everything in a dictionary; see the code below.

dict_text = {"NAME": [], "DATEOFBIRTH": [], "BIO": []}
location = None  # key whose section we are currently reading
with open("test.txt") as f:
    for line in f:
        if "NAME" in line or "DATE OF BIRTH" in line or "BIO" in line:
            # Header line: drop all whitespace so it matches a dictionary key
            location = "".join(line.split())
        elif location is not None:
            dict_text[location].append(line.replace("\n", ""))

Answer 2

You can use a regular expression:

import re

keys = """
NAME
DATE OF BIRTH
BIO 
HOBBIES 
""".strip().splitlines()

# Escape each key so it is matched literally
key_pattern = '|'.join(re.escape(key.strip()) for key in keys)
pattern = re.compile(fr'^({key_pattern})', re.M)

# uncomment to see the pattern
# print(pattern)

with open(filename) as f:  # filename: path to one of the text files
    text = f.read()
parts = pattern.split(text)

# ... process parts ...

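parts will be a list of strings. The odd indices (parts[1], parts[3], ...) hold the keys ('NAME' and so on), and the even indices from 2 onward (parts[2], parts[4], ...) hold the text between the keys. parts[0] is whatever comes before the first key.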
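Since the question's edit says the keys can repeat for several people, here is one possible way to process parts. It is a minimal sketch, not part of the original answer: the helper name parse_people, the whole-line key match and the whitespace collapsing are assumptions made for illustration.

```python
import re

KEYS = ["NAME", "DATE OF BIRTH", "BIO", "HOBBIES"]
# Assumption: a key only counts when it fills a whole line (trailing blanks allowed).
PATTERN = re.compile(
    r"^(" + "|".join(re.escape(k) for k in KEYS) + r")[ \t]*$", re.M
)

def parse_people(text):
    """Group the alternating key/text pieces from split() into one dict per person."""
    parts = PATTERN.split(text)
    people = []
    # parts[0] is whatever precedes the first key, so the pairs start at index 1.
    for key, value in zip(parts[1::2], parts[2::2]):
        if key == "NAME" or not people:
            people.append({})  # NAME marks the start of a new person
        # Collapse newlines and runs of spaces into single spaces.
        people[-1][key] = " ".join(value.split())
    return people

sample = """NAME
John Doe

DATE OF BIRTH

1992-02-16

BIO 

THIS is
 a PRETTY
 long sentence

 without ANY structure 

HOBBIES 
chess
NAME
Jane Roe
DATE OF BIRTH
1990-01-01
BIO
Short bio
HOBBIES
running
"""

people = parse_people(sample)
print(people)
```

Anchoring the pattern to the end of the line keeps a value that merely starts with a key word from being treated as a header.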

Answer 3

You can try the following approach.

keys = ["NAME","DATE OF BIRTH","BIO","HOBBIES"]

result = {}
with open("data.txt", "r") as f:
    for line in f:
        line = line.strip('\n')
        if any(v in line for v in keys):
            # Header line: later lines are stored under it (trailing spaces kept)
            last_key = line
        else:
            result[last_key] = result.get(last_key, "") + line

print(result)

Output

{'NAME': 'John Doe', 'DATE OF BIRTH': '1992-02-16', 'BIO ': 'THIS is a PRETTY long sentence without ANY structure ', 'HOBBIES ': '//..etc..'}
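Note that the keys in this output keep their trailing spaces ('BIO ', 'HOBBIES ') and that any line merely containing a key word is treated as a header. A possible variant that matches keys exactly after stripping is sketched below; the function name parse_lines is hypothetical, and the sketch assumes a single person per file.

```python
KEYS = {"NAME", "DATE OF BIRTH", "BIO", "HOBBIES"}

def parse_lines(lines):
    """Collect the non-blank lines under each key, matching keys exactly."""
    result = {}
    current = None  # no key seen yet; lines before the first key are ignored
    for raw in lines:
        line = raw.strip()
        if line in KEYS:
            current = line
            result.setdefault(current, [])
        elif current is not None and line:
            result[current].append(line)
    # Join each key's lines with single spaces.
    return {key: " ".join(chunks) for key, chunks in result.items()}

sample = [
    "NAME\n", "John Doe\n", "\n",
    "DATE OF BIRTH\n", "\n", "1992-02-16\n", "\n",
    "BIO \n", "\n", "THIS is\n", " a PRETTY\n", " long sentence\n", "\n",
    " without ANY structure \n", "\n",
    "HOBBIES \n", "//..etc..\n",
]

parsed = parse_lines(sample)
print(parsed)
```

Passing an open file object instead of the list works the same way, since both yield one line at a time.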

Answer 4

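Instead of reading lines, you can turn the file into one long string. Use string.index() to find the start index of each trigger word, then assign everything from that index up to the index of the next trigger word to a variable.

Something like: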

def get_data(string, previous_end_index, current_start_index):
    # Return the text between the end of one key and the start of the next
    usable_data = string[previous_end_index:current_start_index]
    return usable_data

string = f.read()  # f is an open file object; str(f) would not return its contents
important_words = ['NAME', 'DATE OF BIRTH']
last_phrase = None
for phrase in important_words:
    phrase_start = string.index(phrase)
    phrase_end = phrase_start + len(phrase)
    if last_phrase is not None:
        usable_data = get_data(string, last_phrase, phrase_start)
    last_phrase = phrase_end

It should probably use better/shorter variable names.
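The index-based idea can be extended to cover every key, including the text after the last one. The following is a rough sketch under the assumption that each key appears exactly once and in the given order; slice_between is a made-up name, not something from the answer.

```python
def slice_between(text, keys):
    """Return {key: text between that key and the next one}, for keys in order."""
    positions = []
    search_from = 0
    for key in keys:
        # index() raises ValueError if a key is missing; searching from the end
        # of the previous key enforces the expected order.
        start = text.index(key, search_from)
        end = start + len(key)
        positions.append((key, start, end))
        search_from = end

    result = {}
    for i, (key, _, end) in enumerate(positions):
        # Stop at the next key's start, or at the end of the text for the last key.
        stop = positions[i + 1][1] if i + 1 < len(positions) else len(text)
        result[key] = text[end:stop].strip()
    return result

text = "NAME\nJohn Doe\n\nDATE OF BIRTH\n\n1992-02-16\n\nBIO \n\nShort bio\n"
sections = slice_between(text, ["NAME", "DATE OF BIRTH", "BIO"])
print(sections)
```

Because it searches for each key only once, this does not handle files in which the keys repeat for several people.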

Answer 5

You can read the text as one long string and then use .split(). This only works if the categories appear in order and do not repeat. Like this:

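Categories = ["NAME", "DATE OF BIRTH", "BIO"]  # in the order they appear in the text
Output = {}
Text = f.read()  # f is an open file object; str(f) would not return its contents
for i in range(1, len(Categories)):
    # Split once at the next category: the left part belongs to the previous one
    SplitText = Text.split(Categories[i], 1)
    Output.update({Categories[i-1]: SplitText[0]})
    Text = SplitText[1]
Output.update({Categories[-1]: Text})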

5 answers

Answer 1: 2, accepted, 2021-04-17 02:21:34
Answer 2: 1, 2021-04-17 02:24:32
Answer 3: 1, 2021-04-17 02:25:39
Answer 4: 1, 2021-04-17 02:27:50
Answer 5: 1, 2021-04-17 02:34:47
