How do I count the number of lines separating two occurrences of the same keyword in a text file with Python?

Question

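I have a Python scraping script that fetches information about some upcoming concerts. However many concerts show up, the text always follows the same pattern. The only difference is that sometimes there is an extra line with the ticket price when the concert can be booked, as in the following example: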

LIVE 01/01/99 9PM
Iron Maiden
Madison Square Garden 
New York City
LIVE 01/01/99 9.30PM
The Doors
Staples Center
Los Angeles
LIVE 01/02/99 8.45PM
Dr Dre & Snoop Dogg
Staples Center
Los Angeles
Book a ticket now for $99,99
LIVE 01/02/99 9PM
Diana Ross
City Hall
New York City 
Book a ticket now for $79,99
etc.

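I need to count the number of lines in each block of text and check whether it is 4 or 5 lines long. My idea is to find each occurrence of the first word of a block ("LIVE") and then add an if statement that sorts the blocks into two categories (4-line blocks and 5-line blocks).

The if statement part is not hard, but I have no idea how to do the first part. Maybe use readlines and, whenever a line contains the keyword "LIVE", record the line position (for the sample data above that would be line 1, line 5, line 9 and line 14, which clearly shows that the first two blocks are 4 lines long and the third is 5 lines long), and then sort them out with the if statement.

Any help would be greatly appreciated, thanks!

Edit: I have added my idea for the code, which I hope makes this clearer. What I need is the code that produces the variables line_number and gap_each_line: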

with open('concerts_list.txt', 'r') as file:          
    reading_file = file.read()
    lines = reading_file.split('\n')
    for "LIVE" in lines:
        line_number = # the part where I'm stuck: get each line number where the word "LIVE" appears. Desired output: [0, 4, 8, 13]
        gap_each_line = # calculate the gap between consecutive numbers in line_number. Desired output: [4, 4, 5]
    if gap == 4 for gap in gap_each_line:
        dates = [i for i in lines [0::4]]
    elif gap == 5 for gap in gap_each_line:
        dates = [i for i in lines [0::5]]

Answer 1

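You can use read_csv from the pandas module.

I expect that your whole problem (finding the dates and so on) can be solved with pandas.

The code below finds the line difference between the lines that start with "LIVE":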

import pandas as pd
df = pd.read_csv('/Users/prince/Downloads/test3.csv', sep='~~~', header=None, engine='python')
df.columns = ['Details']
df['si_no'] = df['Details'].str.startswith('LIVE').cumsum()
gaps = df.groupby('si_no').apply(lambda x : len(x)).values
print(gaps)

It prints:

[4 4 5 5]
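
Note that this result has four entries because the last block is counted too, whereas differences between "LIVE" line positions give only three gaps. If pandas is not available, the same running-count idea can be sketched with the standard library alone. This is only an illustration: the inline `sample` list stands in for the file contents, and the names `block_ids` and `block_sizes` are made up for this example.

```python
from collections import Counter
from itertools import accumulate

# Inline stand-in for the file contents (assumption: taken from the question's sample).
sample = [
    "LIVE 01/01/99 9PM", "Iron Maiden", "Madison Square Garden", "New York City",
    "LIVE 01/01/99 9.30PM", "The Doors", "Staples Center", "Los Angeles",
    "LIVE 01/02/99 8.45PM", "Dr Dre & Snoop Dogg", "Staples Center", "Los Angeles",
    "Book a ticket now for $99,99",
    "LIVE 01/02/99 9PM", "Diana Ross", "City Hall", "New York City",
    "Book a ticket now for $79,99",
]

# Running count of "LIVE" lines seen so far: a plain-Python analogue of cumsum().
block_ids = list(accumulate(int(line.startswith("LIVE")) for line in sample))

# Number of lines sharing each block id: an analogue of grouping and counting.
block_sizes = [size for _, size in sorted(Counter(block_ids).items())]

print(block_sizes)  # [4, 4, 5, 5]
```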

Answer 2

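I know you have already been given the output you wanted (Prince Francis's answer), but I feel you are trying to solve this the hard way.

Please take a look at this: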

from collections import defaultdict # defaultdict lets you create a dictionary that is already set up to contain a list for every key

concerts = defaultdict(list)
current_dictKey = None # Starts "unset"
with open('/tmp/concerts_list.txt', 'r') as file:
    reading_file = file.read()
    lines = reading_file.split('\n')
    for line in lines:
        print('I just read the following:', line)
        if line.startswith('LIVE'):
            print("The current line starts with keyword 'LIVE', so this will be the dictionary's new key")
            current_dictKey = line
            continue # Continue to next line without doing anything else

        if line.startswith('Book a ticket'):
            print("This line starts with 'book a ticket'. Let's skip those too.")
            continue # Skip those lines. I guess you don't want them either.

        concerts[current_dictKey].append(line) # Just add the line to the Key in defaultdict


print()
print('This is the object "concerts" you get as the result:')
print(concerts)
print()


print('You can access a specific value like this:', concerts['LIVE 01/01/99 9PM'])

Once the data is in a dictionary, you can access all of it very easily.
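
As a rough illustration of what that access looks like, here is a condensed variant of the loop above that runs on an inline sample instead of a file. The helper name `parse_concerts` and the assumption that every block holds exactly artist, venue and city (in that order) are mine, based only on the sample in the question.

```python
from collections import defaultdict

def parse_concerts(lines):
    """Group the lines of each block under its 'LIVE ...' header line."""
    concerts = defaultdict(list)
    current_key = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines, e.g. one left by a trailing newline
        if line.startswith("LIVE"):
            current_key = line  # the header line becomes the dictionary key
            continue
        if line.startswith("Book a ticket"):
            continue  # price lines are skipped, as in the answer above
        concerts[current_key].append(line)
    return dict(concerts)

# Inline stand-in for the file contents (assumption: the question's sample).
sample = [
    "LIVE 01/01/99 9PM", "Iron Maiden", "Madison Square Garden", "New York City",
    "LIVE 01/01/99 9.30PM", "The Doors", "Staples Center", "Los Angeles",
    "LIVE 01/02/99 8.45PM", "Dr Dre & Snoop Dogg", "Staples Center", "Los Angeles",
    "Book a ticket now for $99,99",
    "LIVE 01/02/99 9PM", "Diana Ross", "City Hall", "New York City",
    "Book a ticket now for $79,99",
]

concerts = parse_concerts(sample)
artist, venue, city = concerts["LIVE 01/02/99 8.45PM"]
print(artist, "|", venue, "|", city)
# Dr Dre & Snoop Dogg | Staples Center | Los Angeles
```

If two concerts share exactly the same date and time line, their entries end up merged under one key, so a list of blocks may be the safer structure in that case.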

Answer 3

(I am posting a new answer because I still think my first one suits most cases better.)

This produces the output you asked for:

live_lines = []
line_counter = 0
distances = []
with open('concerts_list.txt', 'r') as file:
    reading_file = file.read()
    lines = reading_file.split('\n')
    for line in lines:
        if line.startswith('LIVE'):
            live_lines.append(line_counter)

        line_counter += 1

for position in range(len(live_lines)-1):
    new_distance = live_lines[position+1] - live_lines[position]
    distances.append(new_distance)

print('live_lines:', live_lines)
print('distances', distances)

Output:

live_lines: [0, 4, 8, 13]
distances [4, 4, 5]
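
To go from these positions to the sorting step described in the question, one possible sketch is to slice the line list at each "LIVE" position and then group the blocks by length. The names `split_blocks` and `by_length` are hypothetical, and the inline sample replaces the file. Appending `len(lines)` as a final boundary is my addition, so that the last block is counted as well.

```python
def split_blocks(lines, keyword="LIVE"):
    """Return one list of lines per block, each starting at a keyword line."""
    starts = [i for i, line in enumerate(lines) if line.startswith(keyword)]
    boundaries = starts + [len(lines)]  # end marker so the last block is kept
    return [lines[a:b] for a, b in zip(boundaries, boundaries[1:])]

# Inline stand-in for the file contents (assumption: the question's sample).
sample = [
    "LIVE 01/01/99 9PM", "Iron Maiden", "Madison Square Garden", "New York City",
    "LIVE 01/01/99 9.30PM", "The Doors", "Staples Center", "Los Angeles",
    "LIVE 01/02/99 8.45PM", "Dr Dre & Snoop Dogg", "Staples Center", "Los Angeles",
    "Book a ticket now for $99,99",
    "LIVE 01/02/99 9PM", "Diana Ross", "City Hall", "New York City",
    "Book a ticket now for $79,99",
]

blocks = split_blocks(sample)

# Sort the blocks into categories keyed by their length (4 or 5 here).
by_length = {}
for block in blocks:
    by_length.setdefault(len(block), []).append(block[0])

print([len(block) for block in blocks])  # [4, 4, 5, 5]
print(by_length[5])  # ['LIVE 01/02/99 8.45PM', 'LIVE 01/02/99 9PM']
```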

Answer 4

Similar to what you wrote, but in a more Pythonic style.

with open('concerts_list.txt', 'rt') as file:
    indices = [index for index, line in enumerate(file) if line.startswith("LIVE")]
    block_lengths = [adjacent - current for current, adjacent in zip(indices, indices[1:])]

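If your file is really large, you can use a generator expression, itertools.tee and itertools.islice to lazily load only the data you need in memory for the calculation. Compared with the first example here, you then process the data as a stream through iterator objects instead of in-memory lists.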

import itertools

with open('concerts_list.txt', 'rt') as file:
    # generator expression: indices are produced one at a time
    indices = (index for index, line in enumerate(file) if line.startswith("LIVE"))
    # itertools.tee makes independent copies of an iterator
    indices_1, indices_2 = itertools.tee(indices)
    # itertools.islice makes a new iterator that skips the first element
    block_lengths = [adjacent - current for current, adjacent in
                     zip(indices_1, itertools.islice(indices_2, 1, None))]
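
On Python 3.10 or later, `itertools.pairwise` yields the same consecutive pairs as the `tee`/`islice` combination. The sketch below assumes that version and falls back to a `tee`-based helper on older interpreters. The function name `block_lengths_lazy` and the inline sample are illustrative only.

```python
import itertools

try:
    from itertools import pairwise  # available from Python 3.10
except ImportError:
    def pairwise(iterable):
        """Fallback: yield (a, b), (b, c), ... using tee."""
        first, second = itertools.tee(iterable)
        next(second, None)  # advance the second copy by one element
        return zip(first, second)

def block_lengths_lazy(lines, keyword="LIVE"):
    """Yield the gap between consecutive keyword lines, one at a time."""
    indices = (i for i, line in enumerate(lines) if line.startswith(keyword))
    for current, adjacent in pairwise(indices):
        yield adjacent - current

# Inline stand-in for the file contents (assumption: the question's sample).
sample = [
    "LIVE 01/01/99 9PM", "Iron Maiden", "Madison Square Garden", "New York City",
    "LIVE 01/01/99 9.30PM", "The Doors", "Staples Center", "Los Angeles",
    "LIVE 01/02/99 8.45PM", "Dr Dre & Snoop Dogg", "Staples Center", "Los Angeles",
    "Book a ticket now for $99,99",
    "LIVE 01/02/99 9PM", "Diana Ross", "City Hall", "New York City",
    "Book a ticket now for $79,99",
]

print(list(block_lengths_lazy(sample)))  # [4, 4, 5]
```

Because `lines` can be any iterable of strings, an open file object can be passed in directly instead of a list.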

4 Solutions

Solution 1
0 2019-12-16 18:36:44

Solution 2
0 2019-12-16 18:42:41

Solution 3
0 2019-12-16 19:00:43

Solution 4
0 2019-12-16 20:12:59
