Efficient way to check the expected semicolon positions in a length-delimited text file. Combining many "or" statements

Question

I am checking the positions of semicolons in text files. I have length-delimited text files with thousands of rows that look like this:

AB;2;43234;343;
CD;4;41234;443;
FE53234;543;
FE;5;53;34;543;

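I am using the following code to check that the semicolons are in the correct positions. If a semicolon is missing where I expect one, a message is printed: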

import glob

path = r'C:\path\*.txt'

for fname in glob.glob(path):
    print("Checking file", fname)
    with open(fname) as f:
        content = f.readlines()
        for count, line in enumerate(content):
            if (line[2:3]!=";" 
                or line[4:5]!=";" 
                or line[10:11]!=";"
               # really a lot of continuing entries like these
                or line[14:15]!=";"
                ):
                print("\nSemikolon expected, but not found!\nrow:", count+1, "\n", fname, "\n", line)

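The code works. No errors are thrown and it detects the faulty data rows.

My problem now is that I have a lot of semicolons to check, so I end up with a great many consecutive entries like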

or line[xx:xx]!=";"

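I think this is inefficient in two respects:

Having so many lines of code is not visually pleasing. I think it could be shortened.
Having so many separate or checks is logically inefficient. I think it could be made more efficient to reduce the runtime.

I am looking for an efficient solution that:

improves readability
most importantly: reduces the runtime (because I think the way it is written now, with all the or statements, is inefficient)

I only want to check that there are semicolons where I expect them, in the places where I need them. I don't care about any additional semicolons in the data fields.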

Answer 1

Just going off of what you've written:

filename = ...

with open(filename) as file:
    lines = file.readlines()
delimiter_indices = (2, 4, 10, 14) # The indices in any given line where you expect to see semicolons.
for line_num, line in enumerate(lines):
    if any(line[index] != ";" for index in delimiter_indices):
        print(f"{filename}: Semicolon expected on line #{line_num}")

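If a line doesn't have at least 15 characters, this will raise an exception (an IndexError from line[index]). Also, a line like ;;;;;;;;;;;;;;; is technically valid.

Edit: Assuming you have an input file that looks like this: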

AB;2;43234;343;
CD;4;41234;443;
FE;5;53234;543;
FE;5;53;34;543;

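(Note the empty line at the end.) The solution I provided works fine. I don't get any Semicolon expected on line #... messages.

If your input file ends with two empty lines, this will raise an exception. If your input file contains an empty line somewhere in the middle, this will also raise an exception. If any line in the file is shorter than 15 characters (not counting the last line), this will raise an exception.

You could simply say that every line must meet two criteria to be considered valid:

The current line must be at least 15 characters long (or max(delimiter_indices) + 1 characters long).
All characters at the delimiter indices in the current line must be semicolons.

Code: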

for line_num, line in enumerate(lines):
    is_long_enough = len(line) >= (max(delimiter_indices) + 1)
    has_correct_semicolons = all(line[index] == ';' for index in delimiter_indices)

    if not (is_long_enough and has_correct_semicolons):
        print(f"{filename}: Semicolon expected on line #{line_num}")

Edit: My bad, I broke the short-circuit evaluation for the sake of readability. The following should work:

is_valid_line = (len(line) >= (max(delimiter_indices) + 1)) and (all(line[index] == ';' for index in delimiter_indices))
if not is_valid_line:
    print(f"{filename}: Semicolon expected on line #{line_num}")

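If the line is not long enough, the second half of the expression is never evaluated because of short-circuit evaluation, which should prevent the IndexError.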
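As a minimal sketch of the short-circuit check described above, the expression can be wrapped in a small function and tried on a few sample lines. The function name is_valid_line and the module-level constants are made up for illustration; the indices are the ones from the example.

```python
DELIMITER_INDICES = (2, 4, 10, 14)  # positions taken from the example above
MIN_LINE_LENGTH = max(DELIMITER_INDICES) + 1


def is_valid_line(line):
    """Return True if `line` is long enough and has ';' at every expected index.

    Hypothetical helper, not part of the original answer.
    """
    # `and` short-circuits: the all(...) part only runs once the length check
    # has passed, so line[index] cannot raise IndexError here.
    return len(line) >= MIN_LINE_LENGTH and all(
        line[index] == ";" for index in DELIMITER_INDICES
    )


print(is_valid_line("AB;2;43234;343;\n"))  # well-formed line
print(is_valid_line("FE53234;543;\n"))     # semicolons missing, line too short
print(is_valid_line("\n"))                 # empty line
```

Note that the length includes the trailing newline, exactly as in the code above, because the lines come straight from the file.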

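Edit: Since you have so many files, with so many lines and so many semicolons per line, you can do the max(delimiter_indices) calculation before the loop to avoid computing that value for every line. It may not make much of a difference, but you can also iterate over the file object directly (each iteration yields the next line) rather than loading the whole file into memory via lines = file.readlines() before iterating. It isn't really required, and it's not as cute as using all or any, but I decided to turn the has_correct_semicolons expression into an actual loop that iterates over the delimiter indices. That way your error message can be a bit more explicit, pointing to the offending index of the offending line. There is also a separate error message for when a line is too short.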

import glob
import os

delimiter_indices = (2, 4, 10, 14)
max_delimiter_index = max(delimiter_indices)
min_line_length = max_delimiter_index + 1

for path in glob.glob(r"C:\path\*.txt"):
    # glob.glob returns plain strings, which have no .name attribute,
    # so take the file name with os.path.basename.
    filename = os.path.basename(path)
    print(filename.center(32, "-"))
    with open(path) as file:
        for line_num, line in enumerate(file):
            is_long_enough = len(line) >= min_line_length
            if not is_long_enough:
                print(f"{filename}: Line #{line_num} is too short")
                continue

            has_correct_semicolons = True
            for index in delimiter_indices:
                if line[index] != ";":
                    has_correct_semicolons = False
                    break

            # After a break, index still holds the first offending position.
            if not has_correct_semicolons:
                print(f"{filename}: Semicolon expected on line #{line_num}, character #{index}")

print("All files done")
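If it helps to try the per-line logic without touching real files, the same checks could be pulled into a generator that accepts any iterable of lines. This is only a sketch: find_problems is a hypothetical name, it numbers lines from 1 (unlike the loop above, which starts at 0), and it reports every offending index instead of stopping at the first one.

```python
def find_problems(lines, delimiter_indices=(2, 4, 10, 14)):
    """Yield (line_number, message) for every line that fails the check.

    `lines` can be any iterable of strings, e.g. an open file object.
    Hypothetical helper; line numbers start at 1.
    """
    min_line_length = max(delimiter_indices) + 1
    for line_num, line in enumerate(lines, 1):
        if len(line) < min_line_length:
            yield line_num, "line is too short"
            continue
        # Collect every expected position that does not hold a semicolon.
        bad = [index for index in delimiter_indices if line[index] != ";"]
        if bad:
            yield line_num, f"semicolon expected at indices {bad}"


sample = [
    "AB;2;43234;343;\n",
    "CD;4;41234;443;\n",
    "FE53234;543;\n",     # too short
    "AB;2x43234;343;\n",  # wrong character at index 4
]
for line_num, message in find_problems(sample):
    print(f"line #{line_num}: {message}")
```

Passing an open file object instead of the sample list keeps the line-by-line iteration from the answer, so the whole file is still never loaded into memory at once.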

Answer 2

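If you only want to validate the structure of the lines, you can use a regular expression, which is easy to maintain if your requirements change: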

import re

with open(fname) as f:
    for row, line in enumerate(f, 1):
        if not re.match(r"[A-Z]{2};\d;\d{5};\d{3};", line):
            print("\nSemicolon expected, but not found!\nrow:", row, "\n", fname, "\n", line)

Regex demo here.

If you don't actually care about the content and only want to check the positions of the ;, you can simplify the regex to: r".{2};.;.{5};.{3};"

Demo of the dot regex.
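With many semicolon positions, writing the dot pattern by hand gets tedious. One possible approach, sketched here under the assumption that the positions are known as a tuple of zero-based indices, is to generate the pattern from those indices and compile it once. The helper name build_pattern is made up for illustration.

```python
import re


def build_pattern(delimiter_indices):
    """Compile a regex that requires ';' at each index and any character elsewhere.

    Hypothetical helper, not part of the original answer.
    """
    parts = []
    previous = -1
    for index in sorted(delimiter_indices):
        gap = index - previous - 1  # free characters before this semicolon
        parts.append(f".{{{gap}}};")
        previous = index
    return re.compile("".join(parts))


pattern = build_pattern((2, 4, 10, 14))
print(pattern.pattern)

for line in ("AB;2;43234;343;\n", "FE53234;543;\n", "FE;5;53;34;543;\n"):
    print(repr(line), bool(pattern.match(line)))
```

Because . does not match a newline by default, a line that is too short simply fails to match instead of raising an exception, and extra semicolons inside the data fields are accepted since . matches them too.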

Answer 1
Score: 3, accepted, 2023-01-02 09:56:31

Answer 2
Score: 0, 2023-01-02 10:13:18
