Efficient way to check for expected semicolon positions in a length-delimited text file: combining many "or" statements

I am checking the position of semicolons in text files. I have length-delimited text files with thousands of rows that look like this:

AB;2;43234;343;
CD;4;41234;443;
FE53234;543;
FE;5;53;34;543;

I am using the following code to check the correct position of the semicolons. If a semicolon is missing where I would expect it, a statement is printed:

import glob

path = r'C:\path\*.txt'

for fname in glob.glob(path):
    print("Checking file", fname)
    with open(fname) as f:
        content = f.readlines()
        for count, line in enumerate(content):
            if (line[2:3]!=";" 
                or line[4:5]!=";" 
                or line[10:11]!=";"
               # really a lot of continuing entries like these
                or line[14:15]!=";"
                ):
                print("\nSemicolon expected, but not found!\nrow:", count+1, "\n", fname, "\n", line)

The code works. No error is thrown and it detects the faulty data row.

My problem is that I have a lot of semicolons to check, so I end up with a great many continuation entries like

or line[xx:xx]!=";"

I think this is inefficient in two respects:

  1. It is visually unappealing to have so many lines of code. I think it could be shortened.
  2. It is logically inefficient to have so many separate or checks. I think a more efficient approach could reduce the runtime.

I am looking for an efficient solution that:

  1. Improves readability
  2. Most importantly, reduces the runtime (I suspect the current version, with all its or statements, is inefficient)

I only want to check that there are semicolons where I expect them, that is, where I need them. I do not care about any additional semicolons in the data fields.

Just going off of what you've written:

filename = ...

with open(filename) as file:
    lines = file.readlines()
delimiter_indices = (2, 4, 10, 14) # The indices in any given line where you expect to see semicolons.
for line_num, line in enumerate(lines, 1):  # start at 1 so reported line numbers are 1-based
    if any(line[index] != ";" for index in delimiter_indices):
        print(f"{filename}: Semicolon expected on line #{line_num}")

If a line doesn't have at least 15 characters, this will raise an IndexError. Also, lines like ;;;;;;;;;;;;;;; are technically valid.
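
One possible way around the exception, offered as a sketch rather than as part of the answer above: the question's own code uses slices (line[2:3]), and a slice past the end of a string yields an empty string instead of raising IndexError. The helper name has_expected_semicolons is made up for illustration.

```python
delimiter_indices = (2, 4, 10, 14)  # positions where a semicolon is expected


def has_expected_semicolons(line, indices=delimiter_indices):
    """Return True if every expected position holds a semicolon.

    Slicing (line[i:i + 1]) gives "" for an out-of-range position,
    so short or blank lines count as invalid instead of raising IndexError.
    """
    return all(line[i:i + 1] == ";" for i in indices)


print(has_expected_semicolons("AB;2;43234;343;\n"))  # True
print(has_expected_semicolons("FE53234;543;\n"))     # False
print(has_expected_semicolons("\n"))                 # False
```

With this variant a too-short line is simply reported as invalid, at the cost of not distinguishing "too short" from "wrong character".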


EDIT: Assuming you have an input file that looks like:

AB;2;43234;343;
CD;4;41234;443;
FE;5;53234;543;
FE;5;53;34;543;

(note the blank line at the end), my provided solution works fine. I do not see any exceptions or "Semicolon expected on line #..." output.

If your input file ends with two blank lines, this will raise an IndexError. If your input file contains a blank line somewhere in the middle, this will also raise one. The same happens if any line in your file (not counting the last line) is shorter than 15 characters.

You could simply say that every line must meet two criteria to be considered valid:

  1. The current line must be at least 15 characters long (or max(delimiter_indices) + 1 characters long).
  2. All characters at delimiter indices in the current line must be semicolons.

Code:

for line_num, line in enumerate(lines, 1):  # start at 1 so reported line numbers are 1-based
    is_long_enough = len(line) >= (max(delimiter_indices) + 1)
    has_correct_semicolons = all(line[index] == ';' for index in delimiter_indices)

    if not (is_long_enough and has_correct_semicolons):
        print(f"{filename}: Semicolon expected on line #{line_num}")

EDIT: My bad, I ruined the short-circuit evaluation for the sake of readability. The following should work:

is_valid_line = (len(line) >= (max(delimiter_indices) + 1)) and (all(line[index] == ';' for index in delimiter_indices))
if not is_valid_line:
    print(f"{filename}: Semicolon expected on line #{line_num}")

If the line is too short, the second half of the expression will not be evaluated due to short-circuit evaluation, which prevents the IndexError.
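
As a minimal, self-contained sketch of that short-circuit check (the function wrapper and the sample strings are assumptions added for illustration; the index tuple is the one from the answer):

```python
delimiter_indices = (2, 4, 10, 14)
min_line_length = max(delimiter_indices) + 1  # computed once, not per line


def is_valid_line(line):
    # The length check runs first; if it fails, `and` short-circuits and
    # the indexing on the right-hand side is never evaluated.
    return len(line) >= min_line_length and all(
        line[index] == ";" for index in delimiter_indices
    )


for sample in ("CD;4;41234;443;", "FE53234;543;", ""):
    print(repr(sample), is_valid_line(sample))
```

The empty string and the 12-character row are rejected by the length test alone, so no index is ever looked up on them.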


EDIT: Since you have so many files with so many lines and so many semicolons per line, you could do the max(delimiter_indices) calculation before the loop to avoid having to calculate that value for each line. It may not make a big difference, but you could also iterate over the file object directly (which yields the next line on each iteration), as opposed to loading the entire file into memory via lines = file.readlines() before you iterate.

This isn't really required, and it's not as cute as using all or any, but I decided to turn the has_correct_semicolons expression into an actual loop that iterates over the delimiter indices. That way your error message can be a bit more explicit, pointing to the offending index of the offending line. There is also a separate error message for when a line is too short.

import glob
import os

delimiter_indices = (2, 4, 10, 14)
max_delimiter_index = max(delimiter_indices)
min_line_length = max_delimiter_index + 1

for path in glob.glob(r"C:\path\*.txt"):
    # glob.glob returns plain strings, which have no .name attribute
    filename = os.path.basename(path)
    print(filename.center(32, "-"))
    with open(path) as file:
        for line_num, line in enumerate(file, 1):  # start at 1 so reported line numbers are 1-based
            is_long_enough = len(line) >= min_line_length
            if not is_long_enough:
                print(f"{filename}: Line #{line_num} is too short")
                continue

            has_correct_semicolons = True
            for index in delimiter_indices:
                if line[index] != ";":
                    has_correct_semicolons = False
                    break

            if not has_correct_semicolons:
                print(f"{filename}: Semicolon expected on line #{line_num}, character #{index}")

print("All files done")

If you just want to validate the structure of the lines, you can use a regex that is easy to maintain if your requirement changes:

import re

with open(fname) as f:
    for row, line in enumerate(f, 1):
        if not re.match(r"[A-Z]{2};\d;\d{5};\d{3};", line):
            print("\nSemicolon expected, but not found!\nrow:", row, "\n", fname, "\n", line)

If you don't actually care about the content and only want to check the position of the ;, you can simplify the regex to: r".{2};.;.{5};.{3};"
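
If the list of positions is long, writing that dot regex by hand gets tedious. The following is a rough sketch, not part of the original answer, that derives an equivalent pattern from a tuple of expected indices and compiles it once up front. The function name build_position_pattern is hypothetical.

```python
import re


def build_position_pattern(indices):
    """Build a regex requiring ';' at each index and any character elsewhere."""
    parts = []
    previous = -1
    for index in sorted(indices):
        gap = index - previous - 1  # characters between consecutive semicolons
        parts.append(".{%d};" % gap)
        previous = index
    return re.compile("".join(parts))


pattern = build_position_pattern((2, 4, 10, 14))
print(pattern.pattern)  # .{2};.{1};.{5};.{3};

for line in ("AB;2;43234;343;\n", "FE53234;543;\n", "\n"):
    print(repr(line), bool(pattern.match(line)))
```

Because . does not match a newline by default, lines that are too short fail to match rather than raising an exception.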
