I am checking the position of semicolons in text files. I have length-delimited text files having thousands of rows which look like this:
AB;2;43234;343;
CD;4;41234;443;
FE53234;543;
FE;5;53;34;543;
I am using the following code to check the correct position of the semicolons. If a semicolon is missing where I would expect it, a statement is printed:
import glob
path = r'C:\path\*.txt'
for fname in glob.glob(path):
    print("Checking file", fname)
    with open(fname) as f:
        content = f.readlines()
    for count, line in enumerate(content):
        if (line[2:3]!=";"
            or line[4:5]!=";"
            or line[10:11]!=";"
            # really a lot of continuing entries like these
            or line[14:15]!=";"
            ):
            print("\nSemicolon expected, but not found!\nrow:", count+1, "\n", fname, "\n", line)
The code works. No error is thrown and it detects the data row.
My problem now is that I have a lot of semicolons to check, so the condition contains a very long run of entries like or line[xx:xx]!=";". I think this is inefficient, not least because of the many chained "or" checks, and a better approach could probably decrease the runtime. I am looking for a more efficient solution.
I only want to check that there are semicolons where I expect and need them. I do not care about any additional semicolons in the data fields.
Just going off of what you've written:
filename = ...
with open(filename) as file:
    lines = file.readlines()
delimiter_indices = (2, 4, 10, 14) # The indices in any given line where you expect to see semicolons.
for line_num, line in enumerate(lines):
    if any(line[index] != ";" for index in delimiter_indices):
        print(f"{filename}: Semicolon expected on line #{line_num}")
If the line doesn't have at least 15 characters, this will raise an exception. Also, lines like ;;;;;;;;;;;;;;; are technically valid.
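To see why short lines are a problem here but not in the question's code, the following minimal sketch compares indexing with the slicing used in the question. The sample string is the broken row from the question; the variable names are only illustrative. Indexing past the end of a string raises IndexError, whereas a slice past the end simply yields an empty string:

```python
short_line = "FE53234;543;"  # the broken row from the question, 12 characters

# Slicing past the end returns an empty string instead of raising.
slice_result = short_line[14:15]

# Indexing past the end raises IndexError.
try:
    short_line[14]
    index_raised = False
except IndexError:
    index_raised = True

print(repr(slice_result), index_raised)
```

So the slice comparison line[14:15] != ";" from the question is already safe for short lines, while line[14] needs a length check first.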
EDIT: Assuming you have an input file that looks like:
AB;2;43234;343;
CD;4;41234;443;
FE;5;53234;543;
FE;5;53;34;543;
(Note: the blank line at the end.) My provided solution works fine. I do not see any exceptions or "Semicolon expected on line #..." outputs.
If your input file ends with two blank lines, this will raise an exception. If your input file contains a blank line somewhere in the middle, this will also raise an exception. If you have lines in your file that are less than 15 characters long (not counting the last line), this will raise an exception.
You could simply say that every line must meet two criteria to be considered valid: it must be long enough (at least max(delimiter_indices) + 1 characters long), and it must have a semicolon at each of the expected indices. Code:
for line_num, line in enumerate(lines):
    is_long_enough = len(line) >= (max(delimiter_indices) + 1)
    has_correct_semicolons = all(line[index] == ';' for index in delimiter_indices)
    if not (is_long_enough and has_correct_semicolons):
        print(f"{filename}: Semicolon expected on line #{line_num}")
EDIT: My bad, I ruined the short-circuit evaluation for the sake of readability. The following should work:
is_valid_line = (len(line) >= (max(delimiter_indices) + 1)) and (all(line[index] == ';' for index in delimiter_indices))
if not is_valid_line:
    print(f"{filename}: Semicolon expected on line #{line_num}")
If the length of the line is not correct, the second half of the expression will not be evaluated due to short-circuit evaluation, which should prevent the IndexError.
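The short-circuit behaviour can be checked in isolation with a small sketch. The helper name is_valid_line and the sample strings are only illustrative (they are not part of the original code); the indices are the ones assumed throughout this answer:

```python
def is_valid_line(line, delimiter_indices, min_length):
    # `and` short-circuits: all(...) only runs when the length check passed,
    # so line[i] can never go out of range.
    return len(line) >= min_length and all(line[i] == ";" for i in delimiter_indices)

delimiter_indices = (2, 4, 10, 14)
min_length = max(delimiter_indices) + 1

samples = [
    "AB;2;43234;343;\n",   # well-formed
    "FE53234;543;\n",      # too short, semicolons missing
    "\n",                  # blank line
    "FE;5;53;34;543;\n",   # extra semicolon inside a field, still accepted
]
results = [is_valid_line(s, delimiter_indices, min_length) for s in samples]
print(results)  # [True, False, False, True]
```

The blank line and the short row are rejected without raising, because the indexing is never reached for them.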
EDIT: Since you have so many files with so many lines and so many semicolons per line, you could do the max(delimiter_indices) calculation before the loop to avoid having to calculate that value for each line. It may not make a big difference, but you could also just iterate over the file object directly (which yields the next line each iteration), as opposed to loading the entire file into memory before you iterate via lines = file.readlines(). This isn't really required, and it's not as cute as using all() or any(), but I decided to turn the has_correct_semicolons expression into an actual loop that iterates over the delimiter indices. That way your error message can be a bit more explicit, pointing to the offending index of the offending line. Also, there's a separate error message for when a line is too short.
import glob
import os
delimiter_indices = (2, 4, 10, 14)
max_delimiter_index = max(delimiter_indices)
min_line_length = max_delimiter_index + 1
for path in glob.glob(r"C:\path\*.txt"):
    # glob.glob returns plain strings (which have no .name attribute),
    # so extract the file name with os.path.basename.
    filename = os.path.basename(path)
    print(filename.center(32, "-"))
    with open(path) as file:
        for line_num, line in enumerate(file):
            is_long_enough = len(line) >= min_line_length
            if not is_long_enough:
                print(f"{filename}: Line #{line_num} is too short")
                continue
            has_correct_semicolons = True
            for index in delimiter_indices:
                if line[index] != ";":
                    has_correct_semicolons = False
                    break
            if not has_correct_semicolons:
                # index still holds the first offending position after the break
                print(f"{filename}: Semicolon expected on line #{line_num}, character #{index}")
print("All files done")
If you just want to validate the structure of the lines, you can use a regex that is easy to maintain if your requirement changes:
import re
with open(fname) as f:
    for row, line in enumerate(f, 1):
        if not re.match(r"[A-Z]{2};\d;\d{5};\d{3};", line):
            print("\nSemicolon expected, but not found!\nrow:", row, "\n", fname, "\n", line)
If you don't actually care about the content and only want to check the position of the ;, you can simplify the regex to: r".{2};.;.{5};.{3};"
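With many semicolons per line, writing the simplified pattern by hand gets tedious. One possible approach, sketched here under the assumption that you keep the expected positions as a tuple of 0-based indices, is to generate the pattern from those indices and compile it once before looping over the lines. The helper name pattern_from_indices is hypothetical and not part of the re module:

```python
import re

def pattern_from_indices(delimiter_indices):
    """Build a compiled regex that requires ';' at each given 0-based index."""
    parts = []
    previous = -1
    for index in sorted(delimiter_indices):
        gap = index - previous - 1  # number of arbitrary characters before this ';'
        parts.append(f".{{{gap}}};")
        previous = index
    return re.compile("".join(parts))

pattern = pattern_from_indices((2, 4, 10, 14))
print(pattern.pattern)  # .{2};.{1};.{5};.{3};

for line in ["AB;2;43234;343;\n", "FE53234;543;\n"]:
    if not pattern.match(line):
        print("Semicolon expected, but not found:", line.rstrip())
```

Because the pattern is compiled once, the per-line cost is a single pattern.match call, and short or blank lines simply fail to match instead of raising.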