Efficient way to check for expected semicolon positions in a length-delimited text file. Combining many "or" statements

I am checking the position of semicolons in text files. I have length-delimited text files with thousands of rows that look like this:

AB;2;43234;343;
CD;4;41234;443;
FE53234;543;
FE;5;53;34;543;

I am using the following code to check that the semicolons are in the correct positions. If a semicolon is missing where I expect one, a message is printed:

import glob

path = r'C:\path\*.txt'

for fname in glob.glob(path):
    print("Checking file", fname)
    with open(fname) as f:
        content = f.readlines()
        for count, line in enumerate(content):
            if (line[2:3] != ";"
                or line[4:5] != ";"
                or line[10:11] != ";"
                # really a lot of continuing entries like these
                or line[14:15] != ";"
                ):
                print("\nSemicolon expected, but not found!\nrow:", count+1, "\n", fname, "\n", line)

The code works. No error is thrown and it detects the faulty data row.

My problem now is that I have a lot of semicolons to check, so I have a great many continuation entries like

or line[xx:xx]!=";"

I think this is inefficient in two respects:

  1. Having this many lines of code is visually unappealing. I think it could be shortened.
  2. Having this many separate or checks is logically inefficient. I think a more efficient version would probably reduce the runtime.

I am looking for an efficient solution which:

  1. Improves the readability
  2. Most importantly: reduces the runtime (as I think the way it is written now, with all the or statements, is inefficient)

I only want to check that there are semicolons where I expect them, that is, where I need them. I do not care about any additional semicolons in the data fields.

Just going off of what you've written:

filename = ...

with open(filename) as file:
    lines = file.readlines()
delimiter_indices = (2, 4, 10, 14)  # The indices in any given line where you expect to see semicolons.
for line_num, line in enumerate(lines, start=1):  # Number lines from 1, as a text editor would.
    if any(line[index] != ";" for index in delimiter_indices):
        print(f"{filename}: Semicolon expected on line #{line_num}")

If a line doesn't have at least 15 characters, this can raise an IndexError (it does so whenever every delimiter index that exists in the line holds a semicolon, so the check runs past the end of the line). Also, lines like ;;;;;;;;;;;;;;; are technically valid.
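Both caveats can be reproduced in a few lines. This is only a minimal sketch with made-up sample strings; has_semicolons is a hypothetical helper name that wraps the same check and is not part of the code above.

```python
delimiter_indices = (2, 4, 10, 14)


def has_semicolons(line):
    """Return True if every expected index holds a semicolon.

    Hypothetical helper: it does no length check, so it may raise IndexError.
    """
    return all(line[index] == ";" for index in delimiter_indices)


# A line made only of semicolons passes, because only the positions are checked.
print(has_semicolons(";" * 15))

# A short line whose first wrong character comes early is simply rejected:
# the generator stops at index 2 and never reaches the missing indices.
print(has_semicolons("FE53234;543;"))

# A short line that is correct as far as it goes runs past its end.
try:
    has_semicolons("AB;2;43234;")
except IndexError as exc:
    print("IndexError:", exc)
```

The last case is the one to watch for: a truncated but otherwise well-formed row, or a blank line, triggers the exception.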


EDIT: Assuming you have an input file that looks like:

AB;2;43234;343;
CD;4;41234;443;
FE;5;53234;543;
FE;5;53;34;543;

(Note the blank line at the end.) My provided solution works fine. I do not see any exceptions or Semicolon expected on line #... outputs.

If your input file ends with two blank lines, this will raise an exception. If your input file contains a blank line somewhere in the middle, this will also raise an exception. If you have lines in your file that are less than 15 characters long (not counting the last line), this can raise an exception as well.

You could simply say that every line must meet two criteria to be considered valid:

  1. The current line must be at least 15 characters long (that is, max(delimiter_indices) + 1 characters long).
  2. All characters at the delimiter indices in the current line must be semicolons.

Code:

for line_num, line in enumerate(lines):
    is_long_enough = len(line) >= (max(delimiter_indices) + 1)
    has_correct_semicolons = all(line[index] == ';' for index in delimiter_indices)

    if not (is_long_enough and has_correct_semicolons):
        print(f"{filename}: Semicolon expected on line #{line_num}")

EDIT: My bad, I ruined the short-circuit evaluation for the sake of readability. The following should work:

is_valid_line = (len(line) >= (max(delimiter_indices) + 1)) and (all(line[index] == ';' for index in delimiter_indices))
if not is_valid_line:
    print(f"{filename}: Semicolon expected on line #{line_num}")

If the line is too short, the second half of the expression will not be evaluated because of short-circuit evaluation, which should prevent the IndexError.
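To see the short-circuit behaviour in isolation, here is a small sketch that wraps the expression in a function and runs it on a few made-up lines. The function name is_valid_line and the samples are assumptions for illustration only.

```python
delimiter_indices = (2, 4, 10, 14)
min_line_length = max(delimiter_indices) + 1


def is_valid_line(line):
    # The length check comes first. If it fails, `and` returns False
    # immediately and the indexing on the right is never evaluated.
    return len(line) >= min_line_length and all(
        line[index] == ";" for index in delimiter_indices
    )


samples = [
    "AB;2;43234;343;\n",  # well-formed row
    "FE53234;543;\n",     # semicolons missing, row too short
    "AB;2;43234;\n",      # truncated row
    "\n",                 # blank line
]
for sample in samples:
    print(repr(sample), is_valid_line(sample))
```

None of the samples raises an exception, including the blank line and the truncated row.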


EDIT: Since you have so many files with so many lines and so many semicolons per line, you could do the max(delimiter_indices) calculation before the loop to avoid having to calculate that value for each line. It may not make a big difference, but you could also iterate over the file object directly (which yields the next line on each iteration), as opposed to loading the entire file into memory via lines = file.readlines() before you iterate. This isn't really required, and it's not as cute as using all or any, but I decided to turn the has_correct_semicolons expression into an actual loop that iterates over the delimiter indices. That way your error message can be a bit more explicit, pointing to the offending index of the offending line. Also, there's a separate error message for when a line is too short.

from pathlib import Path

delimiter_indices = (2, 4, 10, 14)
max_delimiter_index = max(delimiter_indices)
min_line_length = max_delimiter_index + 1

# Path.glob yields Path objects, so path.name works (glob.glob returns plain strings).
for path in Path(r"C:\path").glob("*.txt"):
    filename = path.name
    print(filename.center(32, "-"))
    with open(path) as file:
        # Number lines from 1 so the messages match what a text editor shows.
        for line_num, line in enumerate(file, start=1):
            is_long_enough = len(line) >= min_line_length
            if not is_long_enough:
                print(f"{filename}: Line #{line_num} is too short")
                continue

            has_correct_semicolons = True
            for index in delimiter_indices:
                if line[index] != ";":
                    has_correct_semicolons = False
                    break  # index now holds the first offending position

            if not has_correct_semicolons:
                print(f"{filename}: Semicolon expected on line #{line_num}, character #{index}")

print("All files done")
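If runtime is the main concern, one alternative worth measuring is operator.itemgetter, which fetches the characters at all delimiter indices in a single call so the result can be compared against a tuple of semicolons. This is only a sketch: check_line is a hypothetical name, and whether it is actually faster than the explicit loop depends on your data, so time it on real files before relying on it.

```python
from operator import itemgetter

delimiter_indices = (2, 4, 10, 14)
min_line_length = max(delimiter_indices) + 1

# With two or more indices, itemgetter returns a tuple of the items at those
# indices. (With a single index it returns the bare item, so this sketch
# assumes at least two delimiter indices.)
get_delimiters = itemgetter(*delimiter_indices)
expected = (";",) * len(delimiter_indices)


def check_line(line):
    """Return True if the line is long enough and has ';' at every expected index."""
    # The length check runs first, so get_delimiters never indexes past the end.
    return len(line) >= min_line_length and get_delimiters(line) == expected


print(check_line("AB;2;43234;343;\n"))
print(check_line("FE53234;543;\n"))
```

Unlike the loop above, this version does not tell you which index was wrong, so it trades the more explicit error message for a shorter check.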

If you just want to validate the structure of the lines, you can use a regex, which is easy to maintain if your requirements change:

import re

with open(fname) as f:
    for row, line in enumerate(f, 1):
        if not re.match(r"[A-Z]{2};\d;\d{5};\d{3};", line):
            print("\nSemicolon expected, but not found!\nrow:", row, "\n", fname, "\n", line)

Regex demo here.

If you don't actually care about the content and only want to check the position of the ; characters, you can simplify the regex to: r".{2};.;.{5};.{3};"

Demo for the dot regex.
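With many semicolons per line, writing the dot regex by hand gets tedious. One option, sketched here under the assumption that you already have the tuple of expected indices, is to generate the pattern from it. build_position_pattern is a hypothetical helper name, not something from the answer above.

```python
import re


def build_position_pattern(delimiter_indices):
    """Build a regex that requires ';' at each given index and any character elsewhere."""
    parts = []
    previous = -1
    for index in sorted(delimiter_indices):
        gap = index - previous - 1  # characters between two consecutive expected semicolons
        parts.append(".{%d};" % gap)
        previous = index
    return re.compile("".join(parts))


pattern = build_position_pattern((2, 4, 10, 14))
print(pattern.pattern)

for line in ["AB;2;43234;343;\n", "FE53234;543;\n", "FE;5;53;34;543;\n"]:
    print(repr(line), bool(pattern.match(line)))
```

Because . does not match a newline by default, a line that is too short cannot match by accident, so no separate length check is needed with this approach.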
