IndexError in Python while splitting an input file based on a pattern

The code tries to split the text data on a separator, but I keep getting this error:

Traceback (most recent call last):
  File "split.py", line 7, in <module>
    en_text = split_text[1].lstrip()
IndexError: list index out of range

The two output files must also have the same number of lines, but I got 94132 lines in en_out.txt and 94304 lines in mn_out.txt, and I'm not sure what is going on.

The code I used is:

with open('mn_en_sentences_split.txtaa') as inputFile:
    inFile = inputFile.readlines()

for i in inFile:
    split_text = i.split("+++++SEP+++++")
    mn_text = split_text[0].rstrip()
    en_text = split_text[1].lstrip()
    with open("mn_out.txt", "a") as mn_out:
        mn_out.write(mn_text + "\n")
    
    with open("en_out.txt", "a") as en_out:
        en_out.write(en_text)

The input file for this code can be found at https://drive.google.com/file/d/1GNo1XJxRFxjey5VDsHjLvj9upXJOqd3e/view

The reason for the IndexError is that split_text has only 1 element when the line does not contain the separator, so split_text[1] does not exist.
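A minimal sketch of that behaviour, using two made-up sample lines (the sample text is an assumption, only the separator comes from the question):

```python
SEP = "+++++SEP+++++"  # separator from the question

# Hypothetical sample lines, not taken from the real input file.
with_sep = "mn sentence +++++SEP+++++ en sentence\n"
without_sep = "a line with no separator\n"

parts_ok = with_sep.split(SEP)      # two elements
parts_bad = without_sep.split(SEP)  # one element: the whole line

try:
    parts_bad[1]
except IndexError as exc:
    # Same error as in the traceback above.
    error_message = str(exc)

print(len(parts_ok), len(parts_bad), error_message)
```
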

You have to deal with this case: either drop the line or process it differently.

Another case to consider is a line with multiple separators. Marat had a nice solution for that case (see the edit).
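As a rough illustration of the multiple-separator case (the sample line is made up), passing a maxsplit of 1 to str.split() splits only at the first separator, so the result never has more than two parts:

```python
SEP = "+++++SEP+++++"  # separator from the question

# Hypothetical line that contains the separator twice.
line = "mn text +++++SEP+++++ en part one +++++SEP+++++ en part two\n"

all_parts = line.rstrip("\n").split(SEP)     # splits at every separator
two_parts = line.rstrip("\n").split(SEP, 1)  # splits at the first separator only

print(len(all_parts))  # 3
print(len(two_parts))  # 2
```

With maxsplit=1 everything after the first separator, including the second separator, stays in the second part.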

A few other refactoring tips:

There is no need to read the whole file before processing; you can iterate over the file object line by line.

For faster processing, do not open and close the output files for every line. Open them once, before the loop.

Use a debugger to check whether the results of the split contain the newline character.

If you don't need any whitespace at the ends of the strings, use strip() to remove all of it, or strip('\n') to remove only the newline character.

Then add the newline back when writing both lines, so the two outputs are treated the same way.

with open('mn_en_sentences_split.txtaa') as inputFile:
    with open("mn_out.txt", "w") as mn_out:
        with open("en_out.txt", "w") as en_out:
            for i in inputFile:
                # build a list: in Python 3 map() returns an iterator, which has no len()
                split_text = [x.strip('\n') for x in i.split("+++++SEP+++++")]
                if len(split_text) < 2:
                    continue  # drop line if no separator
                mn_out.write(split_text[0].rstrip() + "\n")
                en_out.write(split_text[1].lstrip() + "\n")
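The difference between strip() and strip('\n') mentioned in the tips can be checked quickly on a made-up string (the sample value is an assumption):

```python
# Hypothetical string with spaces, a tab and a trailing newline.
raw = "  hello world \t\n"

only_newline = raw.strip("\n")  # removes only leading/trailing newline characters
all_whitespace = raw.strip()    # removes all leading/trailing whitespace

print(repr(only_newline))    # '  hello world \t'
print(repr(all_whitespace))  # 'hello world'
```

Which one is appropriate depends on whether leading or trailing spaces in the sentences are meaningful.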

Edit

Marat made a few suggestions to refactor the code and make it fail-safe when the separator is not found. The 3 with statements can be combined into one (syntactic sugar) to reduce the indentation; multiple context managers in a single with statement require Python 3.1 or later.

I really like the variable unpacking of the split result. If the split does not produce exactly two parts, the unpacking raises a ValueError exception.

I have chosen to skip the lines that do not have a separator. If you want to do something with these lines, you have to put the write() calls after the try/except and set mn and en to some value in the exception handler.

I like to keep the normal flow of the code inside the try.

What you strip from the strings, and how, is up to you; it depends on what you want and on what the input might contain.

with open('mn_en_sentences_split.txtaa') as inputFile, \
     open("mn_out.txt", "w") as mn_out, \
     open("en_out.txt", "w") as en_out:
    for line in inputFile:
        try:
            mn, en = line.strip('\n').split("+++++SEP+++++", 1)
            mn_out.write(mn.rstrip() + "\n")
            en_out.write(en.lstrip() + "\n")
        except ValueError:
            pass  # skip lines without the separator
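If the lines without a separator should be kept rather than skipped, the variant described above (write() calls after the try/except) might look roughly like this. It is only a sketch: the function name split_keep_all, the placeholder parameter, and the choice to put the unsplit text on the mn side are assumptions, and in-memory buffers stand in for the real files so the example runs on its own.

```python
import io

SEP = "+++++SEP+++++"  # separator from the question


def split_keep_all(lines, mn_out, en_out, placeholder=""):
    """Write exactly one line to each output for every input line.

    The function name and `placeholder` are hypothetical, for illustration only.
    """
    for line in lines:
        try:
            mn, en = line.strip("\n").split(SEP, 1)
        except ValueError:
            # No separator found: keep the text on the mn side (an assumption)
            # and write the placeholder on the en side.
            mn, en = line.strip("\n"), placeholder
        mn_out.write(mn.rstrip() + "\n")
        en_out.write(en.lstrip() + "\n")


# In-memory stand-ins for the input and output files.
sample = io.StringIO("a +++++SEP+++++ b\nno separator here\nc +++++SEP+++++ d\n")
mn_buf, en_buf = io.StringIO(), io.StringIO()
split_keep_all(sample, mn_buf, en_buf)

mn_lines = mn_buf.getvalue().splitlines()
en_lines = en_buf.getvalue().splitlines()
print(len(mn_lines), len(en_lines))  # both outputs have the same number of lines
```

Because both write() calls run once per input line, the two outputs always end up with the same number of lines, which is the property the question asks for.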
