
Python lines concatenate themselves when writing to a file

I'm using Python to generate training and testing data for 10-fold cross-validation, and to write the datasets to 2x10 separate files (each fold writes one training file and one testing file). The weird thing is that when writing data to a file, there is always one line "missing". Actually, it may not be missing at all: I discovered later that one line (only one) in the middle of the file gets concatenated onto the previous line. An output file should look something like the following (there should be 39150 lines in total):

44 1 90 0 44 0 45 46 0 1
55 -3 95 0 44 22 40 51 12 4
50 -3 81 0 50 0 31 32 0 1
44 -4 76 0 42 -30 32 34 2 1

However, I keep getting 39149 lines, and somewhere in the middle of the file the output is messed up like this:

44 1 90 0 44 0 45 46 0 1
55 -3 95 0 44 22 40 51 12 450 -3 81 0 50 0 31 32 0 1
44 -4 76 0 42 -30 32 34 2 1

My code:

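import math
import random

def k_fold(myfile, myseed=1, k=10):
    # Load data
    data = open(myfile).readlines()

    # Shuffle input
    # Call seed(); assigning to random.seed would replace the function
    # and leave the shuffle unseeded.
    random.seed(myseed)
    random.shuffle(data)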

    # Compute partition size given input k
    len_total = len(data)
    len_part = int(math.floor(len_total / float(k)))

    # Create one partition per fold
    train = {}
    test = {}
    for i in range(k):
        test[i] = data[i * len_part:(i + 1) * len_part]
        train[i] = data[0:i * len_part] + data[(i + 1) * len_part:len_total]

    return train, test

if __name__ == "__main__":
    path = '....'  #some path and input
    input = '...'

    # Generate data
    [train, test] = k_fold(input)

    # Write data to files
    for i in range(10):
        train_old = path + 'tmp_train_' + str(i)
        test_old = path + 'tmp_test_' + str(i)

        trainF = open(train_old, 'a')
        testF = open(test_old, 'a')

        print(len(train[i]))

The strange thing is that I'm doing the same thing for the training and the testing datasets. The testing dataset outputs the correct file (4350 lines), but the training dataset has the above problem. I'm sure that the function returns the 39150 lines of training data, so I think the problem must be in the file-writing part. Does anybody have any idea what I could have done wrong? Thanks in advance!

I assume that the first half of the double-length line is the last line of the original file.

The lines returned by readlines (or by iterating over the file) all still end with the LF character '\n', except the last line if the file doesn't end with a newline. In that case, the shuffling that you do will put that '\n'-less line somewhere in the middle of 'data', and when the lines are written out, the following line runs straight on from it.
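To see the effect in isolation, here is a minimal sketch. The three-line input and the fixed reordering (standing in for random.shuffle) are made up for illustration; io.StringIO is used only so that no file on disk is needed.

```python
import io

# Hypothetical three-line input whose last line has no trailing newline.
raw = "44 1 90\n55 -3 95\n50 -3 81"
lines = io.StringIO(raw).readlines()
# Only the last element lacks '\n'.

# Stand-in for random.shuffle: move the newline-less line to the front.
reordered = [lines[2], lines[0], lines[1]]

# Writing the lines back unchanged is equivalent to joining them.
output = "".join(reordered)
print(repr(output))
print(len(output.splitlines()), "lines instead of", len(lines))
```

The first two records end up fused on one line, which matches the symptom in the question: one fewer line than expected, with one line twice as long.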

Either append a trailing newline to your original file, or strip all lines before processing and add the newline back to each line when writing to a file.
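The second option could look roughly like this. The helper names read_clean_lines and write_lines, the temporary file names, and the sample data are all hypothetical; the fixed reordering again stands in for the shuffle.

```python
import os
import tempfile

def read_clean_lines(path):
    """Read all lines, dropping the trailing newline from each one."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

def write_lines(path, lines):
    """Write each line followed by exactly one newline."""
    with open(path, "w") as f:
        for line in lines:
            f.write(line + "\n")

# Demo on a throwaway file whose last line has no trailing newline.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    dst = os.path.join(tmp, "output.txt")
    with open(src, "w") as f:
        f.write("44 1 90\n55 -3 95\n50 -3 81")

    lines = read_clean_lines(src)
    reordered = [lines[2], lines[0], lines[1]]  # stand-in for the shuffle
    write_lines(dst, reordered)

    with open(dst) as f:
        written = f.read()

print(repr(written))
```

Because every line is normalized on the way in and terminated on the way out, it no longer matters where the shuffle puts the original last line.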
