Python: How to validate an input file line by line, fix possible errors, and write the cleaned lines to another file?

Question

The lines in my text file look like this:

data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0
data/processed/10/blueprint-0.png,3350.0,387.0,3353.0,388.0

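The number in the position of 1915 should always be smaller than the number in the position of 1933, and the number in the position of 387 should always be smaller than the number in the position of 402.

Unfortunately this is not always the case, because my data is not completely clean. To fix this, I want to create another file: if a line is correct I simply copy it, and if it is not I make the necessary adjustment and write the fixed line to the new file (I do not want to manipulate the data in the original file).

My code: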

path = 'data/faulty.txt'
with open(path ) as f:
    with open('data/true_values.txt', 'a') as the_file:
        for line in f:
            numbers = re.findall(r'\d+', line)
            if numbers:
                if numbers[2] > numbers[6]:
                    temp = numbers[2]
                    numbers[2] = numbers[6]
                    numbers[6] = temp

                if numbers[4] > numbers[8]:
                    temp = numbers[2]
                    numbers[2] = numbers[6]
                    numbers[6] = temp

                the_file.write(line)

How can I make the changes? I also considered using re.sub but could not get it to work.

Answer 1

An example without re:

input_filename = 'full_path_to_my_input_file.txt'
output_filename = 'full_path_to_my_output_file.txt'

with open(output_filename, 'a') as f_out:
    with open(input_filename, 'r') as f_in:
        for line in f_in:
            records = line.strip().split(',')
            if float(records[1]) > float(records[3]):
                records[1], records[3] = records[3], records[1]
            if float(records[2]) > float(records[4]):
                records[2], records[4] = records[4], records[2]
            f_out.write(','.join(records) + '\n')

Input:

data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0
data/processed/10/blueprint-0.png,3353.0,389.0,3350.0,388.0
data/processed/10/blueprint-0.png,952.0,724.0,1010.0,734.0

Output:

data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0
data/processed/10/blueprint-0.png,3350.0,388.0,3353.0,389.0 ## swapped !!
data/processed/10/blueprint-0.png,952.0,724.0,1010.0,734.0
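The same swap logic can be tried without touching any files. The following is a minimal sketch, assuming the five-field layout shown above; the helper name `fix_line` is hypothetical and not part of the answer.

```python
def fix_line(line):
    """Return the line with each coordinate pair in ascending order."""
    # Assumed layout: path, first x, first y, second x, second y
    fields = line.strip().split(',')
    if float(fields[1]) > float(fields[3]):
        fields[1], fields[3] = fields[3], fields[1]
    if float(fields[2]) > float(fields[4]):
        fields[2], fields[4] = fields[4], fields[2]
    return ','.join(fields)


sample = [
    'data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0\n',
    'data/processed/10/blueprint-0.png,3353.0,389.0,3350.0,388.0\n',
]
fixed = [fix_line(line) for line in sample]
print(fixed)
```

Keeping the fix in a function makes it easy to check a few lines by hand before running it over the whole file.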

Answer 2

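I would append the corrected lines to a list as you go, and then write the list to the file at the end. That way you do not have both files open while processing the first one, which makes it more atomic. I also fixed the regular expression, which I had missed earlier.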

import re

# renamed from input/output so the built-in input() is not shadowed
in_path = "data/faulty.txt"
out_path = "data/true_values.txt"
new = []

with open(in_path) as f:
    for line in f:
        name, numberstr = line.split(',', 1)
        numbers = re.findall(r'\d+\.\d+|\d+', numberstr)
        if numbers:
            # compare as floats; comparing the raw strings is lexicographic
            if float(numbers[0]) > float(numbers[2]):
                numbers[0], numbers[2] = numbers[2], numbers[0]
            if float(numbers[1]) > float(numbers[3]):
                numbers[1], numbers[3] = numbers[3], numbers[1]

        new.append("{},{}".format(name, ','.join(numbers)))

with open(out_path, 'a') as the_file:
    for x in new:
        the_file.write(x + '\n')
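If the goal is for the output file to be either fully written or not written at all, one common approach is to write to a temporary file and rename it into place. This is only a sketch of that idea, not what the answer does: the helper name `write_lines_atomically` is hypothetical, and unlike the append mode used above it replaces the target file.

```python
import os
import tempfile


def write_lines_atomically(lines, output_path):
    """Write lines to a temporary file, then rename it over output_path."""
    directory = os.path.dirname(os.path.abspath(output_path))
    # The temporary file is created next to the target so the final
    # rename stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as tmp:
            for line in lines:
                tmp.write(line + '\n')
        # os.replace overwrites the destination if it already exists.
        os.replace(tmp_path, output_path)
    except BaseException:
        # Remove the partial temporary file, then re-raise the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With the list built in the loop, this would be called as `write_lines_atomically(new, "data/true_values.txt")`.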

Answer 3

I believe this can be done without re.

Try this:

path = 'data/faulty.txt'  # input path taken from the question
with open(path) as f, open('output.txt', 'w') as outputFile:
    for line in f:
        # strip the newline before splitting so the last field stays clean
        lineArr = line.rstrip("\n").split(",")
        if float(lineArr[1]) > float(lineArr[3]):
            lineArr[1], lineArr[3] = lineArr[3], lineArr[1]
        if float(lineArr[2]) > float(lineArr[4]):
            lineArr[2], lineArr[4] = lineArr[4], lineArr[2]
        # write the newline once, after joining, to avoid a trailing comma
        outputFile.write(",".join(lineArr) + "\n")

Answer 4

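Please do not use regular expressions when they are not the right tool. You obviously have a CSV format, so use the csv module. You also need to convert your "numbers" to actual numbers: what you read from the file are strings, not numbers. Finally, once a row has been parsed and fixed, you have to recreate the new line from the fixed values before writing it back: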

# XXX untested code, may contain typos or small bugs
import csv

inpath = 'data/faulty.txt'
outpath = 'data/true_values.txt'
# the csv docs recommend newline='' for files passed to reader/writer
with open(inpath, newline='') as infile, open(outpath, 'a', newline='') as outfile:

    # please check the csv doc for the correct options for your file format
    reader = csv.reader(infile, delimiter=",")
    writer = csv.writer(outfile, delimiter=",")

    for row in reader:
        # split the path from the numbers
        imagepath, nums = row[0], row[1:]

        # convert numbers to floats so we have
        # meaningful comparisons
        nums = [float(num) for num in nums]

        # swap the numbers if necessary
        if nums[0] > nums[2]:
            nums[2], nums[0] = nums[0], nums[2]
        if nums[1] > nums[3]:
            nums[3], nums[1] = nums[1], nums[3]

        # recreate the fixed row and write it
        newrow = [imagepath] + nums
        writer.writerow(newrow)
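The point about converting the strings to numbers is easy to see with two values of different lengths. A small illustration (the values are chosen only for the example):

```python
a, b = '952.0', '1010.0'

# Compared as strings, '9' sorts after '1', so the shorter number looks larger.
print(a > b)                # True

# Compared as floats, the order is the numeric one.
print(float(a) > float(b))  # False
```

Any check that compares the raw strings would therefore swap values that were already in the right order.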

Answer 5

path = 'data/faulty.txt'
with open(path, "r") as f, open('data/true_values.txt', "a") as the_file:
    for line in f:
        lineArr = line[:-1].split(",")
        if float(lineArr[1]) > float(lineArr[3]):
            lineArr[1], lineArr[3] = lineArr[3], lineArr[1]
        if float(lineArr[2]) > float(lineArr[4]):
            lineArr[2], lineArr[4] = lineArr[4], lineArr[2]
        the_file.write(",".join(lineArr) + "\n")

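Note lineArr = line[:-1].split(","): slicing off the last character means the last element of the list does not carry the newline character. Otherwise, whenever the last number is swapped, a newline would end up in the middle of the output line. Try the input I provided with lineArr = line.split(",") instead to see why this matters.

Using split you get a list. You can index into it to get the values, convert them to float and compare them; if they are not in the order you want, they are swapped.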

data/faulty.txt:

data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0
data/processed/10/blueprint-0.png,1234.5,387.0,1222.1,380.0
data/processed/10/blueprint-0.png,3350.0,387.0,3353.0,388.0

After running the Python script:

data/true_values.txt:

data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0
data/processed/10/blueprint-0.png,1222.1,380.0,1234.5,387.0  #Swapped
data/processed/10/blueprint-0.png,3350.0,387.0,3353.0,388.0
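The effect of the stray newline can be reproduced on a single line. This is a sketch, with a hypothetical helper `swap_pairs`; it uses `rstrip('\n')` rather than `line[:-1]`, a small variation that also copes with a final line that has no trailing newline.

```python
def swap_pairs(fields):
    """Put each coordinate pair in ascending order (mutates and returns fields)."""
    if float(fields[1]) > float(fields[3]):
        fields[1], fields[3] = fields[3], fields[1]
    if float(fields[2]) > float(fields[4]):
        fields[2], fields[4] = fields[4], fields[2]
    return fields


line = 'data/processed/10/blueprint-0.png,1234.5,387.0,1222.1,380.0\n'

# Splitting the raw line leaves '\n' attached to the last field,
# so swapping that field moves the newline into the middle of the row.
naive = ','.join(swap_pairs(line.split(',')))

# Removing the newline first keeps every field clean.
clean = ','.join(swap_pairs(line.rstrip('\n').split(',')))

print(repr(naive))
print(repr(clean))
```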

Answer 6

This should work (file access stripped out so that the problem can be reproduced):

input = ['data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0',
'data/processed/10/blueprint-0.png,3353.0,387.0,3350.0,388.0']

output = []
for input_line in input:
    numbers = input_line.split(',')
    if numbers:
        if float(numbers[1]) > float(numbers[3]):
            numbers[1], numbers[3] = numbers[3], numbers[1]

        if float(numbers[2]) > float(numbers[4]):
            numbers[2], numbers[4] = numbers[4], numbers[2]

    output.append(','.join(numbers))

print(output)
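The snippet above assumes every line has a path followed by four numbers. If the data may also contain malformed lines, a small parsing step can reject them before any comparison is made. This is a sketch under that assumption; the helper name `parse_row` and the choice to return `None` for bad lines are illustrative only.

```python
def parse_row(line):
    """Return (path, [four floats]) or None if the line is malformed."""
    fields = line.strip().split(',')
    # Expect exactly one path and four numeric fields.
    if len(fields) != 5:
        return None
    try:
        numbers = [float(value) for value in fields[1:]]
    except ValueError:
        # At least one field is not a valid number.
        return None
    return fields[0], numbers


print(parse_row('a.png,1915.0,387.0,1933.0,402.0'))
print(parse_row('a.png,1915.0,387.0'))
print(parse_row('a.png,x,387.0,1933.0,402.0'))
```

Lines for which `parse_row` returns `None` could then be skipped or logged instead of raising an exception halfway through the file.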

6 answers

Answer 1
3 (accepted) 2018-02-16 11:22:32

Answer 2
1 2018-02-16 11:18:36

Answer 3
1 2018-02-16 11:19:54

Answer 4
1 2018-02-16 11:33:06

Answer 5
0 2018-02-16 11:27:06

Answer 6
0 2018-02-16 11:33:25
