Python: How to validate input file line-by-line, fix possible errors, and write cleaned lines to another file?

The lines in my text file look like this:

data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0
data/processed/10/blueprint-0.png,3350.0,387.0,3353.0,388.0

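The number in the position of 1915 should always be smaller than the number in the position of 1933, and the number in the position of 387 should always be smaller than the number in the position of 402.

Unfortunately, this is not always the case, because my data is not completely clean. To fix this, I want to create another file: if a line is correct I simply copy it, and if it is not I make the necessary adjustments so that it is fixed in the new file (I don't want to manipulate the data in the original file).

My code: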

path = 'data/faulty.txt'
with open(path ) as f:
    with open('data/true_values.txt', 'a') as the_file:
        for line in f:
            numbers = re.findall(r'\d+', line)
            if numbers:
                if numbers[2] > numbers[6]:
                    temp = numbers[2]
                    numbers[2] = numbers[6]
                    numbers[6] = temp

                if numbers[4] > numbers[8]:
                    temp = numbers[2]
                    numbers[2] = numbers[6]
                    numbers[6] = temp

                the_file.write(line)

How can I apply the changes? I also considered using re.sub but could not get it to work.

An example without re:

input_filename = 'full_path_to_my_input_file.txt'
output_filename = 'full_path_to_my_output_file.txt'

with open(output_filename, 'a') as f_out:
    with open(input_filename, 'r') as f_in:
        for line in f_in:
            records = line.strip().split(',')
            if float(records[1]) > float(records[3]):
                records[1], records[3] = records[3], records[1]
            if float(records[2]) > float(records[4]):
                records[2], records[4] = records[4], records[2]
            f_out.write(','.join(records) + '\n')

Input:

data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0
data/processed/10/blueprint-0.png,3353.0,389.0,3350.0,388.0
data/processed/10/blueprint-0.png,952.0,724.0,1010.0,734.0

Output:

data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0
data/processed/10/blueprint-0.png,3350.0,388.0,3353.0,389.0 ## swapped !!
data/processed/10/blueprint-0.png,952.0,724.0,1010.0,734.0

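I would collect the modified lines in a list as you go, and then write the list to the file at the end. That way you don't have both files open while processing the first one, which makes it more atomic. I also fixed the regex I had missed earlier.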

import re

input = "data/faulty.txt"
output = "data/true_values.txt"
new = []

with open(input) as f:
    for line in f:
        name, numberstr = line.split(',', 1)
        numbers = re.findall(r'\d+\.\d+|\d+', numberstr)
        if numbers:
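            # compare as floats: the regex returns strings, and
            # "952.0" > "1010.0" is True when compared as text
            if float(numbers[0]) > float(numbers[2]):
                numbers[0], numbers[2] = numbers[2], numbers[0]
            if float(numbers[1]) > float(numbers[3]):
                numbers[1], numbers[3] = numbers[3], numbers[1]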

        new.append("{},{}".format(name, ','.join(numbers)))

with open(output, 'a') as the_file:
    for x in new:
        the_file.write(x + '\n')
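If the point of collecting the lines first is that the output file should never be left half-written, one possible refinement is to write to a temporary file in the same directory and then rename it over the target with `os.replace`. This is not part of the answer above, only a sketch; the helper name `write_lines_atomically` is made up for illustration.

```python
import os
import tempfile


def write_lines_atomically(path, lines):
    """Write lines to path via a temporary file plus rename (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in lines:
                tmp.write(line + "\n")
        # os.replace swaps the finished file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the partial temp file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Note that this replaces the target file instead of appending to it, so it behaves like mode 'w' rather than the 'a' used above.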

I believe this can be done without re.

Try this:

path = 'data/faulty.txt'
with open(path) as f, open('output.txt', 'w') as outputFile:
    for line in f:
        # strip the newline first so it never ends up inside a swapped field
        lineArr = line.rstrip("\n").split(",")
        if float(lineArr[1]) > float(lineArr[3]):
            lineArr[1], lineArr[3] = lineArr[3], lineArr[1]
        if float(lineArr[2]) > float(lineArr[4]):
            lineArr[2], lineArr[4] = lineArr[4], lineArr[2]
        outputFile.write(",".join(lineArr) + "\n")

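Don't use regular expressions when they are not the right tool. You clearly have a CSV format, so use the csv module. Also, you need to convert your "numbers" to actual numbers: what you read are strings, not numbers. Finally, once you have parsed and fixed a row, you have to recreate the new line from the "fixed" values before writing it back: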

# XXX untested code, may contains typos or small bugs
import csv

inpath = 'data/faulty.txt'
outpath = 'data/true_values.txt'
with open(inpath, newline='') as infile, open(outpath, 'a', newline='') as outfile:

    # please check the csv doc for the correct options for your file format
    reader = csv.reader(infile, delimiter=",")
    writer = csv.writer(outfile, delimiter=",")

    for row in reader:
        # split the path from the numbers
        imagepath, nums = row[0], row[1:]

        # convert numbers to floats so we have
        # meaningful comparisons
        nums = [float(num) for num in nums]

        # swap the numbers if necessary
        if nums[0] > nums[2]:
            nums[2], nums[0] = nums[0], nums[2] 
        if nums[1] > nums[3]:
            nums[3], nums[1] = nums[1], nums[3] 

        # recreate the fixed row and write it
        newrow = [imagepath] + nums
        writer.writerow(newrow)

path = 'data/faulty.txt'
with open(path, "r") as f, open('data/true_values.txt', "a") as the_file:
    for line in f:
        lineArr = line[:-1].split(",")
        if float(lineArr[1]) > float(lineArr[3]):
            lineArr[1], lineArr[3] = lineArr[3], lineArr[1]
        if float(lineArr[2]) > float(lineArr[4]):
            lineArr[2], lineArr[4] = lineArr[4], lineArr[2]
        the_file.write(",".join(lineArr) + "\n")

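With lineArr = line[:-1].split(",") the last element of the list does not carry the trailing newline character. Otherwise, if the last number gets swapped, the newline moves with it and ends up in the middle of the output line. To see why this matters, try lineArr = line.split(",") with the input I provided.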
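A small illustration of the difference, using a made-up line whose last pair needs swapping:

```python
line = "img.png,1.0,9.0,2.0,3.0\n"  # made-up sample line

# Without stripping, the newline stays attached to the last field.
raw = line.split(",")
print(raw)                    # ['img.png', '1.0', '9.0', '2.0', '3.0\n']

# Swapping fields 2 and 4 now moves the newline into the middle of the row.
raw[2], raw[4] = raw[4], raw[2]
print(repr(",".join(raw)))    # 'img.png,1.0,3.0\n,2.0,9.0'

# Stripping first keeps the fields clean.
clean = line.rstrip("\n").split(",")
clean[2], clean[4] = clean[4], clean[2]
print(repr(",".join(clean) + "\n"))   # 'img.png,1.0,3.0,2.0,9.0\n'
```

`line[:-1]` always drops the last character, so it would cut off a digit if the final line of the file has no trailing newline; `rstrip("\n")` avoids that.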

split gives you a list. You can index into it to get the values, convert them to float and compare them, and swap them if they are not in the order you want.
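The conversion to float matters because the fields come out of split as strings, and strings are compared character by character, not by numeric value. A quick check with two values from the sample data:

```python
# Values taken from the sample rows; as strings they compare lexicographically.
a, b = "952.0", "1010.0"

print(a > b)                # True: the character '9' sorts after '1'
print(float(a) > float(b))  # False: 952.0 is less than 1010.0
```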

data/faulty.txt:

data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0
data/processed/10/blueprint-0.png,1234.5,387.0,1222.1,380.0
data/processed/10/blueprint-0.png,3350.0,387.0,3353.0,388.0

After running the Python script:

data/true_values.txt:

data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0
data/processed/10/blueprint-0.png,1222.1,380.0,1234.5,387.0  #Swapped
data/processed/10/blueprint-0.png,3350.0,387.0,3353.0,388.0

This should work (file access stripped out so that the problem can be reproduced):

input = ['data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0',
'data/processed/10/blueprint-0.png,3353.0,387.0,3350.0,388.0']

output = []
for input_line in input:
    numbers = input_line.split(',')
    if numbers:
        if float(numbers[1]) > float(numbers[3]):
            numbers[1], numbers[3] = numbers[3], numbers[1]

        if float(numbers[2]) > float(numbers[4]):
            numbers[2], numbers[4] = numbers[4], numbers[2]

    output.append(','.join(numbers))

print(output)
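The same idea can be wrapped in a function that also copes with lines that do not have the expected shape. This is only a sketch: `fix_line` is a hypothetical name, and returning None for malformed lines is an assumption, since the question does not say how such lines should be treated.

```python
def fix_line(line):
    """Return the line with each coordinate pair in ascending order, or None if malformed."""
    fields = line.strip().split(",")
    if len(fields) != 5:
        return None
    try:
        x1, y1, x2, y2 = (float(v) for v in fields[1:])
    except ValueError:
        return None
    # Swap the original strings so the number formatting is left untouched.
    if x1 > x2:
        fields[1], fields[3] = fields[3], fields[1]
    if y1 > y2:
        fields[2], fields[4] = fields[4], fields[2]
    return ",".join(fields)


rows = [
    "data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0",
    "data/processed/10/blueprint-0.png,3353.0,387.0,3350.0,388.0",
    "not,a,valid,row",
]
for row in rows:
    print(fix_line(row))
```

The first row is printed unchanged, the second has its x values swapped, and the third prints None because it has only four fields.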
