
Python: How to validate input file line-by-line, fix possible errors, and write cleaned lines to another file?

The lines in my text file look like this:

data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0
data/processed/10/blueprint-0.png,3350.0,387.0,3353.0,388.0

In each line, the first number (1915.0 in the first example) should always be smaller than the third (1933.0), and the second number (387.0) should always be smaller than the fourth (402.0).

Unfortunately that isn't always the case, as my data isn't perfectly clean. To fix that, I want to create another file: a line is copied as-is if it is correct, and written with the necessary fix if it isn't (I don't want to modify the data in the original file).

My code:

path = 'data/faulty.txt'
with open(path ) as f:
    with open('data/true_values.txt', 'a') as the_file:
        for line in f:
            numbers = re.findall(r'\d+', line)
            if numbers:
                if numbers[2] > numbers[6]:
                    temp = numbers[2]
                    numbers[2] = numbers[6]
                    numbers[6] = temp

                if numbers[4] > numbers[8]:
                    temp = numbers[2]
                    numbers[2] = numbers[6]
                    numbers[6] = temp

                the_file.write(line)

How can I apply the change to the line itself before writing it? I also thought about using re.sub but couldn't manage to make it work.

Example without re:

input_filename = 'full_path_to_my_input_file.txt'
output_filename = 'full_path_to_my_output_file.txt'

with open(output_filename, 'a') as f_out:
    with open(input_filename, 'r') as f_in:
        for line in f_in:
            records = line.strip().split(',')
            if float(records[1]) > float(records[3]):
                records[1], records[3] = records[3], records[1]
            if float(records[2]) > float(records[4]):
                records[2], records[4] = records[4], records[2]
            f_out.write(','.join(records) + '\n')

input:

data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0
data/processed/10/blueprint-0.png,3353.0,389.0,3350.0,388.0
data/processed/10/blueprint-0.png,952.0,724.0,1010.0,734.0

output:

data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0
data/processed/10/blueprint-0.png,3350.0,388.0,3353.0,389.0 ## swapped !!
data/processed/10/blueprint-0.png,952.0,724.0,1010.0,734.0
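
The two swap tests above can also be written as a sort of each coordinate pair. This is only a sketch of that variant, not part of the original answer: the function name fix_record is hypothetical, and it assumes every line has exactly one path followed by four numbers.

```python
def fix_record(line):
    """Return the line with each coordinate pair in ascending order."""
    path, x1, y1, x2, y2 = line.rstrip("\n").split(",")
    # sorted() with key=float compares numerically but keeps the original text,
    # so "1915.0" is written back exactly as it was read
    x1, x2 = sorted((x1, x2), key=float)
    y1, y2 = sorted((y1, y2), key=float)
    return ",".join((path, x1, y1, x2, y2))


print(fix_record("data/processed/10/blueprint-0.png,3353.0,389.0,3350.0,388.0\n"))
# data/processed/10/blueprint-0.png,3350.0,388.0,3353.0,389.0
```

Because the unpacking expects exactly five fields, a malformed line raises ValueError instead of being silently copied, which may or may not be what you want.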

I'd write the modified lines to a list as you go, then write the list out to a file at the end. That way you're not holding both files open while you process the first, which makes it a more atomic operation. I have also fixed the regex to handle floats, which I missed earlier.

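import re

# avoid the names "input" and "output": "input" shadows a builtin
input_path = "data/faulty.txt"
output_path = "data/true_values.txt"
new = []

with open(input_path) as f:
    for line in f:
        name, numberstr = line.split(',', 1)
        # match floats such as 1915.0 as well as plain integers
        numbers = re.findall(r'\d+\.\d+|\d+', numberstr)
        if len(numbers) >= 4:
            # compare as floats; comparing the strings would order them
            # character by character ("999.0" > "1000.0")
            if float(numbers[0]) > float(numbers[2]):
                numbers[0], numbers[2] = numbers[2], numbers[0]
            if float(numbers[1]) > float(numbers[3]):
                numbers[1], numbers[3] = numbers[3], numbers[1]

        new.append("{},{}".format(name, ','.join(numbers)))

with open(output_path, 'a') as the_file:
    for x in new:
        the_file.write(x + '\n')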
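
On the atomicity point: collecting the lines in a list avoids holding both files open, but the final write can still be interrupted halfway. One common way to tighten that is to write to a temporary file in the same directory and then rename it over the target with os.replace. The sketch below assumes that overwriting the output file (rather than appending with 'a') is acceptable; the helper name write_lines_atomically is hypothetical.

```python
import os
import tempfile


def write_lines_atomically(lines, output_path):
    """Write lines to a temporary file, then rename it over output_path."""
    directory = os.path.dirname(os.path.abspath(output_path))
    # the temporary file lives in the target directory so the rename
    # stays on the same filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in lines:
                tmp.write(line + "\n")
        # the output file is only replaced once every line has been written
        os.replace(tmp_path, output_path)
    except BaseException:
        # clean up the partial temporary file, then re-raise
        os.remove(tmp_path)
        raise
```

It would be called in place of the final loop, for example write_lines_atomically(new, "data/true_values.txt").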

I believe it's possible without using re.

Try running this:

path = 'data/faulty.txt'
with open(path) as f, open('output.txt', 'w') as outputFile:
    for line in f:
        # strip the trailing newline first so it never travels with a swapped value
        lineArr = line.rstrip("\n").split(",")
        if float(lineArr[1]) > float(lineArr[3]):
            lineArr[1], lineArr[3] = lineArr[3], lineArr[1]
        if float(lineArr[2]) > float(lineArr[4]):
            lineArr[2], lineArr[4] = lineArr[4], lineArr[2]
        # add the newline back after joining, not as an extra list element
        outputFile.write(",".join(lineArr) + "\n")

Don't use regexps when they're not the right tool. You obviously have a csv format, so use the csv module. Also, you need to convert your "numbers" to actual numbers - what you read in are strings not numbers. And finally, once you have parsed and possibly fixed a line, you have to recreate the new one from "fixed" values before you write it back:

# XXX untested code, may contain typos or small bugs
import csv

inpath = 'data/faulty.txt'
outpath = 'data/true_values.txt'
# newline='' lets the csv module handle line endings itself
with open(inpath, newline='') as infile, open(outpath, 'a', newline='') as outfile:

    # please check the csv doc for the correct options for your file format
    reader = csv.reader(infile, delimiter=",")
    # lineterminator="\n" because csv.writer defaults to "\r\n"
    writer = csv.writer(outfile, delimiter=",", lineterminator="\n")

    for row in reader:
        # skip blank lines, which the reader returns as empty lists
        if not row:
            continue

        # split the path from the numbers
        imagepath, nums = row[0], row[1:]

        # convert numbers to floats so we have
        # meaningful comparisons
        nums = [float(num) for num in nums]

        # swap the numbers if necessary
        if nums[0] > nums[2]:
            nums[2], nums[0] = nums[0], nums[2]
        if nums[1] > nums[3]:
            nums[3], nums[1] = nums[1], nums[3]

        # recreate the fixed row and write it
        newrow = [imagepath] + nums
        writer.writerow(newrow)
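
The point about converting the values to real numbers can be seen in a short check. The two values here are made up for illustration; they are not taken from the data file.

```python
a, b = "999.0", "1000.0"

# strings are compared character by character, and "9" sorts after "1"
print(a > b)                # True

# floats are compared by value
print(float(a) > float(b))  # False
```

With string comparison, a correct pair such as 999.0 and 1000.0 would be "fixed" into the wrong order.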

path = 'data/faulty.txt'
with open(path, "r") as f, open('data/true_values.txt', "a") as the_file:
    for line in f:
        lineArr = line[:-1].split(",")
        if float(lineArr[1]) > float(lineArr[3]):
            lineArr[1], lineArr[3] = lineArr[3], lineArr[1]
        if float(lineArr[2]) > float(lineArr[4]):
            lineArr[2], lineArr[4] = lineArr[4], lineArr[2]
        the_file.write(",".join(lineArr) + "\n")

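lineArr = line[:-1].split(",") drops the trailing newline character so that it does not stay attached to the last element of the list. Otherwise, if the last number is swapped, the newline moves with it and ends up in the middle of the output line. To see why this matters, try the inputs I provided with lineArr = line.split(",") instead. Note that line[:-1] assumes every line, including the last one, ends with a newline; line.rstrip("\n") is safer if the file may not end with one.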

Using split you get a list, which can be indexed to pick out the values. These are converted to float and compared, and swapped if they are not in the order you want.

data/faulty.txt :

data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0
data/processed/10/blueprint-0.png,1234.5,387.0,1222.1,380.0
data/processed/10/blueprint-0.png,3350.0,387.0,3353.0,388.0

After running the python script.

data/true_values.txt :

data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0
data/processed/10/blueprint-0.png,1222.1,380.0,1234.5,387.0  #Swapped
data/processed/10/blueprint-0.png,3350.0,387.0,3353.0,388.0
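
The input and output shown above can be reproduced without touching the disk by using in-memory files. This is only a sketch: the function name fix_stream is hypothetical, and io.StringIO stands in for the two real files.

```python
import io


def fix_stream(infile, outfile):
    """Copy lines from infile to outfile, swapping out-of-order pairs."""
    for line in infile:
        fields = line.rstrip("\n").split(",")
        if float(fields[1]) > float(fields[3]):
            fields[1], fields[3] = fields[3], fields[1]
        if float(fields[2]) > float(fields[4]):
            fields[2], fields[4] = fields[4], fields[2]
        outfile.write(",".join(fields) + "\n")


faulty = io.StringIO(
    "data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0\n"
    "data/processed/10/blueprint-0.png,1234.5,387.0,1222.1,380.0\n"
    "data/processed/10/blueprint-0.png,3350.0,387.0,3353.0,388.0\n"
)
cleaned = io.StringIO()
fix_stream(faulty, cleaned)
print(cleaned.getvalue(), end="")
```

Since fix_stream only needs objects that can be iterated and written to, the same function works with the file objects returned by open().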

This should work (stripped away the file access to be able to replicate the problem):

# named input_lines rather than input, which would shadow the builtin
input_lines = ['data/processed/10/blueprint-0.png,1915.0,387.0,1933.0,402.0',
'data/processed/10/blueprint-0.png,3353.0,387.0,3350.0,388.0']

output = []
for input_line in input_lines:
    numbers = input_line.split(',')
    if numbers:
        # compare as floats, but swap the original strings
        if float(numbers[1]) > float(numbers[3]):
            numbers[1], numbers[3] = numbers[3], numbers[1]

        if float(numbers[2]) > float(numbers[4]):
            numbers[2], numbers[4] = numbers[4], numbers[2]

    output.append(','.join(numbers))

print(output)
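
Since the data is described as not perfectly clean, it may also help to guard against lines that cannot be parsed at all. The following is a sketch under assumptions that are not in the question: a valid line has exactly five comma-separated fields, and unparseable lines are collected by line number rather than written out. The name clean_line and the sample lines are made up for illustration.

```python
def clean_line(line):
    """Return the fixed line, or None if the line cannot be parsed."""
    fields = line.strip().split(",")
    if len(fields) != 5:
        return None
    try:
        x1, y1, x2, y2 = (float(value) for value in fields[1:])
    except ValueError:
        # at least one field is not a number
        return None
    if x1 > x2:
        fields[1], fields[3] = fields[3], fields[1]
    if y1 > y2:
        fields[2], fields[4] = fields[4], fields[2]
    return ",".join(fields)


sample_lines = [
    "a.png,3353.0,387.0,3350.0,388.0",  # x values out of order
    "a.png,1915.0,387.0,1933.0",        # a field is missing
    "a.png,1915.0,abc,1933.0,402.0",    # not a number
]
cleaned, rejected = [], []
for number, line in enumerate(sample_lines, start=1):
    result = clean_line(line)
    if result is None:
        rejected.append(number)
    else:
        cleaned.append(result)

print(cleaned)   # ['a.png,3350.0,387.0,3353.0,388.0']
print(rejected)  # [2, 3]
```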


 