Sorting a txt file (key:value on every line) - a problem with '\n'

I am trying to sort a txt file that looks like this:

byr:1983 iyr:2017
pid:796082981 cid:129 eyr:2030
ecl:oth hgt:182cm

iyr:2019
cid:314
eyr:2039 hcl:#cfa07d hgt:171cm ecl:#0180ce byr:2006 pid:8204115568

byr:1991 eyr:2022 hcl:#341e13 iyr:2016 pid:729933757 hgt:167cm ecl:gry

hcl:231d64 cid:124 ecl:gmt eyr:2039
hgt:189in
pid:#9c3ea1

and so on (1000+ lines), into this structure:

byr:value
iyr:value
eyr:value
hgt:value
hcl:value
ecl:value
pid:value
cid:value

byr:value
iyr:value
eyr:value
hgt:value
hcl:value
ecl:value
pid:value
cid:value

The order of byr, iyr, etc. doesn't matter, but every "set" of key:value pairs has to be separated by a blank line. My main problem is writing code that handles the file properly when a line contains more than one key:value element. I have made some progress, but the result is still not what it should be. The following code:

result_file = open('testresult.txt', 'w')
#list_of_lines = [] testing purpose


with open('input.txt', 'r') as f:
    for line in f:
        if line == "\n":
            #list_of_lines.append('\n') testing
            result_file.writelines('\n')
        else:
            for i in line.split(' '):
                if i[-1] == "n":
                    result_file.write(i)
                else:
                    result_file.write(i + '\n')
                #print(i) testing purpose

produces the result below:

byr:1983
iyr:2017

pid:796082981
cid:129
eyr:2030

ecl:oth
hgt:182cm


iyr:2019

cid:314

eyr:2039
hcl:#cfa07d
hgt:171cm
ecl:#0180ce
byr:2006
pid:8204115568


byr:1991
eyr:2022
hcl:#341e13
iyr:2016
pid:729933757
hgt:167cm
ecl:gry

As you can see, it doesn't work properly: for example, there should be no blank line between the first occurrence of byr and the first occurrence of hgt. I thought that the last if statement

if i[-1] == "n":
    result_file.write(i)
else:
    result_file.write(i + '\n')

would protect me from this situation, but I don't understand why it doesn't behave as I "predicted". Please help. Thanks in advance <3

Try this -

result_file = open('testresult.txt', 'w')
#list_of_lines = [] testing purpose


with open('input.txt', 'r') as f:
    for line in f:
        if line == '\n':
            #list_of_lines.append('\n') testing
            result_file.writelines('\n')
        else:
            # replace '\n' with ''
            line = line.replace('\n', '')
            for i in line.split(' '):
                result_file.writelines(i + '\n')

result_file.close()
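
The reason the check in the question does not help can be seen on a single line: the last token produced by split(' ') ends with the newline character '\n', which is one character and not the letter 'n'. The sketch below reproduces the question's inner loop on a string instead of a file; the helper name original_logic and the sample line are only for illustration.

```python
def original_logic(line):
    """Reproduce the question's inner loop, returning the text it would write."""
    out = ""
    for i in line.split(' '):
        if i[-1] == "n":       # compares with the letter 'n', not with '\n'
            out += i
        else:
            out += i + '\n'    # the last token already ends with '\n'
    return out

sample = "byr:1983 iyr:2017\n"
print(repr(sample.split(' ')))        # ['byr:1983', 'iyr:2017\n']
print(repr(original_logic(sample)))   # 'byr:1983\niyr:2017\n\n'
```

The doubled '\n' at the end is the unwanted blank line. The check also misfires the other way: a token such as hgt:189in in the middle of a line really does end with the letter 'n', so it would be written without any newline after it.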

Try this

with open("file.txt", "r") as f:
    lines = f.readlines()

print(lines)

# Split every line on spaces and collect all tokens in one flat list.
split_lines = []
for line in lines:
    split_lines.extend(line.split(" "))

print("split_lines")
print(split_lines)

# Notice that the last token of every line in split_lines still ends with '\n'.
# That is probably what causes the extra blank lines, so strip it.
cleaned_lines = [token.strip("\n") for token in split_lines]

print("Removed \\n")
print(cleaned_lines)

# A blank input line is now an empty string, so it is written back as a blank line.
with open("output.txt", "w") as f:
    for line in cleaned_lines:
        f.write(line + "\n")

Having this in file.txt:

byr:1983 iyr:2017
pid:796082981 cid:129 eyr:2030
ecl:oth hgt:182cm

iyr:2019
cid:314
eyr:2039 hcl:#cfa07d hgt:171cm ecl:#0180ce byr:2006 pid:8204115568

byr:1991 eyr:2022 hcl:#341e13 iyr:2016 pid:729933757 hgt:167cm ecl:gry

hcl:231d64 cid:124 ecl:gmt eyr:2039
hgt:189in
pid:#9c3ea1

Running the above script gives me this in output.txt:

byr:1983
iyr:2017
pid:796082981
cid:129
eyr:2030
ecl:oth
hgt:182cm

iyr:2019
cid:314
eyr:2039
hcl:#cfa07d
hgt:171cm
ecl:#0180ce
byr:2006
pid:8204115568

byr:1991
eyr:2022
hcl:#341e13
iyr:2016
pid:729933757
hgt:167cm
ecl:gry

hcl:231d64
cid:124
ecl:gmt
eyr:2039
hgt:189in
pid:#9c3ea1

Hope this is what you needed?

You can delete every '\n' with replace. Note that this version also drops the blank lines between the sets.

result_file = open('testresult.txt', 'w')
#list_of_lines = [] testing purpose


with open('input.txt', 'r') as f:
    for line in f:
        line = line.replace('\n', '')
        if line != '':
            for i in line.split(' '):
                result_file.write(i+'\n')

result_file.close()

And this is the result:

byr:1983
iyr:2017
pid:796082981
cid:129
eyr:2030
ecl:oth
hgt:182cm
iyr:2019
cid:314
eyr:2039
hcl:#cfa07d
hgt:171cm
ecl:#0180ce
byr:2006
pid:8204115568
byr:1991
eyr:2022
hcl:#341e13
iyr:2016
pid:729933757
hgt:167cm
ecl:gry
hcl:231d64
cid:124
ecl:gmt
eyr:2039
hgt:189in
pid:#9c3ea1
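
The result above no longer has blank lines between the sets, while the question asks for them to be kept. A minimal sketch of one way to keep them, assuming the sets are separated by empty lines in the input: str.split() called without an argument splits on any whitespace and discards the trailing '\n', so no replace is needed. The function name one_pair_per_line and the sample data are illustrative.

```python
def one_pair_per_line(lines):
    """Return output lines: one key:value per line, blank lines kept as separators."""
    result = []
    for line in lines:
        tokens = line.split()      # no argument: splits on any whitespace, drops '\n'
        if tokens:
            result.extend(tokens)
        else:
            result.append("")      # a blank input line stays a blank separator
    return result

sample = ["byr:1983 iyr:2017\n", "ecl:oth hgt:182cm\n", "\n", "iyr:2019\n"]
print("\n".join(one_pair_per_line(sample)))
```

With a real file, the function can be fed the file object directly, and each returned item written back followed by '\n'.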

A regular expression may be useful to achieve your result without having to deal with the end-of-line character.

Assuming there is no whitespace inside your pairs, you could use the following script:

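import re
from contextlib import ExitStack

# Placeholder paths (taken from the question); replace them with your own files.
input_path = "input.txt"
output_path = "testresult.txt"

REGEX = re.compile(r"[^:\s]+:\S+")
with ExitStack() as stack:
    fr = stack.enter_context(open(input_path, encoding="UTF_8"))
    fw = stack.enter_context(open(output_path, mode="w", encoding="UTF_8"))
    for line in fr:
        # findall scans the whole line, so leading spaces do not matter.
        items = REGEX.findall(line)
        if not items:
            # No key:value pair on this line: keep it as a blank separator.
            fw.write("\n")
            continue
        for item in items:
            fw.write(f"{item}\n")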

The regular expression searches for "anything that is neither a colon nor a whitespace character, followed by a colon, followed by anything that is not a whitespace character". That lets the script focus on the pairs only.

Whitespace characters include spaces, tabs and end of line characters.
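
As a quick illustration of the pattern, the snippet below applies it to two made-up lines, one mixing a space, a tab and a trailing newline, and one that is empty:

```python
import re

REGEX = re.compile(r"[^:\s]+:\S+")

# Space, tab and newline all act as separators between pairs.
print(REGEX.findall("eyr:2039 hcl:#cfa07d\thgt:171cm\n"))
# ['eyr:2039', 'hcl:#cfa07d', 'hgt:171cm']

# A blank line contains no pair at all.
print(REGEX.findall("\n"))
# []
```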

ExitStack lets a single with statement manage both files and guarantees that both are closed when the block ends. It is a convenience, not a performance optimization.
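
For just two files, a plain with statement listing both open calls is an equivalent alternative. The sketch below wraps the same regular expression in a function; the name convert and its parameters are hypothetical, not part of the answer above.

```python
import re

REGEX = re.compile(r"[^:\s]+:\S+")

def convert(src_path, dst_path):
    """Write one key:value pair per line, keeping blank lines as separators."""
    with open(src_path, encoding="utf-8") as fr, \
         open(dst_path, "w", encoding="utf-8") as fw:
        for line in fr:
            items = REGEX.findall(line)
            # A line without any pair yields an empty join, i.e. a blank line.
            fw.write("\n".join(items) + "\n")
```

Calling convert("input.txt", "testresult.txt") would process the files named in the question.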
