
Remove characters and duplicates from a CSV file and write to a new file

I'm reading from a csv file that looks like this:

[152.60115606936415][152.60115606936415, 13181.818181818182][152.60115606936415, 13181.818181818182, 1375055.330634278][152.60115606936415, 13181.818181818182, 1375055.330634278, 89.06882591093118]

What I want to do is remove the characters ('[' and ']'), turn the spaces into newlines, and write the result into my new txt file:

import csv

to_file = open("t_put.txt", "w")
with open("t_put_val.20181026052328.csv", "r") as f:
    for row in csv.reader(f):
        value2 = " ".join(row)[1:-1]        # strip the leading '[' and the trailing ']'
        value = value2.replace("  ", "\n")  # replace double spaces with newlines
        value3 = value.replace("][", " ")   # replace the '][' separators
        value4 = value3.replace(" ", "\n")  # put every remaining value on its own line
        print(value4)
        to_file.write(value4)  # write to file
to_file.close()

With this code I am able to remove the characters, but duplicates still show up. I was thinking of using the set() method, but it either doesn't work as intended or only prints the last four values, and it might not work for a larger data set.

By splitting on ']', you can group each of the lists that reside inside the CSV.

# Open up the csv file
with open("t_put_val.20181026052328.csv", "r") as f_h:
    rows = [row.lstrip('[').split(", ")
            # For each line in the file (there's just one)
            for line in f_h.readlines()
            # Don't want a blank line
            if not len(line) == 0
            # Split the line by trailing ']'s
            for row in line.split(']')
            # Don't want the last blank list
            if not len(row) == 0
            ]

# Print out all unique values
unique_values = set(item for row in rows for item in row)
for value in unique_values:
    print(value)

# Output
with open("t_put.txt", 'w') as f_h:
    f_h.writelines('%s\n' % ', '.join(row) for row in rows)
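
The writelines call above writes the grouped rows back out, duplicates included. If the goal is instead to write only the deduplicated values, one possible variant of the last step (note that iterating a set yields an arbitrary order):

# Write each unique value on its own line (order will be arbitrary)
with open("t_put.txt", 'w') as f_h:
    f_h.writelines('%s\n' % value for value in unique_values)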

A set is an unordered data structure.
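
A quick illustration; the exact order you get back is an implementation detail and may differ between runs:

>>> set(['b', 'a', 'c', 'a'])
{'c', 'b', 'a'}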

A better way is to convert your string output to a list object and then use Python's set() method, which is meant for this:

>>> my_int = [152.60115606936415, 13181.818181818182, 152.60115606936415, 13181.818181818182, 1375055.330634278, 152.60115606936415]

You can apply set() directly to the list to get the duplicates removed:

>>> set(my_int)
{152.60115606936415, 13181.818181818182, 1375055.330634278}

However, if you don't want the above and would rather have a list as output, you can do the following:

>>> list(set(my_int))
[152.60115606936415, 13181.818181818182, 1375055.330634278]

Using collections.OrderedDict:

As per the conversation, the required output should be in ordered form, hence OrderedDict is used to preserve the order of the dataset.

from collections import OrderedDict
import csv

to_file = open("ttv", "w")
with open("tt", "r") as f:
    for row in csv.reader(f):
        value2 = " ".join(row)[1:-1]        # strip the leading '[' and the trailing ']'
        value = value2.replace("  ", "\n")  # replace double spaces with newlines
        value3 = value.replace("][", " ")   # replace the '][' separators
        value4 = value3.replace(" ", "\n")
        # fromkeys() drops duplicates while keeping first-seen order
        value4 = OrderedDict.fromkeys(value4.split())
        #value4 = sorted(set(value4.split()))
        for line in value4:
            for new_val in line.split(','):
                print(new_val)
                to_file.write(new_val + '\n')  # write each unique value on its own line
to_file.close()

result:

152.60115606936415
13181.818181818182
1375055.330634278
89.06882591093118

If I'm correct in assuming that you just want to write every unique value to a new line in your output file, this will also preserve the original order:

from collections import OrderedDict

with open('t_put_val.20181026052328.csv', 'r') as infile, open('t_put.txt', 'w') as outfile:
    data = infile.read()
    # List of characters to remove
    to_replace = ['[', ']', ' ']
    for char in to_replace:
        if char in data:
            data = data.replace(char, '')
    # fromkeys() keeps only the first occurrence of each value, preserving order
    unique_list = list(OrderedDict.fromkeys(data.split(',')))
    for i in unique_list:
        outfile.write(i + '\n')

Yields this in the txt file:

152.60115606936415
13181.818181818182
1375055.330634278
89.06882591093118
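
For comparison, the same idea can be written with a regular expression instead of chained replace() calls. This is just a sketch, assuming the file contains nothing but bracketed, comma-separated numbers:

from collections import OrderedDict
import re

with open('t_put_val.20181026052328.csv', 'r') as infile, open('t_put.txt', 'w') as outfile:
    # Grab every run of digits and dots; brackets, commas and spaces are skipped
    numbers = re.findall(r'[\d.]+', infile.read())
    # fromkeys() keeps only the first occurrence of each value, preserving order
    for value in OrderedDict.fromkeys(numbers):
        outfile.write(value + '\n')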

You can use your script combined with the Linux command line as shown below. If you run your script on its own, the output would be:

./yourscript.py

152.60115606936415
152.60115606936415
13181.818181818182
152.60115606936415
13181.818181818182
1375055.330634278
152.60115606936415
13181.818181818182
1375055.330634278
89.06882591093118

But if you use pipes in the shell and write your output to a file, then duplicates can be removed easily as follows:

./yourscript.py |sort|uniq > yourresultfile

If you look at the contents of the file, it will look like this:

cat yourresultfile
13181.818181818182
1375055.330634278
152.60115606936415
89.06882591093118

In this way you can remove duplicates from your file.
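
Note that sort | uniq also reorders the lines, as the listing above shows. If the original order must be preserved, a common awk one-liner (an alternative to the pipeline above) prints only the first occurrence of each line:

./yourscript.py | awk '!seen[$0]++' > yourresultfile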

So if you want a pythonic way of doing this, below is a simple way of achieving your desired output:

#!/usr/bin/python
import json

with open('input_file.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')

str1 = data.replace('[', '')
str2 = str1.replace(']', ',')   # turn closing brackets into separators
list1 = str2.split(',')
list2 = list(set(list1))        # set() removes the duplicates (order is lost)
list3 = [x.strip() for x in list2 if x.strip()]
list4 = [float(i) for i in list3]
with open('out_put_file.txt', 'w') as f:
    f.write(json.dumps(list4))

The file out_put_file.txt contains output as follows:

[13181.818181818182, 1375055.330634278, 89.06882591093118, 152.60115606936415]
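
The order differs from the input because set() is unordered. If the original order matters, one small tweak (assuming Python 3.7+, where plain dicts preserve insertion order) is to build list2 with dict.fromkeys instead:

list2 = list(dict.fromkeys(list1))  # deduplicate while keeping first-seen order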
