
Read and write a large CSV in Python

I have a CSV with 12288+1 columns and want to reduce it to 4096+1 columns.

Among these 12288+1 columns, each consecutive group of three holds the same value, and the last value is a bit, 0 or 1.

I need to keep the last value and take just one value from each repeated group of three.

My original CSV has 300 rows. I don't know how to process the other rows; my script only handles the first row.

Original CSV: 3,3,3,5,5,5,7,7,7,10,10,10 ... 20,20,20,50,50,50,1

Desired final CSV: 3,5,7,10 ... 20,50,1

import csv

count, num = 0
a = ''
with open('data.csv','rb') as filecsv:
    reader = csv.reader(filecsv)
    for row in reader:
        while count < 12290:
            a = a + str(row[:][count])+','
            count = count + 3
            num = num + 1
print num
print a

The print statements are only there to give an idea of the output.

Thanks for any help

If you don't mind using a library, Pandas will be able to do this for you nicely.

You can read a CSV with pandas.read_csv. The usecols parameter specifies which columns to keep, so you can use it to skip the repeated columns.

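import pandas

# Keep one column from each group of three (indices 1, 4, 7, ...), plus the last column.
columns = list(range(1,12288,3))
columns.append(12288)
# Assuming the file has no header row: header=None keeps the first line as data.
data = pandas.read_csv('data.csv', usecols=columns, header=None)
data.to_csv('new_data.csv', index=False, header=False)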
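
The hard-coded 12288 can also be derived from the column count. A minimal sketch, assuming the layout described in the question (groups of three identical columns followed by one final column); the helper name columns_to_keep is made up for illustration and is not part of pandas:

```python
def columns_to_keep(n_columns, group_size=3):
    """Return zero-based indices: one column per group, plus the last column.

    Hypothetical helper (not a pandas API); assumes
    n_columns == k * group_size + 1.
    """
    if (n_columns - 1) % group_size != 0:
        raise ValueError("expected a multiple of group_size plus one column")
    last = n_columns - 1
    # range() stops before `last`, so the final column is appended explicitly.
    return list(range(0, last, group_size)) + [last]


# Small layout: 3 groups of three plus the trailing bit, 10 columns in total.
print(columns_to_keep(10))  # prints [0, 3, 6, 9]
```

The resulting list of integers could be passed as usecols to pandas.read_csv in place of the manually built list.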

If they are always groups of three, just throw 2 away.

Group into groups of 3 like so:

>>> row=range(9)
>>> [row[i:i+3] for i in range(0,len(row),3)]
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]

However, this will give you a group of fewer than 3 at the end if the length of row is not a multiple of 3:

>>> row=range(11)
>>> [row[i:i+3] for i in range(0,len(row),3)]
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10]]
                                    ^  ^   only two elements...

If the number of elements may not be a multiple of 3, use zip, which drops incomplete r,g,b groups:

>>> row=range(11)
>>> zip(*[iter(row)]*3)
[(0, 1, 2), (3, 4, 5), (6, 7, 8)]

Then unpack into r,g,b components:

import csv

with open('data.csv','rb') as filecsv:
    reader = csv.reader(filecsv)
    for row in reader:
        for r, g, b in [row[i:i+3] for i in range(0,len(row),3)]:
            pass  # use r or g or b, ignore the other two

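If you get a ValueError, the row length is not a multiple of 3 (or csv is not parsing the data correctly). The rows in the question have 12288+1 values, so the final group holds only the last bit and the three-way unpacking fails. Try using zip as described; note that zip also drops that trailing bit, so read it separately from row[-1]: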

import csv

with open('data.csv','rb') as filecsv:
    reader = csv.reader(filecsv)
    for row in reader:
        for r, g, b in zip(*[iter(row)]*3):
            pass  # use r or g or b, ignore the other two

(not tested...)
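
Putting the pieces together for the data in the question: a minimal sketch, written for Python 3 (the snippets above use Python 2 idioms such as opening the file in 'rb' mode), that keeps one value per complete group of three and re-attaches the trailing bit, which zip would otherwise drop. The function names and the in-memory sample are made up for illustration; for real files you would pass file objects opened with newline=''.

```python
import csv
import io


def reduce_row(row):
    """Keep the first value of each complete triple, then the row's last value."""
    body, last = row[:-1], row[-1]
    # zip(*[iter(body)] * 3) pulls from one shared iterator three times per
    # tuple, so it yields consecutive triples and drops an incomplete tail.
    firsts = [r for r, g, b in zip(*[iter(body)] * 3)]
    return firsts + [last]


def reduce_csv(in_file, out_file):
    """Read every row from in_file and write its reduced form to out_file."""
    writer = csv.writer(out_file, lineterminator="\n")
    for row in csv.reader(in_file):
        writer.writerow(reduce_row(row))


# Two sample rows standing in for the 300-row file.
sample = io.StringIO("3,3,3,5,5,5,7,7,7,1\n4,4,4,6,6,6,8,8,8,0\n")
result = io.StringIO()
reduce_csv(sample, result)
print(result.getvalue())
```

For the two sample rows this writes 3,5,7,1 and 4,6,8,0. Because the loop runs over every row the reader yields, all rows are processed rather than only the first.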

To remove consecutive duplicates, you could use the itertools.groupby function:

#!/usr/bin/env python
import csv
from itertools import groupby
from operator import itemgetter

with open('data.csv', 'rb') as input_file, open('output.csv', 'wb') as output_file:
    writer = csv.writer(output_file)
    for row in csv.reader(input_file):
        writer.writerow(map(itemgetter(0), groupby(row)))

It reads the input csv file and writes it to the output csv file with consecutive duplicates removed.

If the last data value could equal the trailing bit (0 or 1), the two would be merged as duplicates. In that case, remove duplicates only in row[:-1] (all but the last column) and append the last bit row[-1] to the result to preserve it:

from itertools import islice

no_dups = map(itemgetter(0), groupby(islice(row, len(row)-1)))
no_dups.append(row[-1])
writer.writerow(no_dups)
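
One caveat worth checking before relying on groupby: it merges every run of equal neighbours, so two adjacent groups that happen to hold the same value collapse into a single column and the row comes out shorter than expected. The sketch below, written for Python 3 with made-up helper names and sample data, compares the groupby approach with plain positional slicing, which always yields one value per group of three:

```python
from itertools import groupby


def dedupe_by_value(row):
    """Collapse runs of equal neighbours in row[:-1], then append the last value."""
    return [key for key, _ in groupby(row[:-1])] + [row[-1]]


def dedupe_by_position(row, group_size=3):
    """Take every group_size-th value of row[:-1], then append the last value."""
    return row[:-1:group_size] + [row[-1]]


distinct = ["3", "3", "3", "5", "5", "5", "1"]
repeated = ["3", "3", "3", "3", "3", "3", "1"]  # two adjacent groups, same value

print(dedupe_by_value(distinct), dedupe_by_position(distinct))
print(dedupe_by_value(repeated), dedupe_by_position(repeated))
```

For the second row the groupby version returns two values while the positional version returns three, so the positional form is probably the safer choice when every output row must have the same number of columns.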
