Splitting, merging, sorting CSV

I have several CSV files containing measurements from several sensors

s1.CSV :

date;hour;source;values
01/25/12;10:20:00;a; 88 -84 27
01/25/12;10:30:00;a; -80
01/25/12;10:50:00;b; -96 3 -88
01/25/12;09:00:00;b; -97 101
01/25/12;09:10:00;c; 28

s2.CSV :

date;hour;source;values
01/25/12;10:20:00;a; 133
01/25/12;10:25:00;a; -8 -5

I'd like to create one CSV per source (a/b/c), with the measurements from each sensor in a separate column, sorted by date and hour:

a.CSV :

date;hour;source;s1;s2
01/25/12;10:20:00;a; 88 -84 27; 133
01/25/12;10:25:00;a; ; -8 -5
01/25/12;10:30:00;a; -80;

...

I'm stuck here :

import glob
import csv
import os
os.system('cls')

sources = dict()
sensor = 0

filelist = glob.glob("*.csv")

for f in filelist:
    reader = csv.DictReader(open(f),delimiter=";")
    for row in reader:
#       date = row['date'] # date later
        hour = row['hour']
        val = row['values']
        source = row['source']

        if source not in sources: # new source
            sources[source] = list()

        sources[source].append({'hour': hour, 'sensor' + str(sensor): val})

    sensor+=1

I'm not sure this data structure is good for sorting, and I also feel like I'm repeating the column names.

Using the data you provided, I cooked up something using Pandas. Please see the code below.

The output, granted, is non-ideal, as the hour and source get repeated within a column. As I am learning too, I'd also welcome any expert input on whether Pandas can do what the OP is asking for!

In [1]: import pandas as pd

In [2]: s1 = pd.read_csv('s1.csv', delimiter=';', parse_dates=True)

In [3]: s1
Out[3]: 
       date      hour source      values
0  01/25/12  10:20:00      a   88 -84 27
1  01/25/12  10:30:00      a         -80
2  01/25/12  10:50:00      b   -96 3 -88
3  01/25/12  09:00:00      b     -97 101
4  01/25/12  09:10:00      c          28

In [4]: s2 = pd.read_csv('s2.csv', delimiter=';', parse_dates=True)

In [5]: s2
Out[5]: 
       date      hour source  values
0  01/25/12  10:20:00      a     133
1  01/25/12  10:25:00      a   -8 -5

In [6]: joined = s1.append(s2)

In [7]: joined
Out[7]: 
       date      hour source      values
0  01/25/12  10:20:00      a   88 -84 27
1  01/25/12  10:30:00      a         -80
2  01/25/12  10:50:00      b   -96 3 -88
3  01/25/12  09:00:00      b     -97 101
4  01/25/12  09:10:00      c          28
0  01/25/12  10:20:00      a         133
1  01/25/12  10:25:00      a       -8 -5

In [8]: grouped = joined.groupby('hour').sum() 

In [9]: grouped.to_csv('a.csv')

In [10]: grouped
Out[10]: 
                      date source          values
hour                                             
09:00:00          01/25/12      b         -97 101
09:10:00          01/25/12      c              28
10:20:00  01/25/1201/25/12     aa   88 -84 27 133
10:25:00          01/25/12      a           -8 -5
10:30:00          01/25/12      a             -80
10:50:00          01/25/12      b       -96 3 -88
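
The doubled entries for 10:20:00 happen because .sum() on string columns simply concatenates the strings within each group. To get one column per sensor file, as in the desired a.csv, a pivot/unstack approach should work better. Below is a rough sketch, assuming a reasonably recent pandas (where DataFrame.append has been removed in favor of pd.concat) and assuming the input files are named s1.csv, s2.csv, and so on:

import glob
import os

import pandas as pd

# Read every sensor file and tag each row with the sensor name from the filename
frames = []
for fn in glob.glob('s*.csv'):   # assumption: inputs are named s1.csv, s2.csv, ...
    sensor = os.path.splitext(os.path.basename(fn))[0]
    df = pd.read_csv(fn, delimiter=';')
    df['sensor'] = sensor
    frames.append(df)

combined = pd.concat(frames)

# One column per sensor, keyed by (date, hour, source)
wide = (combined.set_index(['date', 'hour', 'source', 'sensor'])['values']
                .unstack('sensor')
                .reset_index())

# One output file per source, sorted by date and hour
# (dates are still MM/DD/YY strings here, so the sort is only lexicographic)
for source, group in wide.groupby('source'):
    group.sort_values(['date', 'hour']).to_csv('{}.csv'.format(source),
                                               sep=';', index=False)

On the sample data this gives a.csv with the s1 and s2 readings in separate columns and empty cells where a sensor has no reading for that time.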

If I understand correctly, you have multiple files, each corresponding to a given "sensor", with the identity of the sensor in the filename. You want to read the files, then write them out into separate files again, this time divided by "source", with the data from the different sensors combined into separate columns of the final rows.

Here's what I think you want to do:

  1. Read the data in and build a nested dictionary data structure, as follows:
     - The top-level key would be the source (e.g. 'a').
     - The second level would be keyed by a (date, time) tuple.
     - The innermost level would be keyed by sensor, taken from the filename, and would hold the actual sensor readings as values.
  2. You'd also want to keep track of all the sensors that have been seen.
  3. To write the data out, you'd loop over the items of the outermost dictionary, creating a new output file for each one.
     - The rows of each file would be determined by sorting the keys of the second-level dictionary.
     - The last values of each row would come from the innermost dict, filling in an empty string for any missing sensor.

Here's some code:

from collections import defaultdict
from datetime import datetime
import csv
import glob
import os

# data structure is data[source][date, time][sensor] = value, with "" as default value
data = defaultdict(lambda: defaultdict(lambda: defaultdict(str)))
sensors = []

filelist = glob.glob("*.csv")

# read old files
for fn in filelist:
    sensor = os.path.splitext(fn)[0]
    sensors.append(sensor)
    with open(fn, newline='') as f:
        reader = csv.DictReader(f, delimiter=";")
        for row in reader:
            date = datetime.strptime(row['date'], '%m/%d/%y')
            data[row['source']][date, row['hour']][sensor] = row['values']

sensors.sort() # note, this may not give the best sort order
header = ['date', 'hour', 'source'] + sensors

for source, source_data in data.items():
    fn = "{}.csv".format(source)
    with open(fn, 'w', newline='') as f:
        writer = csv.writer(f, delimiter=";")
        writer.writerow(header)
        for (date, time), hour_data in sorted(source_data.items()):
            values = [hour_data[sensor] for sensor in sensors]
            writer.writerow([date.strftime('%m/%d/%y'), time, source] + values)

I only convert the date field to an internal type because otherwise sorting based on dates won't work correctly (dates in January 2013 would appear before those in February 2012). In the future, consider using ISO 8601 style date formatting, YYYY-MM-DD, which can safely be sorted as a string. The rest of the values are handled only as strings, with no interpretation.
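
As a quick illustration with plain string sorting on made-up dates:

# MM/DD/YY strings do not sort chronologically ...
print(sorted(['01/25/13', '02/25/12']))      # ['01/25/13', '02/25/12'] -- Jan 2013 before Feb 2012

# ... but ISO 8601 YYYY-MM-DD strings do
print(sorted(['2013-01-25', '2012-02-25']))  # ['2012-02-25', '2013-01-25'] -- chronological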

The code assumes that the sensor names can be ordered lexicographically. This is likely if you only have a few of them, e.g. s1 and s2. However, if you have an s10, it will be sorted ahead of s2. To solve this you'd need a "natural" sort, which is more complicated than I can solve here (but see this recent question for more info).
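
If you do need it, one common recipe is to split each name into text and digit chunks and compare the digit chunks as integers. A small sketch:

import re

def natural_key(name):
    # 's10' -> ['s', 10, ''] so the numeric part compares as a number
    return [int(tok) if tok.isdigit() else tok
            for tok in re.split(r'(\d+)', name)]

print(sorted(['s10', 's2', 's1']))                   # ['s1', 's10', 's2']
print(sorted(['s10', 's2', 's1'], key=natural_key))  # ['s1', 's2', 's10']

The sensor list above would then be sorted with sensors.sort(key=natural_key).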

One final warning: this solution may do bad things if you run it multiple times in the same folder, because the output files, e.g. a.csv, will be seen by glob.glob('*.csv') as input files the next time you run it.
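
One way to guard against that is to accept only files that match the sensor naming scheme as inputs, for example (assuming the inputs really are named s1.csv, s2.csv, and so on):

import glob
import os
import re

# Keep only files whose base name is 's' followed by digits (e.g. s1.csv, s12.csv)
filelist = [fn for fn in glob.glob('*.csv')
            if re.match(r's\d+$', os.path.splitext(os.path.basename(fn))[0])]

Another option is simply to write the per-source files into a separate output directory.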
