I have several CSV files, each containing measurements from one sensor:
s1.CSV :
date;hour;source;values
01/25/12;10:20:00;a; 88 -84 27
01/25/12;10:30:00;a; -80
01/25/12;10:50:00;b; -96 3 -88
01/25/12;09:00:00;b; -97 101
01/25/12;09:10:00;c; 28
s2.CSV :
date;hour;source;values
01/25/12;10:20:00;a; 133
01/25/12;10:25:00;a; -8 -5
I'd like to create one CSV per source (a/b/c), with each sensor's measurements in a separate column and the rows sorted by date and hour:
a.CSV :
date;hour;source;s1;s2
01/25/12;10:20:00;a; 88 -84 27; 133
01/25/12;10:25:00;a; ; -8 -5
01/25/12;10:30:00;a; -80;
...
I'm stuck here :
import glob
import csv
import os

os.system('cls')
sources = dict()
sensor = 0
filelist = glob.glob("*.csv")
for f in filelist:
    reader = csv.DictReader(open(f),delimiter=";")
    for row in reader:
        # date = row['date'] # date later
        hour = row['hour']
        val = row['values']
        source = row['source']
        if not sources.has_key(source): # new source
            sources[source] = list()
        #
        sources[source].append({'hour':hour, 'sensor'+`sensor`:val})
    sensor+=1
I'm not sure this data structure is a good one for sorting. I also feel like I'm repeating the column names.
Using the data you provided, I cooked up something using Pandas. Please see the code below.
The output, granted, is non-ideal, as the date and source values get concatenated within a column whenever two rows share the same hour. As I am learning too, I'd also welcome any expert input on whether Pandas can do what the OP is asking for!
In [1]: import pandas as pd
In [2]: s1 = pd.read_csv('s1.csv', delimiter=';', parse_dates=True)
In [3]: s1
Out[3]:
       date      hour source      values
0  01/25/12  10:20:00      a   88 -84 27
1  01/25/12  10:30:00      a         -80
2  01/25/12  10:50:00      b   -96 3 -88
3  01/25/12  09:00:00      b     -97 101
4  01/25/12  09:10:00      c          28
In [4]: s2 = pd.read_csv('s2.csv', delimiter=';', parse_dates=True)
In [5]: s2
Out[5]:
       date      hour source  values
0  01/25/12  10:20:00      a     133
1  01/25/12  10:25:00      a   -8 -5
In [6]: joined = pd.concat([s1, s2])  # DataFrame.append was removed in pandas 2.0
In [7]: joined
Out[7]:
       date      hour source      values
0  01/25/12  10:20:00      a   88 -84 27
1  01/25/12  10:30:00      a         -80
2  01/25/12  10:50:00      b   -96 3 -88
3  01/25/12  09:00:00      b     -97 101
4  01/25/12  09:10:00      c          28
0  01/25/12  10:20:00      a         133
1  01/25/12  10:25:00      a       -8 -5
In [8]: grouped = joined.groupby('hour').sum()
In [9]: grouped.to_csv('a.csv')
In [10]: grouped
Out[10]:
                      date source          values
hour
09:00:00          01/25/12      b         -97 101
09:10:00          01/25/12      c              28
10:20:00  01/25/1201/25/12     aa   88 -84 27 133
10:25:00          01/25/12      a           -8 -5
10:30:00          01/25/12      a             -80
10:50:00          01/25/12      b       -96 3 -88
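On the question of whether Pandas can produce the layout the OP asked for: one possible approach is to tag each row with the sensor it came from and then pivot, so that every sensor becomes its own column. The following is only a sketch under some assumptions. The sample data is inlined so it runs on its own, and the names `load`, `table_for`, `SENSORS` and the `when` column are made up for illustration. It also assumes each sensor has at most one row per timestamp and source, since `pivot` raises an error on duplicates.

```python
import io
import pandas as pd

# Inline copies of the sample files from the question, so the sketch runs standalone.
S1 = """date;hour;source;values
01/25/12;10:20:00;a; 88 -84 27
01/25/12;10:30:00;a; -80
01/25/12;10:50:00;b; -96 3 -88
01/25/12;09:00:00;b; -97 101
01/25/12;09:10:00;c; 28
"""
S2 = """date;hour;source;values
01/25/12;10:20:00;a; 133
01/25/12;10:25:00;a; -8 -5
"""
SENSORS = ["s1", "s2"]  # illustrative; in practice derived from the file names


def load(text, sensor):
    """Read one sensor file as strings and tag every row with its sensor name."""
    frame = pd.read_csv(io.StringIO(text), sep=";", dtype=str)
    frame["values"] = frame["values"].str.strip()
    frame["sensor"] = sensor
    return frame


combined = pd.concat([load(S1, "s1"), load(S2, "s2")], ignore_index=True)
# Build a real timestamp so rows sort chronologically rather than as MM/DD/YY text.
combined["when"] = pd.to_datetime(
    combined["date"] + " " + combined["hour"], format="%m/%d/%y %H:%M:%S"
)


def table_for(source):
    """One row per timestamp and one column per sensor, for a single source."""
    rows = combined[combined["source"] == source]
    wide = rows.pivot(index="when", columns="sensor", values="values")
    # Keep a column for every sensor, even one with no rows for this source.
    return wide.reindex(columns=SENSORS).sort_index().fillna("")


a_table = table_for("a")
print(a_table)
```

Something like `a_table.to_csv("a.csv", sep=";")` would then write the result out, although the date and hour end up in a single timestamp column rather than the two columns shown in the question.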
If I understand correctly, you have multiple files, each corresponding to a given "sensor", with the identity of the sensor in the filename. You want to read the files, then write the data out into separate files again, this time divided by "source", with the data from the different sensors combined into columns of the same rows.
Here's what I think you want to do: read every file, group the rows by source (e.g. 'a'), and within each source key the values by a (date, time) tuple. Here's some code:
from collections import defaultdict
from datetime import datetime
import csv
import glob
import os

# data structure is data[source][date, time][sensor] = value, with "" as default value
data = defaultdict(lambda: defaultdict(lambda: defaultdict(str)))
sensors = []
filelist = glob.glob("*.csv")

# read old files (the sensor name is the file name without its extension)
for fn in filelist:
    sensor = os.path.splitext(fn)[0]
    sensors.append(sensor)
    with open(fn, newline='') as f:  # Python 3: csv wants text mode with newline=''
        reader = csv.DictReader(f, delimiter=";")
        for row in reader:
            date = datetime.strptime(row['date'], '%m/%d/%y')
            data[row['source']][date, row['hour']][sensor] = row['values']

sensors.sort() # note, this may not give the best sort order
header = ['date', 'hour', 'source'] + sensors

# write one new file per source, rows sorted by (date, time)
for source, source_data in data.items():
    fn = "{}.csv".format(source)
    with open(fn, 'w', newline='') as f:
        writer = csv.writer(f, delimiter=";")
        writer.writerow(header)
        for (date, time), hour_data in sorted(source_data.items()):
            values = [hour_data[sensor] for sensor in sensors]
            writer.writerow([date.strftime('%m/%d/%y'), time, source] + values)
I only convert the date field to an internal type because otherwise sorting based on dates won't work correctly (as MM/DD/YY strings, dates in January 2013 would appear before those in February 2012). In the future, consider using ISO 8601 style date formatting, YYYY-MM-DD, which can be safely sorted as a string. The rest of the values are handled only as strings with no interpretation.
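To make the date-sorting point concrete, here is a small sketch with two made-up dates (they are not from the question's data). It compares sorting the raw MM/DD/YY strings, sorting by parsed `datetime` objects, and sorting ISO 8601 strings.

```python
from datetime import datetime

dates = ["02/15/12", "01/10/13"]  # February 2012, then January 2013 (made-up values)

# Sorted as plain MM/DD/YY text, the 2013 date wrongly comes first.
as_text = sorted(dates)

# Sorted by the parsed datetime, the order is chronological.
as_dates = sorted(dates, key=lambda d: datetime.strptime(d, "%m/%d/%y"))

# ISO 8601 strings sort chronologically even when compared as plain text.
as_iso = sorted(
    datetime.strptime(d, "%m/%d/%y").strftime("%Y-%m-%d") for d in dates
)

print(as_text)   # ['01/10/13', '02/15/12']
print(as_dates)  # ['02/15/12', '01/10/13']
print(as_iso)    # ['2012-02-15', '2013-01-10']
```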
The code assumes that the sensor values can be ordered lexicographically. This is likely if you only have a few of them, e.g. s1 and s2. However, if you have an s10, it will be sorted ahead of s2. To solve this you'll need a "natural" sort, which is more complicated than I can solve here (but see this recent question for more info).
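For what it's worth, a common way to approximate a natural sort is to split each name into text and number chunks and compare the numbers as integers. The helper below is only a sketch, and `natural_key` is a name made up here. It assumes sensor names are plain text with embedded non-negative integers, like s1 or s10.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so 's10' sorts after 's2'."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so the chunks at odd positions are always runs of digits.
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


sensors = ["s10", "s2", "s1"]
plain = sorted(sensors)                      # lexicographic order
natural = sorted(sensors, key=natural_key)   # numeric-aware order

print(plain)    # ['s1', 's10', 's2']
print(natural)  # ['s1', 's2', 's10']
```

In the code above this would mean calling `sensors.sort(key=natural_key)` instead of `sensors.sort()`.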
One final warning: This solution may do bad things if you run it multiple times in the same folder. That's because the output files, e.g. a.csv, will be seen by glob.glob('*.csv') as input files when you run again.
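One way around this, sketched below, is to write the per-source files into a subdirectory. A `*.csv` pattern passed to `glob.glob` only matches files directly in the given folder, so anything written one level down is not picked up on the next run. The function name `input_files` and the directory name `by_source` are hypothetical, and the demo uses a temporary folder with empty placeholder files.

```python
import glob
import os
import tempfile


def input_files(folder, output_dir_name="by_source"):
    """Return the sensor CSVs directly inside `folder`, plus an output directory.

    The "*.csv" pattern does not descend into subdirectories, so files
    written to the output directory are never mistaken for inputs.
    """
    out_dir = os.path.join(folder, output_dir_name)
    os.makedirs(out_dir, exist_ok=True)
    return sorted(glob.glob(os.path.join(folder, "*.csv"))), out_dir


# Demo in a throwaway folder with empty placeholder files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("s1.csv", "s2.csv"):
        open(os.path.join(folder, name), "w").close()

    first_run, out_dir = input_files(folder)
    open(os.path.join(out_dir, "a.csv"), "w").close()  # simulate an output file

    second_run, _ = input_files(folder)
    names = [os.path.basename(p) for p in second_run]

print(names)  # ['s1.csv', 's2.csv']
```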