
Python: sum a column across multiple CSV files, with averages per week and per place

I'm new to Python, and in a directory I have two CSV files:

file1.csv

Id place,Date and hour, Value
1,2018.09.17.12.54,200000
2,2018.09.18.14.16,150000
1,2018.09.19.15.06,78000
3,2018.09.17.16.26,110000
2,2018.09.20.13.54,200000
3,2018.09.17.14.16,150000
1,2018.09.21.12.54,200000

file2.csv

Id place,Date and hour, Value
1,2018.09.24.12.54,200000
3,2018.09.24.14.16,150000
1,2018.09.24.15.06,78000
2,2018.09.26.16.26,110000
1,2018.09.27.12.54,200000
3,2018.09.25.14.16,150000
1,2018.09.28.12.54,200000
3,2018.09.28.14.16,150000

I want to read all the CSV files in the directory and save the following information to new CSV files:

  • the sum of the Value column across the files

output

Id place, Value
1, 1 156 000
2, 460 000
3, 710 000
  • the average Value per week

output

Week, average Value
1 ,  155428,57   (1088000 / 7)
2 ,  154750   (1238000 / 8)
  • the average Value per week and per place

output

Id place,Week, average Value
1, 1 , 159 333  (478000 / 3)
2, 1 , 175 000  (350000 / 2)
3, 1 , 130 000  (260 000/ 2)
1, 2 , 169 500  (678000 / 4) 
2, 2 , 110 000  (110000 / 1)
3, 2 , 150 000  (450000 / 3)

I have no idea how to do it. Thanks in advance.

I suggest using pandas:

import glob
import pandas as pd

#get all files
files = glob.glob('files/*.csv')
#create list of DataFrames; if necessary, remove trailing whitespace in csv headers
dfs = [pd.read_csv(fp).rename(columns=lambda x: x.strip()) for fp in files]
#join together all files
df = pd.concat(dfs, ignore_index=True)

#convert column to datetimes
df['Date and hour'] = pd.to_datetime(df['Date and hour'], format='%Y.%m.%d.%H.%M')
#get ISO week numbers and factorize so numbering starts at 1
#(Series.dt.weekofyear was removed in pandas 2.0; use dt.isocalendar().week)
df['week'] = pd.factorize(df['Date and hour'].dt.isocalendar().week)[0] + 1
print (df)
    Id place       Date and hour   Value  week
0          1 2018-09-17 12:54:00  200000     1
1          2 2018-09-18 14:16:00  150000     1
2          1 2018-09-19 15:06:00   78000     1
3          3 2018-09-17 16:26:00  110000     1
4          2 2018-09-20 13:54:00  200000     1
5          3 2018-09-17 14:16:00  150000     1
6          1 2018-09-21 12:54:00  200000     1
7          1 2018-09-24 12:54:00  200000     2
8          3 2018-09-24 14:16:00  150000     2
9          1 2018-09-24 15:06:00   78000     2
10         2 2018-09-26 16:26:00  110000     2
11         1 2018-09-27 12:54:00  200000     2
12         3 2018-09-25 14:16:00  150000     2
13         1 2018-09-28 12:54:00  200000     2
14         3 2018-09-28 14:16:00  150000     2

#aggregate sum
df1 = df.groupby('Id place', as_index=False)['Value'].sum()
print (df1)
   Id place    Value
0         1  1156000
1         2   460000
2         3   710000

#aggregate mean
df2 = df.groupby('week', as_index=False)['Value'].mean()
print (df2)
   week          Value
0     1  155428.571429
1     2  154750.000000

#aggregate mean per 2 columns
df3 = df.groupby(['Id place','week'], as_index=False)['Value'].mean()
print (df3)

   Id place  week          Value
0         1     1  159333.333333
1         1     2  169500.000000
2         2     1  175000.000000
3         2     2  110000.000000
4         3     1  130000.000000
5         3     2  150000.000000

#write output DataFrames to files
df1.to_csv('out1.csv', index=False)
df2.to_csv('out2.csv', index=False)
df3.to_csv('out3.csv', index=False)
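The question's expected output uses spaces as thousands separators (e.g. `1 156 000`). As an optional final step, the aggregated result could be formatted that way just before writing it out; a small sketch (note this converts the column to strings, so keep a numeric copy if you still need to compute with it):

```python
import pandas as pd

# Assumed to be the per-place sums produced by the groupby above
df1 = pd.DataFrame({'Id place': [1, 2, 3],
                    'Value': [1156000, 460000, 710000]})

# Format with comma thousands separators, then swap commas for spaces
df1['Value'] = df1['Value'].map('{:,.0f}'.format).str.replace(',', ' ')
print(df1)
# 'Value' is now e.g. '1 156 000'

df1.to_csv('out1_formatted.csv', index=False)
```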

Definitely not recommended, and pandas is by far the better approach, but the manual way of doing this would be to use defaultdicts to group your items and perform the calculations at the end.

Demo:

from csv import reader
from os import listdir
from collections import defaultdict
from datetime import datetime
from operator import itemgetter
from pprint import pprint

# Collect sums first in a defaultdict
sums = defaultdict(list)

# Collect dates separately since they are more complicated
dates = []

# Get all csv files and open them
for file in listdir("."):
    if file.endswith(".csv"):
        with open(file) as f:
            csv_reader = reader(f)

            # Skip headers
            next(csv_reader)

            # Separately get sums and dates stuff
            for place, date, value in csv_reader:
                sums[int(place)].append(int(value))
                dates.append(
                    (place, datetime.strptime(date, "%Y.%m.%d.%H.%M"), int(value))
                )

# Print out sum of columns
sum_column_values = {k: sum(v) for k, v in sums.items()}
pprint(sum_column_values)

# Get Minimum date to get weeknumber
min_date = min(map(itemgetter(1), dates)).date().isocalendar()[1]

# Collect weeks stuff in separate dicts
weeks = defaultdict(list)
place_weeks = defaultdict(list)

for place, date, value in dates:

    # Weeknumber calculation
    week_number = date.date().isocalendar()[1] - min_date + 1

    # Collect week stuff
    weeks[week_number].append(value)
    place_weeks[int(place), week_number].append(value)

# Print out week averages
week_averages = {k: sum(v) / len(v) for k, v in weeks.items()}
pprint(week_averages)

# Print out place/week averages
place_week_averages = {k: sum(v) / len(v) for k, v in place_weeks.items()}
pprint(place_week_averages)

Which give the following results stored in separate dictionaries:

# place sums
{1: 1156000, 2: 460000, 3: 710000}

# week averages
{1: 155428.57142857142, 2: 154750.0}

# place/week averages
{(1, 1): 159333.33333333334,
 (1, 2): 169500.0,
 (2, 1): 175000.0,
 (2, 2): 110000.0,
 (3, 1): 130000.0,
 (3, 2): 150000.0}
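Since the question also asks to save the results to new CSV files, a minimal sketch of writing these dictionaries out with `csv.writer` (the values are hard-coded here from the printed results so the snippet stands alone; in the script above you would use the dictionaries directly):

```python
import csv

# Results computed by the script above (hard-coded for this sketch)
sum_column_values = {1: 1156000, 2: 460000, 3: 710000}
week_averages = {1: 155428.57142857142, 2: 154750.0}
place_week_averages = {(1, 1): 159333.33333333334, (1, 2): 169500.0,
                       (2, 1): 175000.0, (2, 2): 110000.0,
                       (3, 1): 130000.0, (3, 2): 150000.0}

# One output file per result, mirroring the pandas answer's out1..out3
with open('out1.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Id place', 'Value'])
    writer.writerows(sorted(sum_column_values.items()))

with open('out2.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Week', 'average Value'])
    writer.writerows(sorted(week_averages.items()))

with open('out3.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Id place', 'Week', 'average Value'])
    # Flatten the (place, week) key tuples into columns
    writer.writerows((p, w, v) for (p, w), v in sorted(place_week_averages.items()))
```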
