I have different sets of data, some at 5-minute, 15-minute or 30-minute intervals. There are hundreds of such files (in different formats: .dat, .txt, .csv, etc.). I would like to filter out the hourly data from all these files using Python. I am new to pandas, and while I am trying to learn the library, any help would be much appreciated.
Date Time Point_1
27/3/2017 0:00:00 13.08
27/3/2017 0:05:00 12.96
27/3/2017 0:10:00 13.3
27/3/2017 0:15:00 13.27
27/3/2017 0:20:00 13.15
27/3/2017 0:25:00 13.14
27/3/2017 0:30:00 13.25
27/3/2017 0:35:00 13.26
27/3/2017 0:40:00 13.24
27/3/2017 0:45:00 13.43
27/3/2017 0:50:00 13.23
27/3/2017 0:55:00 13.27
27/3/2017 1:00:00 13.19
27/3/2017 1:05:00 13.17
27/3/2017 1:10:00 13.1
27/3/2017 1:15:00 13.06
27/3/2017 1:20:00 12.99
27/3/2017 1:25:00 13.08
27/3/2017 1:30:00 13.04
27/3/2017 1:35:00 13.06
27/3/2017 1:40:00 13.07
27/3/2017 1:45:00 13.07
27/3/2017 1:50:00 13.02
27/3/2017 1:55:00 13.13
27/3/2017 2:00:00 12.99
You can use read_csv with the parse_dates parameter to first combine the Date and Time columns into a single datetime:
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed; use the standard library
temp = """Date Time Point_1
27/3/2017 0:00:00 13.08
27/3/2017 0:05:00 12.96
27/3/2017 0:10:00 13.3
27/3/2017 0:15:00 13.27
27/3/2017 0:20:00 13.15
27/3/2017 0:25:00 13.14
27/3/2017 0:30:00 13.25
27/3/2017 0:35:00 13.26
27/3/2017 0:40:00 13.24
27/3/2017 0:45:00 13.43
27/3/2017 0:50:00 13.23
27/3/2017 0:55:00 13.27
27/3/2017 1:00:00 13.19
27/3/2017 1:05:00 13.17
27/3/2017 1:10:00 13.1
27/3/2017 1:15:00 13.06
27/3/2017 1:20:00 12.99
27/3/2017 1:25:00 13.08
27/3/2017 1:30:00 13.04
27/3/2017 1:35:00 13.06
27/3/2017 1:40:00 13.07
27/3/2017 1:45:00 13.07
27/3/2017 1:50:00 13.02
27/3/2017 1:55:00 13.13
27/3/2017 2:00:00 12.99"""
#after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp),
                 sep=r"\s+",  #alternatively delim_whitespace=True
                 index_col=[0],
                 parse_dates={'Dates':['Date','Time']})
Then resample and aggregate with first, sum, mean, etc.:
df1 = df.resample('1H')['Point_1'].first().reset_index()
print (df1)
Dates Point_1
0 2017-03-27 00:00:00 13.08
1 2017-03-27 01:00:00 13.19
2 2017-03-27 02:00:00 12.99
df1 = df.resample('1H')['Point_1'].sum().reset_index()
print (df1)
Dates Point_1
0 2017-03-27 00:00:00 158.58
1 2017-03-27 01:00:00 156.98
2 2017-03-27 02:00:00 12.99
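Mean aggregation follows the same pattern. A minimal, self-contained sketch (the timestamps and values here are illustrative, not the sample data above):

```python
import pandas as pd

# Build a small frame with a DatetimeIndex, then average per hour
idx = pd.to_datetime(["2017-03-27 00:00:00", "2017-03-27 00:30:00",
                      "2017-03-27 01:00:00", "2017-03-27 01:30:00"])
df = pd.DataFrame({"Point_1": [13.0, 13.2, 13.1, 13.3]}, index=idx)

df1 = df.resample("1H")["Point_1"].mean().reset_index()
print(df1)  # one row per hour: 13.1 for 00:00, 13.2 for 01:00
```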
Another solution uses groupby with pd.Grouper:
df1 = df.groupby(pd.Grouper(freq='1H')).first().reset_index()
print (df1)
Dates Point_1
0 2017-03-27 00:00:00 13.08
1 2017-03-27 01:00:00 13.19
2 2017-03-27 02:00:00 12.99
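pd.Grouper also accepts a key= argument, which is handy when the datetime lives in an ordinary column instead of the index. A small sketch (the column names mirror the example above; the data is illustrative):

```python
import pandas as pd

# Datetime kept as a regular column; Grouper(key=...) groups by it
df = pd.DataFrame({
    "Dates": pd.to_datetime(["2017-03-27 00:00:00",
                             "2017-03-27 00:30:00",
                             "2017-03-27 01:00:00"]),
    "Point_1": [13.08, 13.25, 13.19],
})
df1 = df.groupby(pd.Grouper(key="Dates", freq="1H")).first().reset_index()
print(df1)  # first value of each hour: 13.08 and 13.19
```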
Or, if you instead need to drop the rows that fall exactly on the hour and keep the rest:
df = pd.read_csv(StringIO(temp), delim_whitespace=True, parse_dates={'Dates':['Date','Time']})
mask = df.Dates.dt.round('H').ne(df.Dates)
df1 = df[mask]
print (df1)
Dates Point_1
1 2017-03-27 00:05:00 12.96
2 2017-03-27 00:10:00 13.30
3 2017-03-27 00:15:00 13.27
4 2017-03-27 00:20:00 13.15
5 2017-03-27 00:25:00 13.14
6 2017-03-27 00:30:00 13.25
7 2017-03-27 00:35:00 13.26
8 2017-03-27 00:40:00 13.24
9 2017-03-27 00:45:00 13.43
10 2017-03-27 00:50:00 13.23
11 2017-03-27 00:55:00 13.27
13 2017-03-27 01:05:00 13.17
14 2017-03-27 01:10:00 13.10
15 2017-03-27 01:15:00 13.06
16 2017-03-27 01:20:00 12.99
17 2017-03-27 01:25:00 13.08
18 2017-03-27 01:30:00 13.04
19 2017-03-27 01:35:00 13.06
20 2017-03-27 01:40:00 13.07
21 2017-03-27 01:45:00 13.07
22 2017-03-27 01:50:00 13.02
23 2017-03-27 01:55:00 13.13
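Conversely, to keep only the readings taken exactly on the hour, you can test the minute and second components directly, which avoids any rounding ties. A minimal sketch with illustrative data:

```python
import pandas as pd

# Keep only rows whose timestamp is exactly on the hour
df = pd.DataFrame({
    "Dates": pd.to_datetime(["2017-03-27 00:00:00",
                             "2017-03-27 00:05:00",
                             "2017-03-27 01:00:00"]),
    "Point_1": [13.08, 12.96, 13.19],
})
mask = df["Dates"].dt.minute.eq(0) & df["Dates"].dt.second.eq(0)
print(df[mask])  # the 00:00:00 and 01:00:00 rows
```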
import pandas as pd
df = pd.read_csv('sample.txt', sep=r'\s+')  # your sample data
df['dt'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
print(df.set_index('dt').resample('1H').asfreq().reset_index(drop=True))
Date Time Point_1
0 27/3/2017 0:00:00 13.08
1 27/3/2017 1:00:00 13.19
2 27/3/2017 2:00:00 12.99
This is similar to what you are trying to do. It works for CSV files and should also work for your .txt files. If all of the data is in the same order, you can write a for loop that increments a counter and, every twelfth row (twelve 5-minute readings per hour), outputs that value into the xaxis list. However, if your data doesn't follow a consistent 5-minute pattern, you will need to sort it by timestamp first to save yourself a headache down the road. This is easily done with numpy's sort function: https://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html
#open the file and read in the raw data
file = open("file_name", "r")
data = file.read()
file.close()
#clean up the data so it is readable: normalise spaces to commas
data = data.replace(" ", ",")
#the file uses \r line endings, so split on those
data = data.split("\r")
#drop empty lines first; deleting from a list while looping over its
#indices skips entries, so rebuild the list instead
data = [line for line in data if len(line) > 0]
#x axis list for the time values
xaxis = []
for index in range(len(data)):
    print("lines", index, "-", data[index])
    data[index] = data[index].split(",")
    xaxis.append(data[index][1])  #collect the time column
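The counting idea described above can be sketched with a plain slice: with strictly ordered 5-minute data there are twelve readings per hour, so stepping through the rows in strides of 12 keeps only the on-the-hour values. The sample rows here are illustrative:

```python
# One hour of 5-minute readings as (time, value) tuples
rows = [(f"0:{m:02d}:00", 13.0 + m / 100) for m in range(0, 60, 5)]

# A stride of 12 keeps only the reading at the top of each hour
hourly = rows[::12]
print(hourly)  # only the 0:00:00 reading survives from this hour
```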
Thanks all!
Here is my complete code to read all files from all folders and write the filtered (hourly-only) data to new CSV files. I don't code that often, so my programming skills are not great. I am sure there is a better way of doing the same thing, and I am not talking only about the pandas library but about the whole code in general. I wish I could replace my if check with something better; it is mainly there to prevent the list index from going out of range (something like k = k - 1, but I am not sure where to put it). My code works smoothly. If anyone is enthusiastic about making it better, please join in!
My folder structure is: Building1 is the master folder, which contains 20 subfolders, and each subfolder contains 19-20 files.
Cheers
import os
import pandas as pd

folderarray = []
filearray = []

# raw string so the backslashes in the Windows path are not treated as escapes
path = r"C:\Users\Priyanka\Documents\R_Python\OneHourInterval\Building1"
os.chdir(path)

for foldername in os.listdir(os.getcwd()):
    folderarray.append(foldername)
print(folderarray)

for i in range(len(folderarray)):
    filename = os.listdir(os.path.join(path, folderarray[i]))
    filearray.append(filename)

for j in range(len(folderarray)):
    for k in range(len(filearray)):
        if k < len(filearray[j]):  # guard against folders with fewer files
            filepath = os.path.join(path, folderarray[j], filearray[j][k])
            df1 = pd.read_csv(filepath, sep=",", header=None)
            df = df1[2:len(df1)]  # skip the two header rows
            df = df[[0, 1, 2, 3, 4, 5]]
            df.columns = ['Date', 'Time', 'KWH', 'OCT', 'RAT', 'CO2']
            df['dt'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
            df = df.set_index('dt').resample('1H')[['KWH', 'OCT', 'RAT', 'CO2']].first().reset_index()
            print(df)
            print(filepath)
            newfilename = filearray[j][k].replace(".dat", ".csv")
            df.to_csv(os.path.join(path, folderarray[j], newfilename))
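The nested index loops and the guarding if can be avoided entirely with os.walk, which visits every subfolder and yields its files directly. A minimal sketch, assuming the same folder layout and file format as the code above (skiprows=2 stands in for slicing off the two header rows):

```python
import os
import pandas as pd

path = r"C:\Users\Priyanka\Documents\R_Python\OneHourInterval\Building1"

# os.walk yields (folder, subfolders, files) for every level of the tree,
# so no index bookkeeping or bounds check is needed
for dirpath, dirnames, filenames in os.walk(path):
    for fname in filenames:
        if not fname.endswith(".dat"):
            continue
        fullpath = os.path.join(dirpath, fname)
        df = pd.read_csv(fullpath, sep=",", header=None, skiprows=2)
        df = df[[0, 1, 2, 3, 4, 5]]
        df.columns = ['Date', 'Time', 'KWH', 'OCT', 'RAT', 'CO2']
        df['dt'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
        df = (df.set_index('dt')
                .resample('1H')[['KWH', 'OCT', 'RAT', 'CO2']]
                .first()
                .reset_index())
        df.to_csv(os.path.join(dirpath, fname.replace(".dat", ".csv")))
```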