
filter hourly data python

I have different sets of data where some of the data is at a 5-minute, 15-minute, or 30-minute interval. There are hundreds of such files (in different formats - .dat, .txt, .csv etc.). I would like to filter out the hourly data from all these files using Python. I am new to Pandas, and while I am trying to learn the library, any help would be much appreciated.

Date        Time    Point_1
27/3/2017   0:00:00 13.08
27/3/2017   0:05:00 12.96
27/3/2017   0:10:00 13.3
27/3/2017   0:15:00 13.27
27/3/2017   0:20:00 13.15
27/3/2017   0:25:00 13.14
27/3/2017   0:30:00 13.25
27/3/2017   0:35:00 13.26
27/3/2017   0:40:00 13.24
27/3/2017   0:45:00 13.43
27/3/2017   0:50:00 13.23
27/3/2017   0:55:00 13.27
27/3/2017   1:00:00 13.19
27/3/2017   1:05:00 13.17
27/3/2017   1:10:00 13.1
27/3/2017   1:15:00 13.06
27/3/2017   1:20:00 12.99
27/3/2017   1:25:00 13.08
27/3/2017   1:30:00 13.04
27/3/2017   1:35:00 13.06
27/3/2017   1:40:00 13.07
27/3/2017   1:45:00 13.07
27/3/2017   1:50:00 13.02
27/3/2017   1:55:00 13.13
27/3/2017   2:00:00 12.99

You can use read_csv with the parse_dates parameter to combine the Date and Time columns into a single datetime column first:

import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in newer pandas

temp=u"""Date        Time    Point_1
27/3/2017   0:00:00 13.08
27/3/2017   0:05:00 12.96
27/3/2017   0:10:00 13.3
27/3/2017   0:15:00 13.27
27/3/2017   0:20:00 13.15
27/3/2017   0:25:00 13.14
27/3/2017   0:30:00 13.25
27/3/2017   0:35:00 13.26
27/3/2017   0:40:00 13.24
27/3/2017   0:45:00 13.43
27/3/2017   0:50:00 13.23
27/3/2017   0:55:00 13.27
27/3/2017   1:00:00 13.19
27/3/2017   1:05:00 13.17
27/3/2017   1:10:00 13.1
27/3/2017   1:15:00 13.06
27/3/2017   1:20:00 12.99
27/3/2017   1:25:00 13.08
27/3/2017   1:30:00 13.04
27/3/2017   1:35:00 13.06
27/3/2017   1:40:00 13.07
27/3/2017   1:45:00 13.07
27/3/2017   1:50:00 13.02
27/3/2017   1:55:00 13.13
27/3/2017   2:00:00 12.99"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp),
                sep=r"\s+", #raw string; alternatively delim_whitespace=True
                index_col=[0],
                parse_dates={'Dates':['Date','Time']})

Then resample and aggregate with first, sum, mean, etc.:

df1 = df.resample('1H')['Point_1'].first().reset_index()
print (df1)
                Dates  Point_1
0 2017-03-27 00:00:00    13.08
1 2017-03-27 01:00:00    13.19
2 2017-03-27 02:00:00    12.99
df1 = df.resample('1H')['Point_1'].sum().reset_index()
print (df1)
                Dates  Point_1
0 2017-03-27 00:00:00   158.58
1 2017-03-27 01:00:00   156.98
2 2017-03-27 02:00:00    12.99
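Any aggregation works with resample; as a minimal self-contained sketch (using a shortened, hypothetical sample at 30-minute spacing, and combining the date and time after reading rather than via parse_dates), mean() averages each hourly bucket instead of taking its first value:

```python
from io import StringIO
import pandas as pd

# shortened, hypothetical sample at 30-minute spacing
temp = u"""Date Time Point_1
27/3/2017 0:00:00 13.08
27/3/2017 0:30:00 13.25
27/3/2017 1:00:00 13.19
27/3/2017 1:30:00 13.04
27/3/2017 2:00:00 12.99"""

df = pd.read_csv(StringIO(temp), sep=r"\s+")
df['Dates'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], dayfirst=True)
df = df.set_index('Dates')

# average of each hourly bucket
df1 = df.resample('1H')['Point_1'].mean().reset_index()
print(df1)
```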

Another solution uses groupby with pd.Grouper:

df1 = df.groupby(pd.Grouper(freq='1H')).first().reset_index()
print (df1)
                Dates  Point_1
0 2017-03-27 00:00:00    13.08
1 2017-03-27 01:00:00    13.19
2 2017-03-27 02:00:00    12.99

Or, if you need the opposite filter - every row that does not fall exactly on the hour:

df = pd.read_csv(StringIO(temp),delim_whitespace=True, parse_dates={'Dates':['Date','Time']})

mask = df.Dates.dt.round('H').ne(df.Dates)
df1 = df[mask]
print (df1)
                 Dates  Point_1
1  2017-03-27 00:05:00    12.96
2  2017-03-27 00:10:00    13.30
3  2017-03-27 00:15:00    13.27
4  2017-03-27 00:20:00    13.15
5  2017-03-27 00:25:00    13.14
6  2017-03-27 00:30:00    13.25
7  2017-03-27 00:35:00    13.26
8  2017-03-27 00:40:00    13.24
9  2017-03-27 00:45:00    13.43
10 2017-03-27 00:50:00    13.23
11 2017-03-27 00:55:00    13.27
13 2017-03-27 01:05:00    13.17
14 2017-03-27 01:10:00    13.10
15 2017-03-27 01:15:00    13.06
16 2017-03-27 01:20:00    12.99
17 2017-03-27 01:25:00    13.08
18 2017-03-27 01:30:00    13.04
19 2017-03-27 01:35:00    13.06
20 2017-03-27 01:40:00    13.07
21 2017-03-27 01:45:00    13.07
22 2017-03-27 01:50:00    13.02
23 2017-03-27 01:55:00    13.13
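If the goal is the reverse of the mask above - keeping only the readings that land exactly on the hour - the minute and second components can be tested directly. A minimal sketch on a small, hypothetical frame:

```python
import pandas as pd

# hypothetical small frame with an already-parsed datetime column
df = pd.DataFrame({
    'Dates': pd.to_datetime(['2017-03-27 00:00:00',
                             '2017-03-27 00:05:00',
                             '2017-03-27 01:00:00']),
    'Point_1': [13.08, 12.96, 13.19]})

# keep only timestamps landing exactly on the hour
on_hour = df[(df['Dates'].dt.minute == 0) & (df['Dates'].dt.second == 0)]
print(on_hour)
```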
import pandas as pd

df = pd.read_csv('sample.txt', sep=r'\s+')  # Your sample data
df['dt'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], dayfirst=True)

print(df.set_index('dt').resample('1H').asfreq().reset_index(drop=True))


        Date     Time  Point_1
0  27/3/2017  0:00:00    13.08
1  27/3/2017  1:00:00    13.19
2  27/3/2017  2:00:00    12.99

This is similar to what you are trying to do. It works for csv files and should work for your .txt files as well. If all of the data is in the same order, you can easily write a for loop that increments a count and, when it hits 13, outputs that value into the xaxis list. However, if your data doesn't follow the same pattern of increasing in 5-minute increments, you will need to sort it by another metric to save yourself a headache down the road. This is easily done with numpy's sort function: https://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html

#open the file and read in the raw data,
#then clean it up so it is parseable
with open("file_name", "r") as file:
    data = file.read()
data = data.replace(" ", ",")
#when reading in the data each line ends with \r,
#so split on it to get values we can cast to float
data = data.split("\r")
#x axis list
xaxis = []
#collect the time column
for index in range(0, len(data)):
    xaxis.append(data[index][0])
#drop empty lines so the parsing below
#does not hit an index error
data = [line for line in data if len(line) > 0]
for index in range(0, len(data)):
    print("lines", index, "-", data[index])
    data[index] = data[index].split(",")
    data[index][1] = int(data[index][1])

Thanks All !!

Here is my complete code to read all files from all folders and write the filtered (hourly-only) data to new csv files. I don't code that often, so my programming skills are not that great. I am sure there is a better way of doing the same thing, and I am not talking only about the pandas library but rather the whole code in general. I wish I could replace my if check with something better; it is mainly there to prevent the list index from going out of range (something like k = k - 1, but I am not sure where to put it). My code is working smoothly. If there are any enthusiasts for making it better, please join in!

My folder structure is like: Building1 is the master folder which contains 20 subfolders and each subfolder contains 19-20 files.

Cheers

import os
import pandas as pd

folderarray = []
filearray = []

# raw string so backslashes in the Windows path are not treated as escapes
path = r"C:\Users\Priyanka\Documents\R_Python\OneHourInterval\Building1"
os.chdir(path)

for foldername in os.listdir(os.getcwd()):
    folderarray.append(foldername)
    print(folderarray)

for i in range(0, len(folderarray)):
    filename = os.listdir(os.path.join(path, folderarray[i]))
    filearray.append(filename)

for j in range(0, len(folderarray)):
    for k in range(0, len(filearray)):
        if k < len(filearray[j]):  # guard against running past this folder's file list
            filepath = os.path.join(path, folderarray[j], filearray[j][k])
            df1 = pd.read_csv(filepath, sep=",", header=None)
            df = df1[2:len(df1)]
            df = df[[0, 1, 2, 3, 4, 5]]
            df.columns = ['Date', 'Time', 'KWH', 'OCT', 'RAT', 'CO2']
            df['dt'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
            df = df.set_index('dt').resample('1H')[['KWH', 'OCT', 'RAT', 'CO2']].first().reset_index()
            print(df)
            print(filepath)
            newfilename = filearray[j][k].replace(".dat", ".csv")
            df.to_csv(os.path.join(path, folderarray[j], newfilename))
