
Find the hour in the H2O frame

I am trying to extract the hour from a column whose values are in "hhmmss" format, e.g. "90205", where 9 is the hour. Some rows may not include seconds, so a value can be "902", and I still need to get the "9". An example of the column is as follows:

REQ_TIME
195426
508
140315
141432
203344
214103
63202
101807
110730
115052

I can do this in a regular pandas DataFrame like this:

import pandas as pd

# Parse the date, left-pad the time to six digits, then combine the two
df["DATE"] = pd.to_datetime(df["REQ_DATE"], format="%Y%m%d")
df["TIME"] = df["REQ_TIME"].astype(str).str.zfill(6)
df["DATE_TIME"] = pd.to_datetime(df["REQ_DATE"].astype(str) + " " + df["TIME"], format="%Y%m%d %H%M%S")
df["HOUR"] = df["DATE_TIME"].dt.hour
df["YEAR"] = df["DATE"].dt.year
df["MONTH"] = df["DATE"].dt.month
df["DAY"] = df["DATE"].dt.day
df["DAY_OF_WEEK"] = df["DATE"].dt.dayofweek

But my data is in an H2OFrame, so I cannot use the regular Python methods. I also do not want to convert it to a DataFrame, since that takes a long time. How can I do this in an H2OFrame?

If your REQ_TIME field were always 6 digits, i.e. always zero-padded on both the left and the right, this would be much easier. E.g. you could use gsub to keep just the first two characters.

Or, if it were always zero-padded on the right (i.e. "00" seconds appended when missing) and it was imported as a numeric field, you could divide by 10000 and use floor.
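To make the arithmetic behind those two easy cases concrete, here is a minimal plain-Python sketch. The function names are hypothetical and it works on single values rather than on an H2OFrame; on a numeric H2O column the analogous expression would presumably be something like (frame["REQ_TIME"] / 10000).floor(), which I have not run against a cluster.

```python
def hour_from_padded_number(req_time):
    """Hour of a numeric hhmmss value that always includes seconds (e.g. 90200)."""
    # Dropping the last four digits (mmss) leaves the hour:
    # floor(90200 / 10000) == 9
    return int(req_time) // 10000


def hour_from_padded_string(req_time):
    """Hour of a six-character, zero-padded "hhmmss" string (e.g. "090205")."""
    if len(req_time) != 6 or not req_time.isdigit():
        raise ValueError("expected exactly six digits, got %r" % (req_time,))
    # With full padding the hour is always the first two characters.
    return int(req_time[:2])
```

Both helpers rely on the padding assumption: a value such as 902 (no seconds) would give 0 from the numeric version and is rejected by the string version.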

(See http://h2o-release.s3.amazonaws.com/h2o/rel-turchin/5/docs-website/h2o-py/docs/frame.html for the operations available on H2OFrames from the Python API.)

But in your case, I'd download just that column, do the complex manipulation in Python, then import the result as a new H2OFrame containing only that column. Give it a column name of "hours". Then use cbind to join the new column to your existing H2OFrame.
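A rough sketch of that download / transform / cbind round trip follows. The helper names are hypothetical, and the parsing rule is an assumption: values with an odd number of digits ("hmm", "hmmss") have a one-digit hour, and values with an even number ("hhmm", "hhmmss") have a two-digit hour. It also assumes REQ_TIME was imported as an integer column and that an H2O cluster is already running; the H2O part has not been tested here.

```python
def hour_from_req_time(value):
    """Return the hour from an hmm / hhmm / hmmss / hhmmss value, or None."""
    digits = str(value).strip()
    if not digits.isdigit() or not 3 <= len(digits) <= 6:
        return None  # missing or malformed value
    # Assumption: odd length means a one-digit hour, even length a two-digit hour.
    hour_width = 1 if len(digits) % 2 else 2
    hour = int(digits[:hour_width])
    return hour if hour <= 23 else None


def add_hours_column(frame, time_column="REQ_TIME"):
    """Download one column, compute the hours locally, and cbind them back."""
    import h2o  # imported lazily; needs a running H2O cluster

    # Pull only the time column to the client as a list of rows (no header row).
    rows = frame[time_column].as_data_frame(use_pandas=False, header=False)
    hours = [hour_from_req_time(row[0]) for row in rows]
    # Upload the result as a one-column frame named "hours" and attach it.
    hours_frame = h2o.H2OFrame({"hours": hours})
    return frame.cbind(hours_frame)
```

Usage would be along the lines of frame = add_hours_column(frame). Note that the parsing rule cannot recover times whose leading zeros were lost on import: 00:05:08 stored as 508 is read as 5:08, which is one more reason to fix the data at the source.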

(Another way to view this problem is that the first line of your question is inaccurate: the column is not in "hhmmss" format, but is in fact a mix of "hmm", "hhmm", "hmmss" and "hhmmss" in one column. Once you describe it like that, you see you have a data problem. Personally, I'd look into getting that fixed at the point of data collection. Then, going forward, if you ever see a timestamp that is not exactly 6 digits, you immediately know you have bad data.)
