How to split day, hour, minute and second data in a huge Pandas data frame?
I'm new to Python and I'm working on a project for a Data Science class I'm taking. I have a big CSV file (around 190 million lines, approx. 7GB of data) and first I need to do some data preparation.
Full disclaimer: the data here is from this Kaggle competition.
A picture from Jupyter Notebook with headers follows. Although it reads full_data.head(), I'm using a 100,000-line sample just to test the code.
The most important column is click_time. The format is: dd hh:mm:ss. I want to split this into 4 different columns: day, hour, minute and second. I've reached a solution that works fine with this small file, but it takes too long to run on 10% of the real data, let alone on 100% of it (I haven't even been able to try that, since just reading the full CSV is a big problem right now).
Here it is:
# First I need to split the values
click = full_data['click_time']
del full_data['click_time']
click = click.str.replace(' ', ':')
click = click.str.split(':')
# Then I transform everything into integers. The last piece of code
# returns an array of lists, one for each line, and each list has 4
# elements. I couldn't figure out another way of making this conversion
click = click.apply(lambda x: list(map(int, x)))
# Now I transform everything into unidimensional arrays
day = np.zeros(len(click), dtype = 'uint8')
hour = np.zeros(len(click), dtype = 'uint8')
minute = np.zeros(len(click), dtype = 'uint8')
second = np.zeros(len(click), dtype = 'uint8')
for i in range(0, len(click)):
    day[i] = click[i][0]
    hour[i] = click[i][1]
    minute[i] = click[i][2]
    second[i] = click[i][3]
del click
# Transforming everything to a Pandas series
day = pd.Series(day, index = full_data.index, dtype = 'uint8')
hour = pd.Series(hour, index = full_data.index, dtype = 'uint8')
minute = pd.Series(minute, index = full_data.index, dtype = 'uint8')
second = pd.Series(second, index = full_data.index, dtype = 'uint8')
# Adding to data frame
full_data['day'] = day
del day
full_data['hour'] = hour
del hour
full_data['minute'] = minute
del minute
full_data['second'] = second
del second
The result is OK and it's what I want, but there has to be a faster way of doing this.
Any ideas on how to improve this implementation? If anyone is interested in the dataset, this is from test_sample.csv: https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
Thanks a lot in advance!
EDIT 1: Following @COLDSPEED's request, here are the results of full_data.head().to_dict():
{'app': {0: 12, 1: 25, 2: 12, 3: 13, 4: 12},
'channel': {0: 497, 1: 259, 2: 212, 3: 477, 4: 178},
'click_time': {0: '07 09:30:38',
1: '07 13:40:27',
2: '07 18:05:24',
3: '07 04:58:08',
4: '09 09:00:09'},
'device': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'ip': {0: 87540, 1: 105560, 2: 101424, 3: 94584, 4: 68413},
'is_attributed': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'os': {0: 13, 1: 17, 2: 19, 3: 13, 4: 1}}
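For anyone who wants to try the answers on these rows, here is a minimal sketch of loading the dictionary above back into a frame. The names sample and df are just placeholders, not anything from the original notebook.

```python
import pandas as pd

# Rows copied from the full_data.head().to_dict() output above.
sample = {
    'app': {0: 12, 1: 25, 2: 12, 3: 13, 4: 12},
    'channel': {0: 497, 1: 259, 2: 212, 3: 477, 4: 178},
    'click_time': {0: '07 09:30:38', 1: '07 13:40:27', 2: '07 18:05:24',
                   3: '07 04:58:08', 4: '09 09:00:09'},
    'device': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
    'ip': {0: 87540, 1: 105560, 2: 101424, 3: 94584, 4: 68413},
    'is_attributed': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
    'os': {0: 13, 1: 17, 2: 19, 3: 13, 4: 1},
}

# A dict of {column: {row_label: value}} maps directly onto a DataFrame.
df = pd.DataFrame(sample)
```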
Convert to timedelta and extract components:
v = df.click_time.str.split()
df['days'] = v.str[0].astype(int)
df[['hours', 'minutes', 'seconds']] = (
    pd.to_timedelta(v.str[-1]).dt.components.iloc[:, 1:4]
)
df
app channel click_time device ip is_attributed os days hours \
0 12 497 07 09:30:38 1 87540 0 13 7 9
1 25 259 07 13:40:27 1 105560 0 17 7 13
2 12 212 07 18:05:24 1 101424 0 19 7 18
3 13 477 07 04:58:08 1 94584 0 13 7 4
4 12 178 09 09:00:09 1 68413 0 1 9 9
minutes seconds
0 30 38
1 40 27
2 5 24
3 58 8
4 0 9
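If every value really is zero-padded to the fixed-width dd hh:mm:ss layout shown in the sample (an assumption worth checking on the full file), another option is to skip parsing and slice the string by position. This is only a sketch; split_click_time is a made-up helper name, and it would need timing against the timedelta approach on real data.

```python
import pandas as pd

def split_click_time(df, col='click_time'):
    """Add day/hour/minute/second columns by slicing a fixed-width string.

    Assumes every value looks exactly like 'dd hh:mm:ss' (zero-padded).
    """
    s = df[col]
    # Character positions: dd = 0-1, hh = 3-4, mm = 6-7, ss = 9-10.
    df['day'] = s.str[0:2].astype('uint8')
    df['hour'] = s.str[3:5].astype('uint8')
    df['minute'] = s.str[6:8].astype('uint8')
    df['second'] = s.str[9:11].astype('uint8')
    return df

df = pd.DataFrame({'click_time': ['07 09:30:38', '07 13:40:27', '09 09:00:09']})
df = split_click_time(df)
```

Storing the results as uint8 mirrors the dtype used in the question and keeps each new column at one byte per row.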
One solution is to first split by whitespace, then convert to datetime objects, then extract components directly.
import pandas as pd
df = pd.DataFrame({'click_time': ['07 09:30:38', '07 13:40:27', '07 18:05:24',
                                  '07 04:58:08', '09 09:00:09', '09 01:22:13',
                                  '09 01:17:58', '07 10:01:53', '08 09:35:17',
                                  '08 12:35:26']})
df[['day', 'time']] = df['click_time'].str.split(expand=True)
df['datetime'] = pd.to_datetime(df['time'])
df['day'] = df['day'].astype(int)
df['hour'] = df['datetime'].dt.hour
df['minute'] = df['datetime'].dt.minute
df['second'] = df['datetime'].dt.second
df = df.drop(columns=['time', 'datetime'])
Result
click_time day hour minute second
0 07 09:30:38 7 9 30 38
1 07 13:40:27 7 13 40 27
2 07 18:05:24 7 18 5 24
3 07 04:58:08 7 4 58 8
4 09 09:00:09 9 9 0 9
5 09 01:22:13 9 1 22 13
6 09 01:17:58 9 1 17 58
7 07 10:01:53 7 10 1 53
8 08 09:35:17 8 9 35 17
9 08 12:35:26 8 12 35 26
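Since the question also mentions that just reading the full CSV is a problem, here is a rough sketch of doing the split chunk by chunk with the chunksize parameter of pd.read_csv. The helper name load_with_split and the chunk size are illustrative assumptions, and the small in-memory CSV only stands in for the real file.

```python
import io
import pandas as pd

def load_with_split(source, chunksize=1_000_000):
    """Read a CSV in chunks and split click_time into four uint8 columns."""
    pieces = []
    for chunk in pd.read_csv(source, chunksize=chunksize,
                             dtype={'click_time': str}):
        # Split 'dd hh:mm:ss' on either a space or a colon in one pass.
        parts = chunk['click_time'].str.split('[ :]', expand=True).astype('uint8')
        parts.columns = ['day', 'hour', 'minute', 'second']
        # Drop the original string column and attach the numeric ones.
        pieces.append(pd.concat([chunk.drop(columns='click_time'), parts], axis=1))
    return pd.concat(pieces, ignore_index=True)

# Tiny stand-in for the real file, using rows from the sample above.
csv_text = (
    "ip,click_time\n"
    "87540,07 09:30:38\n"
    "105560,07 13:40:27\n"
    "68413,09 09:00:09\n"
)
result = load_with_split(io.StringIO(csv_text), chunksize=2)
```

Note that the final concat still holds the whole result in memory; the chunking only keeps the intermediate string work small, so whether this fits depends on the machine.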