Multiple Count and Median Values from a Dataframe

I am trying to perform several operations in one program. I have a data-frame with dates whose start and end I do not know, and I want to find:

  1. The total number of days in the data-set
  2. The total number of hours
  3. The median of the count
  4. A separate output with the median per day/date
  5. If possible, the median-of-median, in the simplest way possible

Input: a few rows from a large file, several GB in size

2004-01-05,16:00:00,17:00:00,Mon,10766,656
2004-01-05,17:00:00,18:00:00,Mon,12223,670
2004-01-05,18:00:00,19:00:00,Mon,12646,710
2004-01-05,19:00:00,20:00:00,Mon,19269,778
2004-01-05,20:00:00,21:00:00,Mon,20504,792
2004-01-05,21:00:00,22:00:00,Mon,16553,783
2004-01-05,22:00:00,23:00:00,Mon,18944,790
2004-01-05,23:00:00,00:00:00,Mon,17534,750
2004-01-06,00:00:00,01:00:00,Tue,17262,747
2004-01-06,01:00:00,02:00:00,Tue,19072,777
2004-01-06,02:00:00,03:00:00,Tue,18275,785
2004-01-06,03:00:00,04:00:00,Tue,13589,757
2004-01-06,04:00:00,05:00:00,Tue,16053,735

The start and end dates are NOT known.

Edit: Expected output 1 will have only one row of results:

days,hours,median,median-of-median
2,13,17262,17398

Median-of-median is the median of the median column from output 2.

Expected output 2 will have the median for every date; these are then used to find the median-of-median:

date,median
2004-01-05,17534
2004-01-06,17262

Code:

import pandas as pd 
from datetime import datetime

df = pd.read_csv('one_hour.csv')
df.columns = ['date', 'startTime', 'endTime', 'day', 'count', 'unique']

date_count = df.count(['date'])
all_median = df.median(['count'])
all_hours = df.count(['startTime'])
med_med = df.groupby(['date','count']).median()

print date_count
print all_median
print all_hours

stats = ['date_count', 'all_median', 'all_hours', 'median-of-median']
stats.to_csv('stats_all.csv', index=False)

med_med.to_csv('med_day.csv', index=False, header=False)

Obviously the code does not give the result it is supposed to. The error is shown below.

Error:

Traceback (most recent call last):
  File "day_median.py", line 8, in <module>
    all_median = df.median(['count'])
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 5310, in stat_func
    numeric_only=numeric_only)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4760, in _reduce
    axis = self._get_axis_number(axis)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 308, in _get_axis_number
    axis = self._AXIS_ALIASES.get(axis, axis)
TypeError: unhashable type: 'list'

If I understand correctly, it may help to change:

date_count = df.count(['date'])
all_median = df.median(['count'])
all_hours = df.count(['startTime'])

to:

date_count = df['date'].count()
all_median = df['count'].median()
all_hours = df['startTime'].count()

print (date_count)
print (all_median)
print (all_hours)
13
17262.0
13

This applies if you need count statistics for the date, count and startTime columns.
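The traceback comes from the fact that the first positional parameter of DataFrame.median and DataFrame.count is axis, so the list ['count'] was treated as an axis rather than as a column selection. Below is a minimal sketch of the select-then-reduce pattern. It uses three of the sample rows inlined in a small frame (an assumption made for illustration) instead of reading one_hour.csv.

```python
import pandas as pd

# Three of the question's sample rows, inlined for illustration only.
df = pd.DataFrame({
    'date': ['2004-01-05', '2004-01-05', '2004-01-06'],
    'startTime': ['22:00:00', '23:00:00', '00:00:00'],
    'count': [18944, 17534, 17262],
})

# Select one column first, then reduce: each call returns a single scalar.
all_median = df['count'].median()
all_hours = df['startTime'].count()

# Selecting several columns and reducing returns one value per column.
per_column = df[['date', 'startTime']].count()

print(all_median)
print(all_hours)
print(per_column)
```

Selecting with df['count'] gives a Series, so median() needs no arguments. Selecting with a list of names gives a DataFrame, and the reduction then runs column by column.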

EDIT (in response to a comment):

If you need to count the unique values of a column, use nunique:

date_count = df['date'].nunique()
print (date_count)
2

Building the stats DataFrame:

cols = ['date_count', 'all_median', 'all_hours']
stats = pd.DataFrame([[date_count, all_median, all_hours]], columns = cols)
print (stats)
   date_count  all_median  all_hours
0           2     17262.0         13
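The per-date medians and the median-of-median from the question could be obtained along the same lines with groupby. The following is a sketch under some assumptions: the 13 sample rows are inlined instead of reading one_hour.csv, the file is assumed to have no header row (hence header=None), and each row is assumed to represent one hour, so the row count is used as the number of hours. The names med_day and med_med are only illustrative.

```python
import io
import pandas as pd

# The 13 sample rows from the question, inlined instead of reading one_hour.csv.
raw = """2004-01-05,16:00:00,17:00:00,Mon,10766,656
2004-01-05,17:00:00,18:00:00,Mon,12223,670
2004-01-05,18:00:00,19:00:00,Mon,12646,710
2004-01-05,19:00:00,20:00:00,Mon,19269,778
2004-01-05,20:00:00,21:00:00,Mon,20504,792
2004-01-05,21:00:00,22:00:00,Mon,16553,783
2004-01-05,22:00:00,23:00:00,Mon,18944,790
2004-01-05,23:00:00,00:00:00,Mon,17534,750
2004-01-06,00:00:00,01:00:00,Tue,17262,747
2004-01-06,01:00:00,02:00:00,Tue,19072,777
2004-01-06,02:00:00,03:00:00,Tue,18275,785
2004-01-06,03:00:00,04:00:00,Tue,13589,757
2004-01-06,04:00:00,05:00:00,Tue,16053,735
"""
cols = ['date', 'startTime', 'endTime', 'day', 'count', 'unique']
# Assumption: the file has no header row, so the names are supplied here.
df = pd.read_csv(io.StringIO(raw), header=None, names=cols)

# Output 2: median of 'count' for each date.
med_day = df.groupby('date')['count'].median().reset_index(name='median')

# Median of the per-date medians.
med_med = med_day['median'].median()

# Output 1: a single row of summary statistics.
# Assumption: one row per hour, so the row count stands in for hours.
stats = pd.DataFrame(
    [[df['date'].nunique(), len(df), df['count'].median(), med_med]],
    columns=['days', 'hours', 'median', 'median-of-median'],
)

# to_csv() without a path returns the CSV text; pass a file name
# such as 'med_day.csv' or 'stats_all.csv' to write to disk instead.
print(med_day.to_csv(index=False))
print(stats.to_csv(index=False))
```

On these sample rows 2004-01-05 has eight values, so pandas averages the two middle ones and reports 17043.5 for that date, which makes the median-of-median 17152.75. Both differ from the illustrative figures in the question's expected output.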
