
Limit on Pandas groupby count

I have a DataFrame crimes_df:

>> crimes_df.size
6198374

I need to count events with the same "s_lat", "s_lon" and "date". I use groupby:

crimes_count_df = crimes_df\
    .groupby(["s_lat", "s_lon", "date"])\
    .size()\
    .to_frame("crimes")

But it doesn't give the correct answer, because if you calculate the sum you can see that most events were lost:

>> crimes_count_df.sum()
crimes    476798
dtype: int64

I've also tried agg:

crimes_count_df = crimes_df\
    .groupby(["s_lat", "s_lon", "date"])\
    .agg(['count'])

But I get the same result:

crimes_count_df.sum()
Unnamed: 0            count    476798
area                  count    476798
arrest                count    476798
description           count    476798
domestic              count    476798
latitude              count    476798
location_description  count    475712
longitude             count    476798
time                  count    476798
type                  count    476798

EDIT: I found out there is a limit on these aggregation functions! See these commands:

crimes_df.head(100) \
    .groupby(["s_lat", "s_lon", "date"]) \
    .size() \
    .to_frame("crimes")\
    .sum()
crimes    100
dtype: int64

crimes_df.head(1000) \
    .groupby(["s_lat", "s_lon", "date"]) \
    .size() \
    .to_frame("crimes")\
    .sum()
crimes    1000
dtype: int64

crimes_df.head(10000) \
    .groupby(["s_lat", "s_lon", "date"]) \
    .size() \
    .to_frame("crimes")\
    .sum()
crimes    10000
dtype: int64

crimes_df.head(100000) \
    .groupby(["s_lat", "s_lon", "date"]) \
    .size() \
    .to_frame("crimes")\
    .sum()
crimes    100000
dtype: int64

crimes_df.head(1000000) \
    .groupby(["s_lat", "s_lon", "date"]) \
    .size() \
    .to_frame("crimes")\
    .sum()
crimes    476798
dtype: int64

crimes_df.head(10000000) \
    .groupby(["s_lat", "s_lon", "date"]) \
    .size() \
    .to_frame("crimes")\
    .sum()
crimes    476798
dtype: int64

crimes_df.head(476799) \
    .groupby(["s_lat", "s_lon", "date"]) \
    .size() \
    .to_frame("crimes")\
    .sum()
crimes    476798
dtype: int64

If you want to check it yourself, here is the file with the data:

https://www.dropbox.com/s/ib0kq16t4c2e5a2/CrimeDataWithSquare.csv?dl=0

You can load it this way:

from pandas import read_csv, DataFrame
crimes_df = read_csv("CrimeDataWithSquare.csv")

Info

crimes_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 476798 entries, 0 to 476797
Data columns (total 13 columns):
Unnamed: 0              476798 non-null int64
area                    476798 non-null float64
arrest                  476798 non-null bool
date                    476798 non-null object
description             476798 non-null object
domestic                476798 non-null bool
latitude                476798 non-null float64
location_description    475712 non-null object
longitude               476798 non-null float64
time                    476798 non-null object
type                    476798 non-null object
s_lon                   476798 non-null float64
s_lat                   476798 non-null float64
dtypes: bool(2), float64(5), int64(1), object(5)
memory usage: 40.9+ MB

I think it's not a bug. The size attribute is not always equal to the number of rows. Let's look at your case:

import pandas as pd

crimes_df = pd.read_csv("CrimeDataWithSquare.csv")

crimes_df.shape
#(476798, 13)

crimes_df.shape[0] * crimes_df.shape[1]
#6198374

crimes_df.size
#6198374

len(crimes_df)
#476798

What does the documentation say about size?

number of elements in the NDFrame

Generally, a DataFrame has 2 dimensions (X rows by Y columns). Thus, DataFrame.size returns X times Y, the number of elements in it.
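For instance, a minimal sketch with a small hypothetical frame makes the difference between size, shape and len visible:

import pandas as pd

# hypothetical 3-row, 2-column frame, just to show what .size counts
demo = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

demo.shape
#(3, 2)

demo.size
#6  -> rows * columns (number of elements)

len(demo)
#3  -> number of rows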

What if you take a single column?

crimes_df2 = crimes_df.iloc[:, 0]
len(crimes_df2) == crimes_df2.size

#True

That's the result you were expecting.
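So the groupby result never lost any events: the sum of the group sizes should be compared with the number of rows, not with .size. A minimal sketch of that check, using the column names from your question:

crimes_count_df = crimes_df\
    .groupby(["s_lat", "s_lon", "date"])\
    .size()\
    .to_frame("crimes")

# compare against the row count, not the element count
crimes_count_df["crimes"].sum() == len(crimes_df)

#True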

Try this:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    'a': [1, 2, 3] * 4,
    'b': np.random.choice(['q','w','a'], size=12),
    'c': 1
})

df
    a  b  c
0   1  q  1
1   2  w  1
2   3  q  1
3   1  w  1
4   2  w  1
5   3  a  1
6   1  q  1
7   2  a  1
8   3  q  1
9   1  q  1
10  2  q  1
11  3  a  1

df.groupby(['a', 'b']).count()

     c
a b   
1 q  3
  w  1
2 a  1
  q  1
  w  2
3 a  2
  q  2
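Note that count() and size() report different things: count() counts the non-null values per column within each group, while size() counts the rows in each group, which is why location_description showed 475712 instead of 476798 in your agg output. A small sketch with a hypothetical frame containing one missing value:

import pandas as pd

# hypothetical frame with a NaN in a non-key column
df2 = pd.DataFrame({"a": [1, 1, 2], "b": [None, "x", "y"]})

df2.groupby("a").count()
#   b
#a
#1  1   <- the NaN in 'b' is not counted
#2  1

df2.groupby("a").size()
#a
#1    2   <- every row in the group is counted
#2    1
#dtype: int64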

Is it possible that some of your dataset has missing values, such as date? If I recall correctly, a None value is not going to be grouped (though I could be wrong). Have you tried using fillna(0)?

# fill missing values before grouping so rows with NaN keys are not dropped
crimes_count_df = crimes_df\
    .fillna(0)\
    .groupby(["s_lat", "s_lon", "date"])\
    .size()\
    .to_frame("crimes")
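For reference, rows whose grouping key is NaN are silently dropped by groupby by default, which is the loss this suggestion guards against; in newer pandas (1.1+) you can also pass dropna=False instead of filling. A minimal sketch with a hypothetical frame:

import pandas as pd

# hypothetical frame where one row has a missing grouping key
df3 = pd.DataFrame({"date": ["2020-01-01", None, "2020-01-02"], "n": [1, 1, 1]})

df3.groupby("date").size().sum()
#2  -> the row with a NaN key is dropped

df3.groupby("date", dropna=False).size().sum()
#3  -> kept (pandas >= 1.1)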
