
Pandas - Finding Unique Entries in Daily Census Data

I have daily census data covering a full month, in the format shown below, and I want to find out how many unique inmates there were during that month. A snapshot is recorded every day, so the same inmate appears on multiple days.

    _id,Date,Gender,Race,Age at Booking,Current Age
    1,2016-06-01,M,W,32,33
    2,2016-06-01,M,B,25,27
    3,2016-06-01,M,W,31,33

My current method is to group the rows by day and then add any rows that are not yet accounted for to a new DataFrame. My question is how to account for two different people who have the same information. Would the second one be left out of the new DataFrame because a matching row already exists? I'm trying to figure out how many people in total were in the prison during this period.

_id is incremental. For example, here is some data from the second day:

2323,2016-06-02,M,B,20,21
2324,2016-06-02,M,B,44,45
2325,2016-06-02,M,B,22,22
2326,2016-06-02,M,B,38,39

Link to the dataset: https://data.wprdc.org/dataset/allegheny-county-jail-daily-census

You could use df.drop_duplicates(), which returns the DataFrame with duplicate rows removed, and then count the remaining entries.

Something like this should work:

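import pandas as pd

# _id becomes the index, so drop_duplicates() ignores it and compares
# only the remaining columns (Date, Gender, Race and the two ages)
df = pd.read_csv('inmates_062016.csv', index_col=0, parse_dates=True)

# keep the first occurrence of each distinct row, then count the rows
uniqueDF = df.drop_duplicates()
countUniques = len(uniqueDF.index)
print(countUniques)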

Result:

>> 11845

Pandas drop_duplicates Documentation

Inmates June 2016 CSV
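
One thing to keep in mind is that Date is still a column at this point, so rows for the same person on different days differ in Date and are not treated as duplicates of each other. The following is only a minimal sketch on made-up rows (not taken from the real file) of how the subset argument of drop_duplicates() can restrict the comparison to the demographic columns; the names sample and demo_cols are chosen here for illustration.

```python
import pandas as pd

# Made-up rows in the same layout as the census file (not real data):
# the same two people appear on two consecutive days.
sample = pd.DataFrame({
    "_id": [1, 2, 3, 4],
    "Date": ["2016-06-01", "2016-06-01", "2016-06-02", "2016-06-02"],
    "Gender": ["M", "M", "M", "M"],
    "Race": ["W", "B", "W", "B"],
    "Age at Booking": [32, 25, 32, 25],
    "Current Age": [33, 27, 33, 27],
}).set_index("_id")

# Date takes part in the comparison, so no row is dropped.
with_date = len(sample.drop_duplicates())

# Compare on the demographic columns only, ignoring Date.
demo_cols = ["Gender", "Race", "Age at Booking", "Current Age"]
without_date = len(sample.drop_duplicates(subset=demo_cols))

print(with_date, without_date)  # 4 2
```

Whether ignoring Date is appropriate depends on how much you trust the remaining columns to identify a person.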

The problem with this approach, and with the data, is that there could be many individual inmates of the same age, gender and race, and they would be filtered out as duplicates.

I think the trick here is to group by as many columns as possible and check how the counts in those (small) groups change through the month:
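
To make that undercount concrete, here is a small sketch on made-up rows (not real data): two different inmates recorded on the same day with identical gender, race and ages collapse into a single row. The duplicated(keep=False) call is used here only to flag the rows that cannot be told apart.

```python
import pandas as pd

# Made-up snapshot of one day: _id 10 and 11 are different people
# who happen to share every recorded attribute.
day = pd.DataFrame({
    "_id": [10, 11, 12],
    "Date": ["2016-06-01"] * 3,
    "Gender": ["M", "M", "F"],
    "Race": ["B", "B", "W"],
    "Age at Booking": [25, 25, 40],
    "Current Age": [27, 27, 41],
})

# Every column except the row identifier.
cols = [c for c in day.columns if c != "_id"]

people_present = len(day)
unique_rows = len(day.drop_duplicates(subset=cols))

# Rows that have at least one look-alike on the same day.
ambiguous = int(day.duplicated(subset=cols, keep=False).sum())

print(people_present, unique_rows, ambiguous)  # 3 2 2
```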

import pandas as pd

inmates = pd.read_csv('inmates.csv')

# group by everything except _id and count number of entries
grouped = inmates.groupby(
    ['Gender', 'Race', 'Age at Booking', 'Current Age', 'Date']).count()

# pivot the dates out and transpose - this gives us the number of each
# combination for each day
grouped = grouped.unstack().T.fillna(0)

# get the difference between each day of the month - the assumption here
# being that a negative number means someone left, 0 means that nothing
# has changed and positive means that someone new has come in. As you
# mentioned yourself, that isn't necessarily true
diffed = grouped.diff()

# replace the first day of the month with the grouped numbers to give
# the number in each group at the start of the month
diffed.iloc[0, :] = grouped.iloc[0, :]

# sum only the positive numbers in each row to count those that have
# arrived but ignore those that have left
diffed['total'] = diffed.apply(lambda x: x[x > 0].sum(), axis=1)

# sum total column
diffed['total'].sum()  # 3393
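
The same idea can be followed by hand on a tiny made-up census (not real data). This sketch assumes one row per inmate per day, uses size() to count rows per group, and swaps in clip(lower=0) for the row-wise apply; the names toy, counts and changes are chosen here for illustration.

```python
import pandas as pd

# Made-up census, one row per inmate per day:
# day 1: one (M, B, 25, 27) and one (M, W, 32, 33)
# day 2: a second (M, B, 25, 27) arrives
# day 3: only one (M, B, 25, 27) remains
toy = pd.DataFrame({
    "Date": ["2016-06-01", "2016-06-01",
             "2016-06-02", "2016-06-02", "2016-06-02",
             "2016-06-03"],
    "Gender": ["M"] * 6,
    "Race": ["B", "W", "B", "B", "W", "B"],
    "Age at Booking": [25, 32, 25, 25, 32, 25],
    "Current Age": [27, 33, 27, 27, 33, 27],
})

# Rows are days, columns are attribute combinations, values are head counts.
counts = (
    toy.groupby(["Date", "Gender", "Race", "Age at Booking", "Current Age"])
    .size()
    .unstack("Date", fill_value=0)
    .T
)

# Day-to-day change per combination; the first day keeps its raw counts.
changes = counts.diff()
changes.iloc[0] = counts.iloc[0]

# Keep only increases (arrivals) and add them up.
total = int(changes.clip(lower=0).to_numpy().sum())
print(total)  # 3
```

Here the estimate is three inmates: two present on the first day plus one arrival on the second. If one inmate leaves and a look-alike arrives on the same day, the net change is zero and the arrival is missed, so the figure is a lower bound under this grouping.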
