Trying to get distinct values for columns after grouping; but am getting a total sum groupby is dropping object distinction and losing leading zeros for column
df = pd.read_csv("trial.txt",sep='|',converters={'zip':str},keep_default_na=True,low_memory=False)
Data:
Emp State Zip Jan feb mar
Int NY 11111 1 0 1
int NY 11111 1 1 0
int NC 09999 2 2 0
int ON NH443 2 2 2
after
df2 = df.groupby("Zip").count()
df2 for zip my output for zip = 11111 i'll have the output for all 12 months show 2 2 2. Were I would expect 2 1 1 and zip 09999 shows as 9999.
How what is wrong about the grouping to not get distinct column values. Have account for non-null values (there are no nulls). Column value is only (0 , 1 ,2)
Let's start from why Zip column is read as int .
Note the subtle difference:
Change it to converters={'Zip': str} and the column in question will be read as string .
Another, maybe better solution is to pass dtype={'Zip': object} instead. I wrote better because no conversion is actually needed. It is enough just to specify the column type.
I performed also additional check, ie I changed Zip in the last row from NH443 to 88443 .
So even if this column in the source file contains only digits (in all rows), its type will be object , meaning actually string .
And now let's look at what is actual result of your code:
When you run df2 = df.groupby('Zip').count()
, the result is:
Emp State Jan Feb Mar
Zip
09999 1 1 1 1 1
11111 2 2 2 2 2
88443 1 1 1 1 1
Note the description of count function ( GroupBy variant), which reads: Compute count of group, excluding missing values . As you didn't pass any column list, this count is computed for all columns.
So your code does'n generate distinct values, but instead it counts non-null values along each column in each group.
If you want what you wrote in the title, ie a list of distinct values , for each column (not their counts), run:
df.groupby('Zip').agg(lambda col: np.sort(col.unique()).tolist())
This time the result is:
Emp State Jan Feb Mar
Zip
09999 [Int] [NC] [2] [2] [0]
11111 [Int] [NY] [1] [0, 1] [0, 1]
88443 [Int] [ON] [2] [2] [2]
Eg group 11111 contains in Feb and Mar columns two distinct values: 0 and 1 .
In all other cases (for your sample data) each column in each group contains a single distinct value, so corresponding lists contain only a single element.
count
returns the count of each group, excluding missing values. That means that a value of zero would also be included in the count. To only count positive values, you can apply a lambda function that sums the count of values greater than zero.
>>> df.groupby('Zip')[['Jan', 'feb', 'mar']].apply(lambda x: x.gt(0).sum())
Jan feb mar
Zip
09999 1 1 0
11111 2 1 1
NH443 1 1 1
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.