Pandas: Select First 3 Rows With Distinct Values in Specific Column

Question

I'm working with a Pandas dataframe like this:

  col1 col2 id  name
0   1   2   1   harry
1   2   3   1   harry
2   3   4   2   harry
3   4   5   3   harry
4   5   6   4   harry
5   1   9   6   jane
6   2   10  6   jane
7   3   11  7   jane
8   4   12  7   jane
9   5   12  8   jane

I'd like to create a new dataframe where, for each name, I take the first 3 rows with unique id values.

For instance, for harry, this would be rows 0, 2, and 3. Row 1 is excluded since it has the same id as row 0.

The correct output for my example dataframe is:

  col1 col2 id  name
0   1   2   1   harry
2   3   4   2   harry
3   4   5   3   harry
5   1   9   6   jane
7   3   11  7   jane
9   5   12  8   jane

The example dataframe can be built with this code:

import pandas as pd

example = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
                        "col2": [2, 3, 4, 5, 6, 9, 10, 11, 12, 12],
                        "id": [1, 1, 2, 3, 4, 6, 6, 7, 7, 8],
                        "name": ["harry", "harry", "harry", "harry", "harry",
                                 "jane", "jane", "jane", "jane", "jane"]})

This code works but is very ugly and not vectorized:

kept_rows = []      # index labels of the rows to keep
names_to_ids = {}   # name -> ids already kept for that name
for i, row in example.iterrows():
    curr_name = row["name"]
    curr_id = row["id"]
    print(curr_name, curr_id)
    if curr_name not in names_to_ids:
        kept_rows.append(i)
        names_to_ids[curr_name] = [curr_id]
    elif len(names_to_ids[curr_name]) < 3 and curr_id not in names_to_ids[curr_name]:
        kept_rows.append(i)
        names_to_ids[curr_name].append(curr_id)
result_df = example.loc[kept_rows]

Answer 1

Using drop_duplicates and head (df is the question's example dataframe):

df.drop_duplicates(['id', 'name']).groupby('name').head(3)

   col1  col2  id   name
0     1     2   1  harry
2     3     4   2  harry
3     4     5   3  harry
5     1     9   6   jane
7     3    11   7   jane
9     5    12   8   jane
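
To see the two steps separately, the one-liner can be run against the question's frame. This is a minimal sketch that assumes df is the example dataframe (named example in the question) and relies on drop_duplicates keeping the first occurrence of each (id, name) pair, which is its default (keep='first'). The names deduped and result are only for illustration.

```python
import pandas as pd

# The question's example frame, bound to the name `df` used in this answer.
df = pd.DataFrame({
    "col1": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "col2": [2, 3, 4, 5, 6, 9, 10, 11, 12, 12],
    "id": [1, 1, 2, 3, 4, 6, 6, 7, 7, 8],
    "name": ["harry"] * 5 + ["jane"] * 5,
})

# Step 1: keep only the first row for each (id, name) pair.
deduped = df.drop_duplicates(["id", "name"])

# Step 2: within each name, keep the first 3 remaining rows.
result = deduped.groupby("name").head(3)

print(result.index.tolist())  # [0, 2, 3, 5, 7, 9]
```

Deduplicating on both id and name matters if the same id can appear under two different names. With ['id'] alone, the second name would lose that row.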

Answer 2

Another way is to use a double groupby with head:

df.groupby(['name', 'id']).head(1).groupby('name').head(3)

   col1  col2  id   name
0     1     2   1  harry
2     3     4   2  harry
3     4     5   3  harry
5     1     9   6   jane
7     3    11   7   jane
9     5    12   8   jane
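
The first groupby(...).head(1) plays the role of a de-duplication step: it keeps the first row of every (name, id) group and leaves the rows in their original order. The sketch below, which assumes df is the question's example dataframe, checks that this intermediate frame matches what drop_duplicates would give. The names first_per_pair and same are only for illustration.

```python
import pandas as pd

# The question's example frame, bound to the name `df` used in this answer.
df = pd.DataFrame({
    "col1": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "col2": [2, 3, 4, 5, 6, 9, 10, 11, 12, 12],
    "id": [1, 1, 2, 3, 4, 6, 6, 7, 7, 8],
    "name": ["harry"] * 5 + ["jane"] * 5,
})

# First row of each (name, id) group, in original row order.
first_per_pair = df.groupby(["name", "id"]).head(1)

# First 3 of those rows for each name.
result = first_per_pair.groupby("name").head(3)

# Compare the intermediate frame with the drop_duplicates approach.
same = first_per_pair.equals(df.drop_duplicates(["id", "name"]))
print(same)
print(result.index.tolist())
```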

Answer 3

Using drop_duplicates and then GroupBy + cumcount:

res = df.drop_duplicates(['id', 'name'])
res = res.loc[res.groupby('name').cumcount().lt(3)]

print(res)

   col1  col2  id   name
0     1     2   1  harry
2     3     4   2  harry
3     4     5   3  harry
5     1     9   6   jane
7     3    11   7   jane
9     5    12   8   jane
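
It can help to look at what cumcount produces before filtering: it numbers the rows within each name starting at 0, so lt(3) keeps positions 0, 1 and 2. A minimal sketch, assuming df is the question's example dataframe (position is a name introduced here for illustration):

```python
import pandas as pd

# The question's example frame, bound to the name `df` used in this answer.
df = pd.DataFrame({
    "col1": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "col2": [2, 3, 4, 5, 6, 9, 10, 11, 12, 12],
    "id": [1, 1, 2, 3, 4, 6, 6, 7, 7, 8],
    "name": ["harry"] * 5 + ["jane"] * 5,
})

# Keep the first row for each (id, name) pair.
res = df.drop_duplicates(["id", "name"])

# 0-based position of each remaining row within its name group.
position = res.groupby("name").cumcount()
print(position.tolist())  # [0, 1, 2, 3, 0, 1, 2]

# Boolean mask: True for the first 3 rows of each name.
res = res.loc[position.lt(3)]
print(res.index.tolist())  # [0, 2, 3, 5, 7, 9]
```

Because the filter is an ordinary boolean mask, it can be combined with other row conditions using & before indexing.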
