
Get only first row per subject in dataframe

I was wondering if there is an easy way to get only the first row of each grouped object (subject id for example) in a dataframe. Doing this:

    for index, row in df.iterrows():
        # do stuff
        pass

gives us each one of the rows, but I am interested in doing something like this:

    groups = df.groupby('Subject id')
    for index, row in groups.iterrows():
        # give me the first row of each group
        continue

Is there a pythonic way to do the above?

Direct solution - without .groupby() - by .drop_duplicates()

What you want is to keep only the first occurrence of each value in a specific column:

df.drop_duplicates(subset='Subject id', keep='first')
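
As a minimal sketch of the call above on made-up data (only the 'Subject id' column name comes from the question; the values and the 'val' column are hypothetical):

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    'Subject id': [1, 1, 2, 2, 2, 3],
    'val': [20, 32, 12, 34, 45, 43],
})

# Keep the first row seen for each 'Subject id'.
# The original index labels of the kept rows are preserved.
first_rows = df.drop_duplicates(subset='Subject id', keep='first')
print(first_rows)
```

If a fresh 0-based index is preferred, .reset_index(drop=True) can be chained at the end.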

General solution

Using the .apply(func) in Pandas:

df.groupby('Subject id').apply(lambda df: df.iloc[0, :])

It applies a function (usually defined on the fly with lambda) to the sub-DataFrame of every group produced by df.groupby() and combines the results into a single final DataFrame.
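
To make the split-apply-combine step concrete, here is a small sketch on hypothetical data. It also shows .head(1), which selects the same rows without going through .apply(). Depending on the pandas version, the .apply() call may emit a deprecation warning about grouping columns, and the exact columns of its result can differ between versions.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    'Subject id': [1, 1, 2, 2, 2, 3],
    'val': [20, 32, 12, 34, 45, 43],
})

# Each group's sub-DataFrame is passed to the lambda; iloc[0] picks its first row.
# The combined result is indexed by 'Subject id'.
via_apply = df.groupby('Subject id').apply(lambda g: g.iloc[0])

# head(1) picks the same rows but keeps the original index and columns.
via_head = df.groupby('Subject id').head(1)

print(via_apply)
print(via_head)
```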

However, the solution by @AkshayNevrekar using .first() is really nice. As in that answer, you can also chain a .reset_index() at the end here.

Consider this the more general solution, since you can take any nth row by changing the position passed to .iloc. However, it works only if every sub-DataFrame has more than n rows (positions are 0-based). Otherwise, use:

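import pandas as pd

n = 3
col = 'Subject id'
rows = []
for name, group in df.groupby(col):
    # Skip groups that have no row at 0-based position n.
    if n < group.shape[0]:
        rows.append(group.iloc[[n]])
# DataFrame.append() was removed in pandas 2.0, so concatenate once at the end.
res_df = pd.concat(rows) if rows else df.iloc[:0]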

Or as a function:

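def group_by_select_nth_row(df, col, n):
    """Return the row at 0-based position n of every group that has one."""
    rows = []
    for name, group in df.groupby(col):
        if n < group.shape[0]:
            rows.append(group.iloc[[n]])
    # DataFrame.append() was removed in pandas 2.0; use pd.concat() instead.
    return pd.concat(rows) if rows else df.iloc[:0]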

A common source of confusion: DataFrame.append(), unlike list.append(), never modified the DataFrame in place. It returned a new DataFrame and left the original unchanged, so the result always had to be reassigned. The method was deprecated in pandas 1.4 and removed in pandas 2.0; collect the pieces in a list and combine them with pd.concat() instead.
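
As an alternative sketch of the nth-row selection that avoids the explicit loop: groupby().cumcount() numbers the rows within each group starting at 0, so a boolean mask can pick the row at position n and simply skips groups that are too short. The data and the name nth_rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: group 1 has 2 rows, group 2 has 3, group 3 has 1.
df = pd.DataFrame({
    'Subject id': [1, 1, 2, 2, 2, 3],
    'val': [20, 32, 12, 34, 45, 43],
})

n = 1  # 0-based position within each group

# cumcount() gives each row its position inside its own group (0, 1, 2, ...).
position_in_group = df.groupby('Subject id').cumcount()

# Keep only rows sitting at position n; groups with n or fewer rows drop out.
nth_rows = df[position_in_group == n]
print(nth_rows)
```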

Use first() to get the first row of each group.

import pandas as pd

df = pd.DataFrame({'subject_id': [1, 1, 2, 2, 2, 3, 4, 4], 'val': [20, 32, 12, 34, 45, 43, 23, 10]})

# Equivalent: print(df.groupby('subject_id').first().reset_index())
print(df.groupby('subject_id', as_index=False).first())

Output:

   subject_id  val
0           1   20
1           2   12
2           3   43
3           4   23
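
One caveat worth checking on your own data: by default, first() returns the first non-null value of each column within a group, which is not necessarily the literal first row when values are missing. The sketch below, on hypothetical data with a NaN, contrasts it with head(1), which always returns the first row as stored.

```python
import pandas as pd

# Hypothetical data: the first row of subject 1 has a missing 'val'.
df = pd.DataFrame({
    'subject_id': [1, 1, 2],
    'val': [float('nan'), 32.0, 12.0],
})

# first() skips the NaN and reports 32.0 for subject 1.
by_first = df.groupby('subject_id', as_index=False).first()

# head(1) returns the literal first row of each group, NaN included.
by_head = df.groupby('subject_id').head(1)

print(by_first)
print(by_head)
```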
