
split dataframe based on column value

I have a df that contains several IDs. I'm trying to run a regression on the data, and I need to split it by ID so that I can apply the regression to each ID separately:

Sample DF (this is only a sample; the real data is larger):

[image]

I tried to save the IDs in a list like this:

id_list = []

for data in df['id'].unique():
    id_list.append(data)

The list output is [1, 2, 3].

Then I tried to use that list to filter the DF:

def create_dataframe(df):

    for unique_id in id_list:
        df = df[df['Campaign ID'] == unique_id]
        return df

When I call the function, the result is:

[image]

However, I only get the result for the first ID in the list. The other two IDs [2, 3] do not return any DF, which means that at some point the loop breaks.

Here is the entire code:

import pandas as pd

df = pd.read_csv('budget.csv')

id_list = []

for unique_id in df['id'].unique():
    id_list.append(unique_id)


def create_dataframe(df):

    for unique_id in id_list:
        df = df[df['Campaign ID'] == unique_id]
        return df

print(create_dataframe(df))

You seem to be overwriting the df value in the for loop. I would recommend creating the result outside of the for loop and then adding to it on each iteration instead of overwriting it.

You can use the expression df.loc[df['id'] == item] to extract the sub-dataframe whose rows have a particular value in a column.

Please refer to the full code below:

import pandas as pd

df_dict = {"id" : [1,1,1,2,2,2,3,3,3],
           "value" : [12,13,14,22,23,24,32,33,34]
           }

df = pd.DataFrame(df_dict)
print(df)
id_list = []
for data in df['id'].unique():
    id_list.append(data)

print(id_list)

for item in id_list:
    sub_df = df.loc[df['id'] == item]
    print(sub_df)
    print("****")

This generates the following output, with one sub-dataframe for each distinct id:

   id  value
0   1     12
1   1     13
2   1     14
3   2     22
4   2     23
5   2     24
6   3     32
7   3     33
8   3     34
[1, 2, 3]
   id  value
0   1     12
1   1     13
2   1     14
****
   id  value
3   2     22
4   2     23
5   2     24
****
   id  value
6   3     32
7   3     33
8   3     34
****

In your code snippet, the issue is that create_dataframe() is called only once, and the return statement sits inside the loop. On the first iteration, once the sub-dataframe for id = 1 has been computed, return exits the function immediately. Hence you only get the sub-dataframe for id = 1.
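One possible fix, sketched here under the assumption that the column is named id as in the sample above, is to collect every sub-dataframe in a dict and return only after the loop has finished. The function name split_by_id and its column parameter are hypothetical, not part of the original code.

```python
import pandas as pd

def split_by_id(df, column="id"):
    """Return a dict mapping each distinct value in `column` to its sub-dataframe."""
    sub_dfs = {}
    for unique_id in df[column].unique():
        # Store the filtered rows under a new name so df itself is never overwritten.
        sub_dfs[unique_id] = df.loc[df[column] == unique_id]
    # Return after the loop, not inside it, so every id is processed.
    return sub_dfs

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "value": [12, 13, 14, 22, 23, 24, 32, 33, 34]})

sub_dfs = split_by_id(df)
print(list(sub_dfs))   # the distinct ids found
print(sub_dfs[2])      # the rows where id == 2
```

Because the function no longer reassigns df, the original dataframe stays intact and each regression can be run on sub_dfs[some_id].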

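You can use numpy.split (newer pandas versions may emit a FutureWarning for this call):

import numpy as np
df = df.sort_values('id')
# Split at the row positions where the id differs from the previous row.
np.split(df, np.flatnonzero(df['id'].diff().fillna(0).to_numpy()))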

or pandas groupby:

grp = df.groupby('id')
[grp.get_group(g) for g in grp.groups]
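As a variant of the groupby approach, iterating over the groupby object directly yields (key, sub-dataframe) pairs, so a dict comprehension is enough. This is a minimal sketch that reuses the sample data from the earlier answer.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "value": [12, 13, 14, 22, 23, 24, 32, 33, 34]})

# Each iteration gives the group key and the rows belonging to that key.
sub_dfs = {key: group for key, group in df.groupby("id")}

for key, group in sub_dfs.items():
    print(key)
    print(group)
```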

That said, I think you can run the regression directly with pandas groupby, since it applies any function you want to each group, treating each group as a distinct dataframe.
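A rough sketch of that idea follows. The question does not show which columns enter the regression, so the predictor column x and the helper fit_line are assumptions made for illustration, and a simple straight-line fit with numpy.polyfit stands in for whatever regression is actually needed.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "x" is an assumed predictor column, not part of the original sample.
df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "x":     [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "value": [12, 13, 14, 22, 23, 24, 32, 33, 34],
})

def fit_line(group):
    """Least-squares fit of value = slope * x + intercept for one group."""
    slope, intercept = np.polyfit(group["x"], group["value"], deg=1)
    return pd.Series({"slope": slope, "intercept": intercept})

# Select only the columns the fit needs, then apply the fit to each id's rows.
coefs = df.groupby("id")[["x", "value"]].apply(fit_line)
print(coefs)
```

The result has one row per id, so there is no need to build the sub-dataframes by hand first.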
