Creating a new data frame from an existing data frame based on a condition (for loops, Python)

I have a data frame (main_data) with columns such as net sales, product, vendor, etc. I need to create sub-data frames from this main_data table, one per vendor. Let's say there are 5 unique vendors (vendor1, vendor2, vendor3, vendor4 and vendor5) in the vendor column. I need to create 5 different sub-data frames, one for each of these vendors. Each sub-data frame should contain all columns from the main table, but only the rows for vendorX.

How would I do this by using for loops?

If you are using pandas, you can do:

df_v1 = main_data[main_data['vendor'] =='vendor1']
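As a minimal sketch of that one-liner in context, assuming a small made-up main_data frame (only the 'vendor' column name comes from the question; the products and net_sales figures are purely illustrative):

```python
import pandas as pd

# Hypothetical sample data; the values are invented for illustration.
main_data = pd.DataFrame({
    'product': ['P1', 'P2', 'P3', 'P4'],
    'vendor': ['vendor1', 'vendor2', 'vendor1', 'vendor3'],
    'net_sales': [100, 250, 75, 300],
})

# The comparison yields one True/False per row; indexing with it
# keeps only the rows where the value is True.
df_v1 = main_data[main_data['vendor'] == 'vendor1']
print(df_v1)
```

The filtered frame keeps the original row labels (0 and 2 here). Call df_v1.reset_index(drop=True) if you would rather have them renumbered from 0.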

Let's say the following is your DataFrame: [image]

As the image above shows, there are 5 vendors (v1, v2, v3, v4, v5).

Code:

import pandas as pd

# Load the DataFrame from an Excel dump
df = pd.read_excel('stack.xlsx')

# Unique vendor names (taken from a set, so their order is arbitrary)
dfList = list(set(df['vendor']))

# One variable name per vendor, e.g. "dfv1" for vendor "v1"
dfNames = ["df" + row for row in dfList]

for dfName, row in zip(dfNames, dfList):
    # Keep only the rows that belong to this vendor
    dfNew = df[df['vendor'] == row]
    # Register the sub-DataFrame as a module-level variable under that name
    globals()[dfName] = dfNew
    print(globals()[dfName])
    print('------------------------------------------')

# The loop above creates 5 DataFrames named dfv1, dfv2, dfv3, dfv4 and dfv5
# (in arbitrary order, because of the set). You can now use all of them.

Output: [image]

Consider this:

import pandas as pd

data = {'product': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7'], 
        'vendor': ['vendor1','vendor2','vendor2','vendor1','vendor3', 'vendor2', 'vendor3'] }
main_data = pd.DataFrame(data)

print('Original dataframe:')
print(main_data)
print('-----')

# maps each vendor name (vendorX) to the sub-dataframe for that vendor
by_vendor = dict()
for vendorX in main_data.vendor.unique():
    maskX = main_data['vendor'] == vendorX
    by_vendor[vendorX] = main_data[maskX]

for vendorX, sub_data in by_vendor.items():
    print('subdataframe for vendor ', vendorX)
    print(sub_data)
    print('-----')

This is the output:

Original dataframe:
  product   vendor
0      P1  vendor1
1      P2  vendor2
2      P3  vendor2
3      P4  vendor1
4      P5  vendor3
5      P6  vendor2
6      P7  vendor3
-----
subdataframe for vendor  vendor1
  product   vendor
0      P1  vendor1
3      P4  vendor1
-----
subdataframe for vendor  vendor2
  product   vendor
1      P2  vendor2
2      P3  vendor2
5      P6  vendor2
-----
subdataframe for vendor  vendor3
  product   vendor
4      P5  vendor3
6      P7  vendor3
-----

Note that the output has three vendors in this case, but would have more if main_data had more of them. This code can handle any number of unique vendors.

Here, the result is stored in a dictionary named by_vendor, which maps each vendorX to its sub_data dataframe. You can access a sub-dataframe with by_vendor[vendorX] (by_vendor['vendor1'], by_vendor['vendor2'], etc.).
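A short sketch of that lookup, rebuilding the same sample frame so the snippet runs on its own (the names vendors and vendor2_products are just illustrative):

```python
import pandas as pd

main_data = pd.DataFrame({
    'product': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7'],
    'vendor': ['vendor1', 'vendor2', 'vendor2', 'vendor1',
               'vendor3', 'vendor2', 'vendor3'],
})

# Build the vendor -> sub-dataframe dictionary as in the answer.
by_vendor = {}
for vendorX in main_data['vendor'].unique():
    by_vendor[vendorX] = main_data[main_data['vendor'] == vendorX]

# The keys are the vendor names, in order of first appearance.
vendors = list(by_vendor)

# Look up one vendor's sub-dataframe by name and read a column from it.
vendor2_products = by_vendor['vendor2']['product'].tolist()

print(vendors)
print(vendor2_products)
```

Because the keys are plain strings, you can also loop over by_vendor.items() to process every vendor without knowing the names in advance.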

The line for vendorX in main_data.vendor.unique(): iterates over all the unique entries in the vendor column. For each unique vendor vendorX, we do the following:

maskX is a Series containing a True/False value for each row, depending on whether the vendor for that row equals vendorX.

We use maskX with boolean indexing to create the sub_data dataframe for vendorX.

The left-hand side of the assignment stores the sub_data belonging to vendorX in the dictionary under the key vendorX.

The two statements can be combined into a single one: by_vendor[vendorX] = main_data[main_data['vendor'] == vendorX]
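To make the mask concrete, here is a small sketch for a single vendor, using the same sample data (the names two_step and one_step are only for illustration):

```python
import pandas as pd

main_data = pd.DataFrame({
    'product': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7'],
    'vendor': ['vendor1', 'vendor2', 'vendor2', 'vendor1',
               'vendor3', 'vendor2', 'vendor3'],
})

# One boolean per row: True where the vendor is 'vendor2'.
maskX = main_data['vendor'] == 'vendor2'
print(maskX.tolist())

# Two-step form (mask, then index) versus the combined one-liner.
two_step = main_data[maskX]
one_step = main_data[main_data['vendor'] == 'vendor2']

# Both forms select exactly the same rows.
print(two_step.equals(one_step))
```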

You can ditch the by_vendor dictionary and still use boolean indexing to manually assign the results to five separate variables, one per vendor, if you'd like. I find the dictionary approach more elegant because it works for any number of vendors.
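As an aside, and not the approach used in this answer, a for loop over pandas' groupby can build the same dictionary without computing a mask per vendor. A sketch under the same sample data:

```python
import pandas as pd

main_data = pd.DataFrame({
    'product': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7'],
    'vendor': ['vendor1', 'vendor2', 'vendor2', 'vendor1',
               'vendor3', 'vendor2', 'vendor3'],
})

# Iterating over a groupby yields (group key, sub-dataframe) pairs.
by_vendor = {}
for vendorX, sub_data in main_data.groupby('vendor'):
    by_vendor[vendorX] = sub_data

for vendorX, sub_data in by_vendor.items():
    print('subdataframe for vendor ', vendorX)
    print(sub_data)
    print('-----')
```

One difference to be aware of: groupby sorts the group keys by default, whereas unique() returns vendors in order of first appearance. With this sample data both orders happen to coincide.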
