
Reshaping data frame and counting values based on criteria

I have the data set below and want to tag each customer with a type. Excel crashes on this much data when I try it there, so I am trying to do it in Python.

item  customer qty
------------------
ProdA CustA    1 
ProdA CustB    1
ProdA CustC    1
ProdA CustD    1
ProdB CustA    1
ProdB CustB    1

In Excel, I would:

1. Create new columns "ProdA", "ProdB", "Type"
2. Remove duplicates for column "customer"
3. For each customer, COUNTIF item = ProdA and COUNTIF item = ProdB
4. IF(AND(ProdA = 1, ProdB = 1), "Both", "One")


customer ProdA ProdB Type
--------------------------
CustA    1     1     Both
CustB    1     1     Both
CustC    1     0     One
CustD    1     0     One

Method 1:

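We can achieve this with pd.crosstab, then sum the ProdA and ProdB columns and use Series.map to translate 2 -> Both and 1 -> One:

import numpy as np   # only needed for the np.where variant
import pandas as pd

# Count rows per (customer, item) pair, one row per customer.
dfn = pd.crosstab(df['customer'], df['item']).reset_index()
dfn['Type'] = dfn[['ProdA', 'ProdB']].sum(axis=1).map({2:'Both', 1:'One'})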

Or we can use np.where in the last line to conditionally assign Both or One:

dfn['Type'] = np.where(dfn['ProdA'].eq(1) & dfn['ProdB'].eq(1), 'Both', 'One')

Output:

item customer  ProdA  ProdB  Type
0       CustA      1      1  Both
1       CustB      1      1  Both
2       CustC      1      0   One
3       CustD      1      0   One
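
For reference, a self-contained sketch of Method 1 that rebuilds the sample frame from the question, so it can be run as-is. The `gt(0)` step is an addition, not part of the answer above: it assumes a customer might appear more than once for the same item, in which case the crosstab count would exceed 1 and a plain sum would no longer equal the number of distinct products.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'item': ['ProdA', 'ProdA', 'ProdA', 'ProdA', 'ProdB', 'ProdB'],
    'customer': ['CustA', 'CustB', 'CustC', 'CustD', 'CustA', 'CustB'],
    'qty': [1, 1, 1, 1, 1, 1],
})

# crosstab counts rows per (customer, item) pair; the qty column is not used.
dfn = pd.crosstab(df['customer'], df['item']).reset_index()

# Count how many distinct products each customer bought
# (gt(0) turns any positive count into True), then map to a label.
n_products = dfn[['ProdA', 'ProdB']].gt(0).sum(axis=1)
dfn['Type'] = n_products.map({2: 'Both', 1: 'One'})

print(dfn)
```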

Method 2:

We can also let pd.crosstab compute the per-customer totals itself with margins=True, naming the totals column Type and dropping the extra totals row with iloc[:-1]:

dfn = pd.crosstab(df['customer'], df['item'], 
                  margins=True, 
                  margins_name='Type').iloc[:-1].reset_index()

dfn['Type'] = dfn['Type'].map({2:'Both', 1:'One'})

Output:

item customer  ProdA  ProdB  Type
0       CustA      1      1  Both
1       CustB      1      1  Both
2       CustC      1      0   One
3       CustD      1      0   One
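
To see why the `.iloc[:-1]` is there, it may help to look at the intermediate table. The sketch below, using the question's sample data, keeps the raw crosstab in a variable (`raw` is just an illustrative name) before trimming it: `margins=True` appends both a totals column and a totals row, and both take the name given in `margins_name`.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'item': ['ProdA', 'ProdA', 'ProdA', 'ProdA', 'ProdB', 'ProdB'],
    'customer': ['CustA', 'CustB', 'CustC', 'CustD', 'CustA', 'CustB'],
    'qty': [1, 1, 1, 1, 1, 1],
})

# Totals are added as a last column AND a last row, both labelled 'Type'.
raw = pd.crosstab(df['customer'], df['item'],
                  margins=True, margins_name='Type')
print(raw)

# The last row holds per-item totals rather than a customer, so drop it.
dfn = raw.iloc[:-1].reset_index()
dfn['Type'] = dfn['Type'].map({2: 'Both', 1: 'One'})
print(dfn)
```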

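Try using set_index, unstack and np.select:

import numpy as np

# One row per customer, one column per item; missing pairs become 0.
df_out = df.set_index(['customer', 'item'])['qty'].unstack(fill_value=0)
SumProd = df_out['ProdA'] + df_out['ProdB']
# A string default avoids mixing str choices with the integer default 0 (a TypeError on recent NumPy).
df_out['Type'] = np.select([SumProd==2, SumProd==1], ['Both', 'One'], default='None')
print(df_out)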

Output:

item      ProdA  ProdB  Type
customer                    
CustA         1      1  Both
CustB         1      1  Both
CustC         1      0   One
CustD         1      0   One
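
One caveat with this approach: `unstack` raises a ValueError when the same (customer, item) pair occurs more than once. Assuming repeated rows should simply have their quantities summed (an assumption, since the question's sample has no duplicates), `pivot_table` with `aggfunc='sum'` is one possible alternative. The extra CustC row below is hypothetical and added only to show the effect.

```python
import numpy as np
import pandas as pd

# Question's sample data plus one hypothetical repeated row (CustC, ProdA).
df = pd.DataFrame({
    'item': ['ProdA', 'ProdA', 'ProdA', 'ProdA', 'ProdB', 'ProdB', 'ProdA'],
    'customer': ['CustA', 'CustB', 'CustC', 'CustD', 'CustA', 'CustB', 'CustC'],
    'qty': [1, 1, 1, 1, 1, 1, 1],
})

# pivot_table aggregates duplicates instead of failing on them.
df_out = df.pivot_table(index='customer', columns='item',
                        values='qty', aggfunc='sum', fill_value=0)

# Classify by the number of products with a positive quantity,
# not by the summed quantity, which can now exceed 1.
bought = df_out[['ProdA', 'ProdB']].gt(0).sum(axis=1)
df_out['Type'] = np.select([bought == 2, bought == 1],
                           ['Both', 'One'], default='None')
print(df_out)
```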

In addition to the other suggestions, you could skip Pandas entirely:

################################################################################
## Data ingestion
################################################################################
import csv
import io

# Formatted as CSV to make the example more straightforward.
input_data = io.StringIO('''item,customer,qty
ProdA,CustA,1
ProdA,CustB,1
ProdA,CustC,1
ProdA,CustD,1
ProdB,CustA,1
ProdB,CustB,1
''')

# Read each CSV row into a dict keyed by column name.
records = []
reader = csv.DictReader(input_data)
for row in reader:
  records.append(row)

################################################################################
## Data transformation.
## Makes a dict of dicts. Each inner dict contains all data for a single
## customer.
################################################################################
products = {'ProdA', 'ProdB'}
customer_data = {}

for r in records:
  customer_id = r['customer']
  if customer_id not in customer_data:
    customer_data[customer_id] = {}
  customer_data[customer_id][r['item']] = int(r['qty'])

# Determines the customer type.
for c in customer_data:
  c_data = customer_data[c]
  missing_product = products.difference(c_data.keys())
  if missing_product:
    # Fill in a zero quantity for every product the customer never bought.
    for missing_p in missing_product:
      c_data[missing_p] = 0
    c_data['type'] = 'One'
  else:
    c_data['type'] = 'Both'

################################################################################
## Data display
################################################################################
# Fix the column order so the header and every row line up.
columns = sorted(products) + ['type']
print('\t'.join(['ID'] + columns))
for c in customer_data:
  print('\t'.join([c] + [str(customer_data[c][col]) for col in columns]))

On Python 3.7+, where dicts preserve insertion order, this prints:

ID      ProdA   ProdB   type
CustA   1       1       Both
CustB   1       1       Both
CustC   1       0       One
CustD   1       0       One
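
A more compact take on the same pandas-free idea, offered only as a sketch: collect the set of items each customer bought, then label the customer by whether that set covers every product. The helper name `classify` and the tuple layout of `rows` are illustrative choices, not part of the answer above.

```python
from collections import defaultdict

# Sample data from the question as (item, customer, qty) tuples.
rows = [
    ('ProdA', 'CustA', 1), ('ProdA', 'CustB', 1), ('ProdA', 'CustC', 1),
    ('ProdA', 'CustD', 1), ('ProdB', 'CustA', 1), ('ProdB', 'CustB', 1),
]

def classify(rows, products=('ProdA', 'ProdB')):
    """Return {customer: {product: 0 or 1, ..., 'Type': 'Both' or 'One'}}."""
    bought = defaultdict(set)
    for item, customer, qty in rows:
        if qty > 0:
            bought[customer].add(item)

    result = {}
    for customer, items in bought.items():
        # 1 if the customer bought the product at least once, else 0.
        row = {p: int(p in items) for p in products}
        row['Type'] = 'Both' if all(row[p] for p in products) else 'One'
        result[customer] = row
    return result

summary = classify(rows)
for customer, row in summary.items():
    print(customer, row)
```

Using a set per customer means repeated purchases of the same item do not change the label.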


 