I have the data set below and want to tag each customer with a type. Excel crashes when I attempt this because there is too much data, so I am trying to do it in Python.
item customer qty
------------------
ProdA CustA 1
ProdA CustB 1
ProdA CustC 1
ProdA CustD 1
ProdB CustA 1
ProdB CustB 1
In Excel, I would:
1. Create new columns "ProdA", "ProdB", "Type"
2. Remove duplicates for column "customer"
3. COUNTIF Customer = ProdA, COUNTIF customer = ProdB
4. IF(AND(ProdA = 1, ProdB = 1), "Both", "One")
customer ProdA ProdB Type
--------------------------
CustA 1 1 Both
CustB 1 1 Both
CustC 1 0 One
CustD 1 0 One
We can achieve this using pd.crosstab, then taking the row-wise sum of ProdA and ProdB and using Series.map to translate 2 -> Both and 1 -> One:
dfn = pd.crosstab(df['customer'], df['item']).reset_index()
dfn['Type'] = dfn[['ProdA', 'ProdB']].sum(axis=1).map({2:'Both', 1:'One'})
Or we can use np.where in the last line to conditionally assign Both or One:
dfn['Type'] = np.where(dfn['ProdA'].eq(1) & dfn['ProdB'].eq(1), 'Both', 'One')
item customer ProdA ProdB Type
0 CustA 1 1 Both
1 CustB 1 1 Both
2 CustC 1 0 One
3 CustD 1 0 One
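For anyone who wants to run the crosstab approach end to end, here is a minimal sketch. It assumes the question's sample rows are loaded into a DataFrame named df (the snippets above take df as given); with real data you would read it from your own file instead.

```python
import pandas as pd

# Sample data copied from the question (assumed to live in a DataFrame named df).
df = pd.DataFrame({
    'item': ['ProdA', 'ProdA', 'ProdA', 'ProdA', 'ProdB', 'ProdB'],
    'customer': ['CustA', 'CustB', 'CustC', 'CustD', 'CustA', 'CustB'],
    'qty': [1, 1, 1, 1, 1, 1],
})

# Count rows per (customer, item) pair: customers become rows, items become columns.
dfn = pd.crosstab(df['customer'], df['item']).reset_index()

# A customer with one row for each product has a row sum of 2, otherwise 1.
dfn['Type'] = dfn[['ProdA', 'ProdB']].sum(axis=1).map({2: 'Both', 1: 'One'})

print(dfn)
```

Note that crosstab counts rows rather than summing qty, and map returns NaN for any sum that is not a key in the dictionary, so this only works as shown when each customer/item pair appears at most once.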
We can also use pd.crosstab more extensively with the margins=True argument:
dfn = pd.crosstab(df['customer'], df['item'],
                  margins=True,
                  margins_name='Type').iloc[:-1].reset_index()
dfn['Type'] = dfn['Type'].map({2:'Both', 1:'One'})
item customer ProdA ProdB Type
0 CustA 1 1 Both
1 CustB 1 1 Both
2 CustC 1 0 One
3 CustD 1 0 One
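To see why the .iloc[:-1] is there, it may help to look at the table before it is trimmed. This is a small sketch on the question's sample data; the names full and per_customer are only for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'item': ['ProdA', 'ProdA', 'ProdA', 'ProdA', 'ProdB', 'ProdB'],
    'customer': ['CustA', 'CustB', 'CustC', 'CustD', 'CustA', 'CustB'],
    'qty': [1, 1, 1, 1, 1, 1],
})

# With margins=True, crosstab appends a totals column and a totals row,
# both labelled with margins_name.
full = pd.crosstab(df['customer'], df['item'], margins=True, margins_name='Type')
print(full)

# The last row holds per-product totals, not a customer, so it is dropped
# before the per-customer totals column is mapped to 'Both' / 'One'.
per_customer = full.iloc[:-1]
print(per_customer)
```

The totals column is what ends up being mapped, so the same caveat applies: it counts rows, not distinct products.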
Try using set_index, unstack and np.select:
df_out = df.set_index(['customer', 'item'])['qty'].unstack(fill_value=0)
SumProd = df_out['ProdA'] + df_out['ProdB']
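# Pass a string default: recent NumPy versions reject the implicit default of 0 alongside string choices.
df_out['Type'] = np.select([SumProd==2, SumProd==1], ['Both', 'One'], default='None')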
print(df_out)
Output:
item ProdA ProdB Type
customer
CustA 1 1 Both
CustB 1 1 Both
CustC 1 0 One
CustD 1 0 One
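One thing to watch with unstack: it raises a ValueError if the same customer/item pair appears on more than one row. A possible workaround, sketched below on hypothetical data with a repeated pair, is to sum qty per pair first and then count distinct products rather than quantities. The names qty and n_products are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data (not from the question): CustA has two ProdA rows.
df = pd.DataFrame({
    'item': ['ProdA', 'ProdA', 'ProdB', 'ProdA'],
    'customer': ['CustA', 'CustA', 'CustA', 'CustB'],
    'qty': [1, 2, 1, 1],
})

# Sum qty per (customer, item) so the index is unique before unstacking.
qty = df.groupby(['customer', 'item'])['qty'].sum().unstack(fill_value=0)

# Count how many distinct products each customer bought, ignoring quantities.
n_products = qty.gt(0).sum(axis=1)

qty['Type'] = np.select([n_products == 2, n_products == 1],
                        ['Both', 'One'], default='None')
print(qty)
```

Counting products with gt(0) means a customer who bought three units of ProdA and nothing else is still tagged One.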
In addition to the other suggestions, you could skip Pandas entirely:
################################################################################
## Data ingestion
################################################################################
import csv
from io import StringIO  # Python 3; on Python 2 this was `import StringIO`

# Formatted inline to make the example more straightforward.
input_data = StringIO('''item,customer,qty
ProdA,CustA,1
ProdA,CustB,1
ProdA,CustC,1
ProdA,CustD,1
ProdB,CustA,1
ProdB,CustB,1
''')

# Each record is a dict keyed by the CSV header names.
records = list(csv.DictReader(input_data))

################################################################################
## Data transformation.
## Makes a Dict-of-Dicts. Each inner Dict contains all data for a single
## customer.
################################################################################
products = {'ProdA', 'ProdB'}
customer_data = {}
for r in records:
    customer_id = r['customer']
    if customer_id not in customer_data:
        customer_data[customer_id] = {}
    customer_data[customer_id][r['item']] = int(r['qty'])

# Determines the customer type.
for c in customer_data:
    c_data = customer_data[c]
    missing_product = products.difference(c_data.keys())
    if missing_product:
        # Record a zero quantity for every product the customer never bought.
        for missing_p in missing_product:
            c_data[missing_p] = 0
        c_data['type'] = 'One'
    else:
        c_data['type'] = 'Both'

################################################################################
## Data display
################################################################################
# Use a fixed column order so the header always lines up with the rows.
columns = sorted(products) + ['type']
print('\t'.join(['ID'] + columns))
for c, c_data in customer_data.items():
    print('\t'.join([c] + [str(c_data[col]) for col in columns]))
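Which, for me, prints this (row and column order may differ on your machine, depending on the Python version and how dict ordering works there):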
ID ProdA type ProdB
CustC 1 One 0
CustB 1 Both 1
CustA 1 Both 1
CustD 1 One 0
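If only the tag is needed and not the per-product columns, a shorter standard-library variant is to collect the set of products per customer and compare it against the full product set. This is a rough sketch on the question's sample rows; the names rows, bought and types are made up for the example.

```python
from collections import defaultdict

# Sample rows from the question as (item, customer, qty) tuples.
rows = [
    ('ProdA', 'CustA', 1),
    ('ProdA', 'CustB', 1),
    ('ProdA', 'CustC', 1),
    ('ProdA', 'CustD', 1),
    ('ProdB', 'CustA', 1),
    ('ProdB', 'CustB', 1),
]

products = {'ProdA', 'ProdB'}

# Map each customer to the set of distinct products they bought.
bought = defaultdict(set)
for item, customer, qty in rows:
    if qty > 0:
        bought[customer].add(item)

# A customer whose set covers every product is 'Both', otherwise 'One'.
types = {c: 'Both' if products <= items else 'One' for c, items in bought.items()}
print(types)
```

Because sets ignore repeats, duplicate rows for the same customer and product do not change the result.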