Reshape a DataFrame and count values based on a criterion

Question

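I have the dataset below. I am trying to determine each customer's type by assigning a label. When I tried this in Excel, it crashed because there is too much data, so I am trying to do it in Python instead.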

item  customer qty
------------------
ProdA CustA    1 
ProdA CustB    1
ProdA CustC    1
ProdA CustD    1
ProdB CustA    1
ProdB CustB    1

In Excel, I would:

1. Create new columns "ProdA", "ProdB", "Type"
2. Remove duplicates for column "customer"
3. COUNTIF Customer = ProdA, COUNTIF customer = ProdB
4. IF(AND(ProdA = 1, ProdB = 1), "Both", "One")


customer ProdA ProdB Type
--------------------------
CustA    1     1     Both
CustB    1     1     Both
CustC    1     0     One
CustD    1     0     One

Answer 1

Method 1:

We can do this with pd.crosstab, then take the row-wise sum of ProdA and ProdB and use Series.map to map 2 -> Both and 1 -> One:

import pandas as pd

# df is the question's DataFrame with columns item, customer, qty.
dfn = pd.crosstab(df['customer'], df['item']).reset_index()
dfn['Type'] = dfn[['ProdA', 'ProdB']].sum(axis=1).map({2:'Both', 1:'One'})

Alternatively, we can use np.where in the last line to assign Both or One conditionally:

dfn['Type'] = np.where(dfn['ProdA'].eq(1) & dfn['ProdB'].eq(1), 'Both', 'One')

item customer  ProdA  ProdB  Type
0       CustA      1      1  Both
1       CustB      1      1  Both
2       CustC      1      0   One
3       CustD      1      0   One
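
One caveat worth sketching: pd.crosstab counts rows per (customer, item) pair rather than summing qty, so a repeat purchase makes a count exceed 1 and the sum-then-map step can mislabel a customer. The variant below is a minimal sketch, assuming the question's columns plus one hypothetical repeat row for CustC; it counts how many products have a non-zero count instead of summing the counts. The names counts, n_products and result are illustrative only.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical repeat purchase
# (CustC buys ProdA twice) to show the edge case.
df = pd.DataFrame({
    "item": ["ProdA", "ProdA", "ProdA", "ProdA", "ProdB", "ProdB", "ProdA"],
    "customer": ["CustA", "CustB", "CustC", "CustD", "CustA", "CustB", "CustC"],
    "qty": [1, 1, 1, 1, 1, 1, 1],
})

# crosstab counts rows per (customer, item) pair; it does not sum qty.
counts = pd.crosstab(df["customer"], df["item"])

# Number of distinct products each customer bought at least once.
n_products = counts.gt(0).sum(axis=1)

result = counts.reset_index()
# to_numpy() sidesteps index alignment between n_products and result.
result["Type"] = n_products.map({2: "Both", 1: "One"}).to_numpy()
print(result)
```

Here CustC has a ProdA count of 2 and a ProdB count of 0, so a plain sum would give 2 and map to Both, while the non-zero count gives 1 and maps to One.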

Method 2:

We can also make fuller use of pd.crosstab through its margins=True parameter:

dfn = pd.crosstab(df['customer'], df['item'], 
                  margins=True, 
                  margins_name='Type').iloc[:-1].reset_index()

dfn['Type'] = dfn['Type'].map({2:'Both', 1:'One'})

item customer  ProdA  ProdB  Type
0       CustA      1      1  Both
1       CustB      1      1  Both
2       CustC      1      0   One
3       CustD      1      0   One
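
To see why the .iloc[:-1] step is needed, it may help to look at what margins=True produces before slicing. This is a small sketch on the question's data; the variable names with_margins, totals_row and row_totals are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "item": ["ProdA", "ProdA", "ProdA", "ProdA", "ProdB", "ProdB"],
    "customer": ["CustA", "CustB", "CustC", "CustD", "CustA", "CustB"],
    "qty": [1, 1, 1, 1, 1, 1],
})

with_margins = pd.crosstab(df["customer"], df["item"],
                           margins=True, margins_name="Type")

# margins=True appends a totals column AND a totals row, both named "Type".
totals_row = with_margins.iloc[-1]           # column totals (last row)
row_totals = with_margins["Type"].iloc[:-1]  # per-customer totals

print(with_margins)
```

The last row is a grand-total row rather than a customer, which is why it is dropped before the per-customer totals are mapped to labels.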

Answer 2

Try using set_index, unstack, and np.select:

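import numpy as np

df_out = df.set_index(['customer', 'item'])['qty'].unstack(fill_value=0)
SumProd = df_out['ProdA'] + df_out['ProdB']
# An explicit string default avoids mixing the string choices with
# np.select's integer default of 0.
df_out['Type'] = np.select([SumProd == 2, SumProd == 1], ['Both', 'One'], default='None')
print(df_out)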

Output:

item      ProdA  ProdB  Type
customer                    
CustA         1      1  Both
CustB         1      1  Both
CustC         1      0   One
CustD         1      0   One
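
Note that set_index followed by unstack raises a ValueError when the same (customer, item) pair appears more than once. If the real data can contain such repeats, a hedged alternative is pivot_table with aggfunc="sum", sketched below. The extra CustA row is hypothetical, added only to trigger the duplicate case, and wide and n_bought are illustrative names.

```python
import numpy as np
import pandas as pd

# Question's data plus a hypothetical duplicate row (CustA buys ProdA again).
df = pd.DataFrame({
    "item": ["ProdA", "ProdA", "ProdA", "ProdA", "ProdB", "ProdB", "ProdA"],
    "customer": ["CustA", "CustB", "CustC", "CustD", "CustA", "CustB", "CustA"],
    "qty": [1, 1, 1, 1, 1, 1, 3],
})

# pivot_table aggregates duplicate (customer, item) pairs instead of raising.
wide = df.pivot_table(index="customer", columns="item", values="qty",
                      aggfunc="sum", fill_value=0)

# Classify by how many products have a non-zero total quantity.
n_bought = (wide > 0).sum(axis=1)
wide["Type"] = np.select([n_bought == 2, n_bought == 1], ["Both", "One"],
                         default="None")
print(wide)
```

Classifying on the number of non-zero columns rather than on the quantity sum keeps the labels correct when qty is larger than 1.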

Answer 3

In addition to the other suggestions, you could also skip Pandas entirely:

################################################################################
## Data ingestion
################################################################################
import csv
import io

# Formatted to make the example more straightforward.
input_data = io.StringIO('''item,customer,qty
ProdA,CustA,1
ProdA,CustB,1
ProdA,CustC,1
ProdA,CustD,1
ProdB,CustA,1
ProdB,CustB,1
''')

records = []
reader = csv.DictReader(input_data)
for row in reader:
  records.append(row)

################################################################################
## Data transformation.
## Makes a Dict-of-Dicts. Each inner Dict contains all data for a single
## customer. 
################################################################################
products = {'ProdA', 'ProdB'}
customer_data = {}

for r in records:
  customer_id = r['customer']
  if customer_id not in customer_data:
    customer_data[customer_id] = {}
  customer_data[customer_id][r['item']] = int(r['qty'])

# Determines the customer type. 
for c in customer_data:
  c_data = customer_data[c]
  missing_product = products.difference(c_data.keys())
  matching_product = products.intersection(c_data.keys())
  if missing_product:
    for missing_p in missing_product:
      c_data[missing_p] = 0
    c_data['type'] = 'One'
  else:
    c_data['type'] = 'Both'

################################################################################
## Data display
################################################################################
for i, c in enumerate(customer_data):
  if i == 0:
    print('\t'.join(['ID'] + list(customer_data[c].keys())))
  print('\t'.join([c] + [str(x) for x in customer_data[c].values()]))

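For me, this prints the following (row and column order follow dict iteration order, so they may differ on your machine):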

ID      ProdA   type    ProdB
CustC   1       One     0
CustB   1       Both    1
CustA   1       Both    1
CustD   1       One     0
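
The same pure-Python idea can be sketched more compactly for Python 3 with collections.defaultdict. This is one possible rewrite, not the answer's original code: it assumes the product list is known up front (an unknown item raises KeyError), sums qty per customer and product, and fixes the column order so the output is deterministic. The names raw, totals and rows are illustrative.

```python
import csv
import io
from collections import defaultdict

raw = io.StringIO("""item,customer,qty
ProdA,CustA,1
ProdA,CustB,1
ProdA,CustC,1
ProdA,CustD,1
ProdB,CustA,1
ProdB,CustB,1
""")

products = ["ProdA", "ProdB"]  # assumed known; also fixes the column order

# customer -> {product -> total quantity}, every product starting at 0
totals = defaultdict(lambda: dict.fromkeys(products, 0))
for row in csv.DictReader(raw):
    totals[row["customer"]][row["item"]] += int(row["qty"])

rows = []
for customer in sorted(totals):
    # Count the products this customer bought at least once.
    bought = sum(1 for p in products if totals[customer][p] > 0)
    label = "Both" if bought == len(products) else "One"
    rows.append([customer] + [totals[customer][p] for p in products] + [label])

print("\t".join(["ID"] + products + ["type"]))
for r in rows:
    print("\t".join(str(x) for x in r))
```

Sorting the customers and iterating over the products list, rather than over dict keys, keeps the header and every row in the same column order.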

3 answers

Answer 1
Score 2, accepted, 2019-12-18 20:26:20

Answer 2
Score 2, 2019-12-18 20:28:31

Answer 3
Score 0, 2019-12-18 22:35:27
