
Reshaping data frame and counting values based on criteria

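I have the dataset below. I am trying to determine each customer's type by assigning a label. When I tried this in Excel, it crashed because there is too much data, so I am trying to do it in Python instead.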

item  customer qty
------------------
ProdA CustA    1 
ProdA CustB    1
ProdA CustC    1
ProdA CustD    1
ProdB CustA    1
ProdB CustB    1

In Excel, I would:

1. Create new columns "ProdA", "ProdB", "Type"
2. Remove duplicates for column "customer"
3. COUNTIF Customer = ProdA, COUNTIF customer = ProdB
4. IF(AND(ProdA = 1, ProdB = 1), "Both", "One")


customer ProdA ProdB Type
--------------------------
CustA    1     1     Both
CustB    1     1     Both
CustC    1     0     One
CustD    1     0     One
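
For anyone who wants to reproduce this, here is a minimal sketch that builds the sample table as a pandas DataFrame. The variable name `df` is an assumption made for illustration; the real data would presumably be loaded from a file instead.

```python
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    'item': ['ProdA', 'ProdA', 'ProdA', 'ProdA', 'ProdB', 'ProdB'],
    'customer': ['CustA', 'CustB', 'CustC', 'CustD', 'CustA', 'CustB'],
    'qty': [1, 1, 1, 1, 1, 1],
})
print(df)
```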

Method 1:

We can do this with pd.crosstab, then take the row-wise sum of ProdA and ProdB and use Series.map to translate 2 -> Both and 1 -> One:

import pandas as pd

dfn = pd.crosstab(df['customer'], df['item']).reset_index()
dfn['Type'] = dfn[['ProdA', 'ProdB']].sum(axis=1).map({2:'Both', 1:'One'})

Alternatively, we can use np.where in the last line to conditionally assign Both or One:

dfn['Type'] = np.where(dfn['ProdA'].eq(1) & dfn['ProdB'].eq(1), 'Both', 'One')

Output:

item customer  ProdA  ProdB  Type
0       CustA      1      1  Both
1       CustB      1      1  Both
2       CustC      1      0   One
3       CustD      1      0   One
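
One caveat worth checking: pd.crosstab counts rows, so a customer who appears twice with the same product gets a count of 2 for it, and the sum-based mapping can then produce NaN or a wrong label. Below is a minimal sketch of a variant that counts distinct products per customer instead, assuming that is the intent. The duplicated ProdA/CustC row and the name `n_products` are made up for illustration.

```python
import pandas as pd

# Sample data with one invented duplicate row (CustC buys ProdA twice).
df = pd.DataFrame({
    'item': ['ProdA', 'ProdA', 'ProdA', 'ProdA', 'ProdA', 'ProdB', 'ProdB'],
    'customer': ['CustA', 'CustB', 'CustC', 'CustC', 'CustD', 'CustA', 'CustB'],
    'qty': [1, 1, 1, 1, 1, 1, 1],
})

counts = pd.crosstab(df['customer'], df['item'])

# Number of distinct products each customer bought at least once.
n_products = counts.gt(0).sum(axis=1)

dfn = counts.reset_index()
# to_numpy() sidesteps index alignment, since dfn now has a fresh 0..n-1 index.
dfn['Type'] = n_products.map({2: 'Both', 1: 'One'}).to_numpy()
print(dfn)
```

With the plain sum, CustC would total 2 and be labelled Both even though only ProdA was bought.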

Method 2:

We can also lean on pd.crosstab a bit more by passing margins=True, which adds the per-customer total for us:

dfn = pd.crosstab(df['customer'], df['item'], 
                  margins=True, 
                  margins_name='Type').iloc[:-1].reset_index()

dfn['Type'] = dfn['Type'].map({2:'Both', 1:'One'})

Output:

item customer  ProdA  ProdB  Type
0       CustA      1      1  Both
1       CustB      1      1  Both
2       CustC      1      0   One
3       CustD      1      0   One
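
To see why .iloc[:-1] is needed, it can help to look at the table before it is trimmed. This is a small sketch with the sample data rebuilt inline (the name `full` is just for illustration). margins=True appends both a totals column and a totals row, each labelled with margins_name, and .iloc[:-1] drops the totals row.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['ProdA', 'ProdA', 'ProdA', 'ProdA', 'ProdB', 'ProdB'],
    'customer': ['CustA', 'CustB', 'CustC', 'CustD', 'CustA', 'CustB'],
    'qty': [1, 1, 1, 1, 1, 1],
})

# Untrimmed crosstab: the last row and last column are both named 'Type'.
full = pd.crosstab(df['customer'], df['item'],
                   margins=True, margins_name='Type')
print(full)

# The 'Type' column holds per-customer totals; the 'Type' row holds per-product totals.
per_customer = full['Type'].iloc[:-1]
print(per_customer)
```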

Try using set_index, unstack, and np.select:

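import numpy as np

df_out = df.set_index(['customer', 'item'])['qty'].unstack(fill_value=0)
SumProd = df_out['ProdA'] + df_out['ProdB']
# An explicit string default covers SumProd == 0; newer NumPy may reject string choices with the implicit default of 0.
df_out['Type'] = np.select([SumProd == 2, SumProd == 1], ['Both', 'One'], default='None')
print(df_out)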

Output:

item      ProdA  ProdB  Type
customer                    
CustA         1      1  Both
CustB         1      1  Both
CustC         1      0   One
CustD         1      0   One
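
Note that unstack raises a ValueError when the same (customer, item) pair occurs more than once, because the index then contains duplicate entries. Here is a hedged sketch of one workaround, assuming repeated purchases should simply be summed: pivot_table with aggfunc='sum'. The duplicated row and the name `bought` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Sample data with one invented duplicate (CustC buys ProdA twice).
df = pd.DataFrame({
    'item': ['ProdA', 'ProdA', 'ProdA', 'ProdA', 'ProdA', 'ProdB', 'ProdB'],
    'customer': ['CustA', 'CustB', 'CustC', 'CustC', 'CustD', 'CustA', 'CustB'],
    'qty': [1, 1, 1, 1, 1, 1, 1],
})

# pivot_table aggregates duplicate (customer, item) pairs instead of failing.
df_out = df.pivot_table(index='customer', columns='item', values='qty',
                        aggfunc='sum', fill_value=0)

# Count products with a positive quantity, so a qty above 1 does not skew the label.
bought = df_out.gt(0).sum(axis=1)
df_out['Type'] = np.select([bought == 2, bought == 1], ['Both', 'One'],
                           default='None')
print(df_out)
```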

Besides the other suggestions, you could also skip Pandas entirely:

################################################################################
## Data ingestion
################################################################################
import csv
import io

# Formatted to make the example more straightforward.
input_data = io.StringIO('''item,customer,qty
ProdA,CustA,1
ProdA,CustB,1
ProdA,CustC,1
ProdA,CustD,1
ProdB,CustA,1
ProdB,CustB,1
''')

records = []
reader = csv.DictReader(input_data)
for row in reader:
  records.append(row)

################################################################################
## Data transformation.
## Makes a Dict-of-Dicts. Each inner Dict contains all data for a single
## customer. 
################################################################################
products = {'ProdA', 'ProdB'}
customer_data = {}

for r in records:
  customer_id = r['customer']
  if customer_id not in customer_data:
    customer_data[customer_id] = {}
  customer_data[customer_id][r['item']] = int(r['qty'])

# Determines the customer type. 
for c in customer_data:
  c_data = customer_data[c]
  missing_product = products.difference(c_data.keys())
  matching_product = products.intersection(c_data.keys())
  if missing_product:
    for missing_p in missing_product:
      c_data[missing_p] = 0
    c_data['type'] = 'One'
  else:
    c_data['type'] = 'Both'

################################################################################
## Data display
################################################################################
for i, c in enumerate(customer_data):
  if i == 0:
    print('\t'.join(['ID'] + list(customer_data[c].keys())))
  print('\t'.join([c] + [str(x) for x in customer_data[c].values()]))

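For me, this prints the following (row and column order follow dict ordering, so they may differ between Python versions):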

ID      ProdA   type    ProdB
CustC   1       One     0
CustB   1       Both    1
CustA   1       Both    1
CustD   1       One     0
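
If a stable layout matters, one option is to print the columns in an explicit order rather than relying on dict ordering. Below is a minimal sketch, assuming Python 3. The `customer_data` literal is a hand-written stand-in for the dict built above, and the `columns` list is a hypothetical choice of order.

```python
# Hand-written stand-in for the dict-of-dicts produced by the transformation step.
customer_data = {
    'CustC': {'ProdA': 1, 'type': 'One', 'ProdB': 0},
    'CustB': {'ProdA': 1, 'type': 'Both', 'ProdB': 1},
    'CustA': {'ProdA': 1, 'type': 'Both', 'ProdB': 1},
    'CustD': {'ProdA': 1, 'type': 'One', 'ProdB': 0},
}

# Explicit column order, so the header and every row always line up.
columns = ['ProdA', 'ProdB', 'type']

lines = ['\t'.join(['ID'] + columns)]
for customer_id in sorted(customer_data):
    row = customer_data[customer_id]
    lines.append('\t'.join([customer_id] + [str(row[col]) for col in columns]))

print('\n'.join(lines))
```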

