I am trying to process a dataframe. This includes creating columns and updating their values based on the values in other columns. In this case I have a given payment_type that I want to classify. It can fall under three categories: cash, deb_cred, gift_card. I want to add three new columns to the dataframe that consist of 1's or 0's based on the given parameter.
I am currently able to do this, but it is really slow (multiple hours on an AWS M4 instance for a dataset of ~70k rows and ~20 columns).
Original column sample:
_id  Payment tender types
1    debit
2    comptant
3    visa
4    mastercard
5    tim card
6    cash
7    gift
Desired output:
_id  Payment tender types  pay_cash  pay_deb_cred  pay_gift
1    debit                 0         1             0
2    comptant              1         0             0
3    visa                  0         1             0
4    mastercard            0         1             0
5    tim card              0         0             1
6    cash                  1         0             0
7    gift                  0         0             1
My current code:
Note: data is a dataframe of shape (70000, 20) that has been loaded prior to this snippet.
import numpy as np

# For 'Payment tender types' we will use the following classes:
payment_cats = ['pay_cash', 'pay_deb_cred', 'pay_gift_card']
# [0, 0, 0] would imply 'other', hence no need for a fourth category
# note that certain types are just pieces of the name: e.g. master for "mastercard" and "master card"
types = ['debit', 'tim', 'cash', 'visa', 'amex', 'master',
         'digital', 'comptant', 'gift', 'débit']
cash_types = ['cash', 'comptant']
deb_cred_types = ['debit', 'visa', 'amex', 'master', 'digital', 'débit',
                  'discover', 'bit', 'mobile']
gift_card_types = ['tim', 'gift']

# add new features to dataframe, initializing to nan
for cat in payment_cats:
    data[cat] = np.nan

for row in data.itertuples():
    # create list to hold the result per row e.g. [1, 0, 0] for `cash`
    cat = [0, 0, 0]
    index = row[0]
    # to string as some entries are numerical
    payment_type = str(row.paymenttendertypes).lower()
    if any(ct in payment_type for ct in cash_types):
        cat[0] = 1
    if any(dbt in payment_type for dbt in deb_cred_types):
        cat[1] = 1
    if any(gct in payment_type for gct in gift_card_types):
        cat[2] = 1
    # write the result into the payment_cats columns for this row
    data.loc[index, payment_cats] = cat
I am using itertuples() as it proved faster than iterrows().
Is there a faster way of achieving the same functionality as above? Could this be done without iterating over the entire df?
NOTE: This is not just about creating a one-hot encoding. It boils down to updating column values depending on the value of another column. Another use case, for example: given a certain location_id, I want to update its respective longitude and latitude columns based on that id (without iterating the way I do above, because that is really slow for large datasets).
I'm pretty sure all you need is something like:
targets = cash_types, deb_cred_types, gift_card_types
payments = data.Payment.str.lower()
for col_name, words in zip(payment_cats, targets):
    data[col_name] = payments.isin(words)
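One caveat: isin tests for exact equality, while the question matches keywords as substrings (e.g. master inside "master card"), and it returns booleans rather than 1's and 0's. Here is a minimal sketch of a substring-based variant. It assumes the column is named 'Payment tender types' as in the sample, and it runs on a small made-up frame. Using str.contains with a regex alternation is one option here, not necessarily the fastest one.

```python
import re
import pandas as pd

# Toy data mirroring the sample in the question (an assumption for illustration).
data = pd.DataFrame({
    "Payment tender types": ["debit", "comptant", "visa", "mastercard",
                             "tim card", "cash", "gift"],
})

# Keyword lists taken from the question, keyed by the target column name.
keywords = {
    "pay_cash": ["cash", "comptant"],
    "pay_deb_cred": ["debit", "visa", "amex", "master", "digital", "débit",
                     "discover", "bit", "mobile"],
    "pay_gift_card": ["tim", "gift"],
}

# Cast to str first because some entries may be numeric, then lowercase once.
payments = data["Payment tender types"].astype(str).str.lower()

for col_name, words in keywords.items():
    # One alternation pattern per category; re.escape guards against
    # keywords that contain regex metacharacters.
    pattern = "|".join(re.escape(w) for w in words)
    # str.contains yields booleans; astype(int) turns them into 1/0.
    data[col_name] = payments.str.contains(pattern, regex=True).astype(int)

print(data)
```

Each category costs one pass over the column instead of one .loc write per row, which is where most of the time in the row-by-row loop is likely to go.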
Note, your original code using itertuples is sort of strange because you keep indexing back into your data-frame just to recover the row you are already iterating over, e.g.
str(data.loc[index, 'payment_tender_types']).lower()
This could just be row.Payment.lower()
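For the other use case in the question (filling latitude and longitude from a location_id), a lookup table plus Series.map avoids the loop as well. This is a rough sketch only: the coords table, its values, and the column names are hypothetical and chosen purely for illustration.

```python
import pandas as pd

# Hypothetical data: ids and coordinates are made up for illustration.
data = pd.DataFrame({"location_id": [10, 20, 10, 30]})

# Hypothetical lookup table indexed by location_id.
coords = pd.DataFrame(
    {"latitude": [45.50, 43.65], "longitude": [-73.57, -79.38]},
    index=pd.Index([10, 20], name="location_id"),
)

# map() looks up each row's id in the indexed Series in one vectorized pass;
# ids missing from the lookup table (30 here) come back as NaN.
data["latitude"] = data["location_id"].map(coords["latitude"])
data["longitude"] = data["location_id"].map(coords["longitude"])

print(data)
```

If many columns need to be pulled in at once, a merge on location_id would be the more usual choice; map keeps the original row order and index untouched, which is convenient for one or two columns.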