
Optimized iteration over dataframe (rows)

I am trying to process a dataframe. This includes creating columns and updating their values based on the values in other columns. In this case I have a given payment type that I want to classify. It can fall under three categories: cash, deb_cred, gift_card. I want to add three new columns to the dataframe, each containing 1's or 0's depending on the payment type.

I am currently able to do this, it's just really slow (multiple hours on an AWS M4 instance for a dataset of ~70k rows, ~20 columns)...

Original column sample:

_id Payment tender types
1   debit
2   comptant
3   visa
4   mastercard
5   tim card
6   cash
7   gift

Desired output:

_id Payment tender types    pay_cash    pay_deb_cred    pay_gift
1   debit   0   1   0
2   comptant    1   0   0
3   visa    0   1   0
4   mastercard  0   1   0
5   tim card    0   0   1
6   cash    1   0   0
7   gift    0   0   1

My current code:
Note: data is a dataframe of shape (70000, 20) that has been loaded prior to this snippet

# For 'Payment tender types' we will use the following classes:
payment_cats = ['pay_cash', 'pay_deb_cred', 'pay_gift_card']
# [0, 0, 0] would imply 'other', hence no need for a fourth category

# note that certain types are just pieces of the name: e.g. master for "mastercard" and "master card"
types = ['debit', 'tim', 'cash', 'visa', 'amex', 'master',
     'digital', 'comptant', 'gift', 'débit']
cash_types = ['cash', 'comptant']
deb_cred_types = ['debit', 'visa', 'amex', 'master', 'digital', 'débit',
              'discover', 'bit', 'mobile']
gift_card_types = ['tim','gift']


# add new features to dataframe, initializing to nan
for cat in payment_cats:
    data[cat] = np.nan

for row in data.itertuples():
    # create series to hold the result per row e.g. [1, 0, 0] for `cash`
    cat = [0, 0, 0]
    index = row[0]
    # cast to string first, as some entries are numerical
    payment_type = str(row.paymenttendertypes).lower()
    if any(ct in payment_type for ct in cash_types):
        cat[0] = 1
    if any(dbt in payment_type for dbt in deb_cred_types):
        cat[1] = 1
    if any(gct in payment_type for gct in gift_card_types):
        cat[2] = 1
    # add series to payment_cat dataframe
    data.loc[index, payment_cats] = cat

I am using itertuples() as it proved faster than iterrows().

Is there a faster way of achieving the same functionality as above? Could this be done without iterating over the entire df?

NOTE: This is not just with regards to creating a one hot encoding. It boils down to updating the column values dependent on the value of another column. Another use case for example is if I have a certain location_id I want to update its respective longitude and latitude columns - based on that original id (without iterating in the way that I do above because it's really slow for large datasets).
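The location_id use case can be vectorized the same way: build a lookup table keyed by the id and use map (or a merge) instead of assigning row by row. A minimal sketch, assuming hypothetical column names location_id, latitude and longitude:

```python
import pandas as pd

# sample data; location_id, latitude, longitude are illustrative names
data = pd.DataFrame({'location_id': [10, 20, 10, 30]})

# lookup table: one row of coordinates per location_id
coords = pd.DataFrame(
    {'location_id': [10, 20, 30],
     'latitude': [45.5, 43.7, 49.3],
     'longitude': [-73.6, -79.4, -123.1]}
).set_index('location_id')

# vectorized: map each id to its coordinates, no Python-level loop
data['latitude'] = data['location_id'].map(coords['latitude'])
data['longitude'] = data['location_id'].map(coords['longitude'])
```

Series.map with an indexed Series does the per-row lookup in one pass, which scales far better than data.loc assignments inside a loop.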

I'm pretty sure all you need is something like:

targets = cash_types, deb_cred_types, gift_card_types
payments = data['Payment tender types'].astype(str).str.lower()
for col_name, words in zip(payment_cats, targets):
    # substring match, so e.g. 'master' also catches 'mastercard'
    data[col_name] = payments.str.contains('|'.join(words)).astype(int)

Note, your original code using itertuples is a bit odd because you keep indexing back into the dataframe just to recover the row you are already iterating over, e.g.

 str(data.loc[index, 'payment_tender_types']).lower()

This could just be row.Payment.lower() (or str(row.Payment).lower() if some entries are numeric).
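For reference, a self-contained version of the vectorized approach, using str.contains for substring matching (so 'master' still catches 'mastercard'), run against the sample data from the question:

```python
import pandas as pd

data = pd.DataFrame({
    '_id': [1, 2, 3, 4, 5, 6, 7],
    'Payment tender types': ['debit', 'comptant', 'visa', 'mastercard',
                             'tim card', 'cash', 'gift'],
})

payment_cats = ['pay_cash', 'pay_deb_cred', 'pay_gift']
cash_types = ['cash', 'comptant']
deb_cred_types = ['debit', 'visa', 'amex', 'master', 'digital', 'débit',
                  'discover', 'bit', 'mobile']
gift_card_types = ['tim', 'gift']

payments = data['Payment tender types'].astype(str).str.lower()
targets = [cash_types, deb_cred_types, gift_card_types]
for col_name, words in zip(payment_cats, targets):
    # one regex alternation per category: True wherever any keyword
    # appears as a substring of the payment type
    data[col_name] = payments.str.contains('|'.join(words)).astype(int)
```

This assumes none of the keywords contain regex metacharacters; if they might, wrap each word in re.escape before joining. The whole classification runs as three vectorized passes instead of 70k Python-level iterations.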
