
How to unstack a pandas DataFrame with two sets of variables

I have a table that looks like this. Read from a CSV file, so no levels, no fancy indices, etc.

ID  date1      amount1    date2        amount2
x   15/1/2015   100        15/1/2016   80

The actual file I have goes up to date5 and amount5. How can I convert it to:

ID  date       amount
x   15/1/2015  100
x   15/1/2016   80

If I only had one variable, I would use pandas.melt(), but with two variables I really don't know how to do it quickly.

I could do it manually by exporting to an in-memory sqlite3 database and doing a union. Unions are more annoying in pandas because, unlike SQL, it requires all field names to match. So in pandas I would have to create a temporary dataframe for each pair (date1/amount1, date2/amount2, and so on), rename its fields to just date and amount, and only then call pandas.concat.
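A sketch of that renaming approach, assuming the columns follow the date1/amount1 naming from the sample above:

```python
import pandas as pd

# Sample frame matching the question (dates kept as strings for simplicity).
df = pd.DataFrame({'ID': ['x'],
                   'date1': ['15/1/2015'], 'amount1': [100],
                   'date2': ['15/1/2016'], 'amount2': [80]})

# One temporary frame per (dateN, amountN) pair, each renamed to the
# common column names, then concatenated into a single long frame.
pieces = []
for i in range(1, 3):  # use range(1, 6) for date1..date5
    piece = df[['ID', f'date{i}', f'amount{i}']].rename(
        columns={f'date{i}': 'date', f'amount{i}': 'amount'})
    pieces.append(piece)
result = pd.concat(pieces, ignore_index=True)
```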

Any suggestions? Thanks!

Here is one way:

>>> pandas.concat(
...     [pandas.melt(x, id_vars='ID', value_vars=x.columns[1::2].tolist(), value_name='date'),
...      pandas.melt(x, value_vars=x.columns[2::2].tolist(), value_name='amount')
...     ],
...     axis=1
... ).drop('variable', axis=1)
  ID       date  amount
0  x  15/1/2015     100
1  x  15/1/2016      80

The idea is to do two melts, one for each set of columns, and then concatenate them. This assumes the two kinds of columns alternate, so that columns[1::2] and columns[2::2] select them correctly. If they don't, you would have to modify that part to select the columns you want.
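If the columns don't alternate, one way to make the selection order-independent is to pick each group by name prefix instead of by position (a sketch, assuming the date1/amount1 naming from the question):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['x'],
                   'date1': ['15/1/2015'], 'amount1': [100],
                   'date2': ['15/1/2016'], 'amount2': [80]})

# Select the two column groups by name prefix, so the physical column
# order in the file no longer matters.
date_cols = [c for c in df.columns if c.startswith('date')]
amount_cols = [c for c in df.columns if c.startswith('amount')]

result = pd.concat(
    [pd.melt(df, id_vars='ID', value_vars=date_cols, value_name='date'),
     pd.melt(df, value_vars=amount_cols, value_name='amount')],
    axis=1).drop('variable', axis=1)  # drops both 'variable' columns
```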

You can also do it with the little-known lreshape:

>>> pandas.lreshape(x, {'date': x.columns[1::2], 'amount': x.columns[2::2]})
  ID       date  amount
0  x  15/1/2015     100
1  x  15/1/2016      80

However, lreshape is not really documented, and it is not clear whether it is meant for public use.
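A documented alternative that handles exactly this "stub plus numeric suffix" naming pattern is pandas.wide_to_long (a sketch, assuming the columns are named date1..dateN, amount1..amountN):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['x'],
                   'date1': ['15/1/2015'], 'amount1': [100],
                   'date2': ['15/1/2016'], 'amount2': [80]})

# wide_to_long splits each column name into a stub ('date', 'amount') and
# a numeric suffix, turning the suffix into an index level called 'num'.
long_df = pd.wide_to_long(df, stubnames=['date', 'amount'], i='ID', j='num')
result = long_df.reset_index().drop('num', axis=1)
```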

If the columns always repeat in pairs, a simple trick gives the solution you want.

The trick is to build a list of lists of the columns that go together, then loop over that list, appending as you go. It does call pd.DataFrame() on each iteration of the loop; I haven't found a way to avoid that yet. But it works the way you would expect, and for a small file the run time should not be a problem.

In [1]: column_pairs = [['date1', 'amount1'], ['date2', 'amount2'], ...]

In [2]: # DataFrame.append was removed in pandas 2.0, so collect the
        # pieces in a list and concatenate once at the end instead.
        pieces = [pd.DataFrame(df.loc[:, cols].values, columns=['date', 'amount'])
                  for cols in column_pairs]
        df_clean = pd.concat(pieces, ignore_index=True)
        df_clean
Out[2]:         date  amount
        0  15/1/2015     100
        1  15/1/2016      80

The neat thing about this is that the loop runs once per column pair, not once per row. So if you have 5 column pairs, each with 'n' rows under it, the loop only executes 5 times, and each iteration appends all 'n' rows under that pair to the clean DataFrame. You can then drop any NaN values, sort by date, or do whatever else you need with the clean DF.

What do you think, does this beat creating an in-memory sqlite3 database?
