
Using python pandas dataframe to rearrange continuous data log

As quite a newbie to pandas, I'm struggling with a data arrangement issue.

I've got a huge pile of data from a log file in a pandas dataframe with a structure like this:

day   user   measure1   measure2   ...
1     u1     xxxxx      yyyyy      ...
1     u2     xxxxx      yyyyy      ...
1     u3     xxxxx      yyyyy      ...
2     u2     xxxxx      yyyyy      ...
2     u4     xxxxx      yyyyy      ...
2     u3     xxxxx      yyyyy      ...
3     u1     xxxxx      yyyyy      ...
3     u3     xxxxx      yyyyy      ...
...   ...    ...        ...        ...

Hence, not every user appears on every day, and the data is sorted neither by day nor by user. However, if an entry occurs, it has all the measures.

Now I need to rearrange this data to obtain a 2D table "every user" vs. "every day" for each measure and fill the gaps with zeros, e.g.:

For measure1:                      For measure2:
      u1     u2     u3     u4            u1     u2     u3     u4
1  xxxxx  xxxxx  xxxxx      0      1  yyyyy  yyyyy  yyyyy      0  
2      0  xxxxx  xxxxx  xxxxx      2      0  yyyyy  yyyyy  yyyyy  
3  xxxxx      0  xxxxx      0      3  yyyyy      0  yyyyy      0  

How can I do this with pandas in Python 3? I'm also open to alternative solutions, e.g. using numpy instead of pandas.

So far I managed to extract arrays of all occurring users and days in the dataset but have no clue how to smartly assign the measured data.

I'm grateful for any help on this matter.

It seems like you want a multi-index dataframe (index1: day, index2: measure).

The tricky part is that you might need to transpose your dataframe before these operations. Have a look at the answers to the question "Constructing 3D Pandas DataFrame", which looks similar to yours.

Hope it helps
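
One possible reading of the multi-index suggestion, as a minimal sketch: the numeric values below are made up to stand in for the xxxxx / yyyyy placeholders, and the names wide, long and by_day_measure are only illustrative. Unstacking the whole frame gives one table with (measure, user) columns; melting first gives rows indexed by (day, measure).

```python
import pandas as pd

# Hypothetical sample data replacing the xxxxx / yyyyy placeholders.
df = pd.DataFrame({
    "day":      [1, 1, 1, 2, 2, 2, 3, 3],
    "user":     ["u1", "u2", "u3", "u2", "u4", "u3", "u1", "u3"],
    "measure1": [10, 11, 12, 13, 14, 15, 16, 17],
    "measure2": [20, 21, 22, 23, 24, 25, 26, 27],
})

# Variant A: all measures at once.
# Columns become a MultiIndex: level 0 = measure name, level 1 = user.
wide = df.set_index(["day", "user"]).unstack(fill_value=0)
measure1_table = wide["measure1"]  # plain day x user table for one measure

# Variant B: rows indexed by (day, measure), one column per user.
# melt turns the measure columns into a "measure" label plus a "value" column.
long = df.melt(id_vars=["day", "user"], var_name="measure", value_name="value")
by_day_measure = (
    long.set_index(["day", "measure", "user"])["value"]
        .unstack("user", fill_value=0)
)

print(wide)
print(by_day_measure)
```

Whether variant A or B is more convenient depends on how the tables are consumed afterwards; both fill missing (day, user) combinations with zeros.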

You need set_index and unstack:

df.set_index(['day','user']).measure1.unstack(fill_value=0)
Out[6]: 
user     u1     u2     u3     u4
day                             
1     xxxxx  xxxxx  xxxxx      0
2         0  xxxxx  xxxxx  xxxxx
3     xxxxx      0  xxxxx      0
df.set_index(['day','user']).measure2.unstack(fill_value=0)
Out[7]: 
user     u1     u2     u3     u4
day                             
1     yyyyy  yyyyy  yyyyy      0
2         0  yyyyy  yyyyy  yyyyy
3     yyyyy      0  yyyyy      0
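
For anyone who wants to run this end to end, here is a self-contained sketch of the same set_index / unstack approach. The numbers are invented sample values (the question only shows placeholders), and table1, table2 and table1_pivot are illustrative names. The pivot_table line is an alternative that is not part of the answer above; it may be useful if the log can contain the same (day, user) pair more than once, since unstack raises an error on duplicate index entries while pivot_table aggregates them (summing here is an assumption about what is wanted).

```python
import pandas as pd

# Hypothetical sample data replacing the xxxxx / yyyyy placeholders.
df = pd.DataFrame({
    "day":      [1, 1, 1, 2, 2, 2, 3, 3],
    "user":     ["u1", "u2", "u3", "u2", "u4", "u3", "u1", "u3"],
    "measure1": [10, 11, 12, 13, 14, 15, 16, 17],
    "measure2": [20, 21, 22, 23, 24, 25, 26, 27],
})

# Index by (day, user), then move the user level into the columns.
# Missing (day, user) combinations are filled with 0.
indexed = df.set_index(["day", "user"])
table1 = indexed["measure1"].unstack(fill_value=0)
table2 = indexed["measure2"].unstack(fill_value=0)

# Alternative for logs with repeated (day, user) pairs:
# pivot_table aggregates duplicates instead of raising.
table1_pivot = df.pivot_table(
    index="day", columns="user", values="measure1",
    aggfunc="sum", fill_value=0,
)

print(table1)
print(table2)
```

If a plain numpy array is needed afterwards, table1.to_numpy() returns the day x user values, with table1.index and table1.columns holding the matching day and user labels.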


 