Pandas long format DataFrame from multiple lists of different length

Suppose I have multiple lists:

A = [1, 2, 3]
B = [1, 4]

and I want to generate a Pandas DataFrame in long format as follows:

type | value
------------
A    | 1
A    | 2
A    | 3
B    | 1
B    | 4

What is the easiest way to achieve this? Going through the wide format and then using melt does not seem possible(?), because the lists may have different lengths.

Create a dictionary that maps each type to its list, then build a list of tuples with a list comprehension:

import pandas as pd

A = [1, 2, 3]
B = [1, 4]

# Map each type label to its list
d = {'A': A, 'B': B}

# One (type, value) tuple per list element
print([(k, y) for k, v in d.items() for y in v])
[('A', 1), ('A', 2), ('A', 3), ('B', 1), ('B', 4)]

df = pd.DataFrame([(k, y) for k, v in d.items() for y in v], columns=['type', 'value'])
print(df)
  type  value
0    A      1
1    A      2
2    A      3
3    B      1
4    B      4
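
If the comprehension is needed in more than one place, it could be wrapped in a small helper. This is only a sketch: the function name to_long and its parameters are made up for illustration and are not part of pandas.

```python
import pandas as pd


def to_long(groups, type_name="type", value_name="value"):
    """Flatten a dict of lists into a long-format DataFrame (hypothetical helper)."""
    # One (label, item) row per list element; the lists may differ in length.
    rows = [(label, item) for label, items in groups.items() for item in items]
    return pd.DataFrame(rows, columns=[type_name, value_name])


df = to_long({"A": [1, 2, 3], "B": [1, 4]})
print(df)
```

An empty list simply contributes no rows, so no special handling is needed for it.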

Another solution, if the input is a list of lists and the types should be integers:

L = [A, B]
# enumerate() supplies the integer type for each list
df = pd.DataFrame([(k, y) for k, v in enumerate(L) for y in v], columns=['type', 'value'])
print(df)
   type  value
0     0      1
1     0      2
2     0      3
3     1      1
4     1      4

Here's a NumPy-based solution using a dictionary input:

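import numpy as np
import pandas as pd

d = {'A': [1, 2, 3],
     'B': [1, 4]}

# Split the dictionary into a tuple of keys and a tuple of lists
keys, values = zip(*d.items())

# Repeat each key once per element of its list, and join the lists end to end
res = pd.DataFrame({'type': np.repeat(keys, list(map(len, values))),
                    'value': np.concatenate(values)})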

print(res)

  type  value
0    A      1
1    A      2
2    A      3
3    B      1
4    B      4
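
To see why this works, it can help to print the intermediate pieces. The sketch below reuses the same d; the variable names lengths, types and vals are illustrative only.

```python
import numpy as np

d = {'A': [1, 2, 3],
     'B': [1, 4]}

keys, values = zip(*d.items())

# Number of elements in each list: one repeat count per key.
lengths = [len(v) for v in values]

# Each key is repeated as many times as its list has elements.
types = np.repeat(keys, lengths)

# All lists joined end to end, in the same key order.
vals = np.concatenate(values)

print(lengths)
print(types.tolist())
print(vals.tolist())
```

Note that np.concatenate returns a single array, so lists holding mixed types (for example ints and floats) end up with one common dtype.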

This approach borrows an idea from the R packages dplyr and tidyr. The code below is only a demo, so I created two DataFrames, df1 and df2; you can create the DataFrames dynamically and concatenate them:

import pandas as pd


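def gather(df, key, value, cols):
    """Melt the columns in cols into key/value pairs, like gather in R."""
    # Columns that are not gathered are kept as identifier columns
    id_vars = [col for col in df.columns if col not in cols]
    return pd.melt(df, id_vars=id_vars, value_vars=list(cols),
                   var_name=key, value_name=value)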


df1 = pd.DataFrame({'A': [1, 2, 3]})

df2 = pd.DataFrame({'B': [1, 4]})

df_messy = pd.concat([df1, df2], axis=1)

print(df_messy)

df_tidy = gather(df_messy, 'type', 'value', df_messy.columns).dropna()

print(df_tidy)

Output for df_messy:

   A    B
0  1  1.0
1  2  4.0
2  3  NaN

Output for df_tidy:

  type  value
0    A    1.0
1    A    2.0
2    A    3.0
3    B    1.0
4    B    4.0

PS: Remember to convert the values from float back to int. This is only a demo, so I did not pay much attention to such details.
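
The conversion mentioned in the PS could look like the following sketch. It assumes every remaining value is a whole number once the NaN rows have been dropped; the reset_index call is optional and only renumbers the rows.

```python
import pandas as pd

df_messy = pd.concat([pd.DataFrame({'A': [1, 2, 3]}),
                      pd.DataFrame({'B': [1, 4]})], axis=1)

# Melt to long format, drop the padding NaN, then restore the integer dtype.
df_tidy = (df_messy.melt(var_name='type', value_name='value')
                   .dropna()
                   .astype({'value': int})
                   .reset_index(drop=True))

print(df_tidy)
```

The cast has to come after dropna(), because NaN cannot be converted to a plain integer.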
