
Converting nested list to pandas dataframe with column names


I have a nested list that looks something like this.

features = 
[['0:0.084556', '1:0.138594', '2:0.094304\n'],
 ['0:0.101468', '4:0.138594', '5:0.377215\n'],
 ['0:0.135290', '2:0.277187', '3:0.141456\n']
]

Each inner list is one row, with comma-separated entries. In each entry, the part to the left of the ":" is the column name and the part to the right is the value for that row.

I want to transform this to a pandas data frame to look like this:

  f_0000  |  f_0001  |  f_0002  |  f_0003  |  f_0004  | f_0005
---------------------------------------------------------------
 0.084556 | 0.138594 | 0.094304 | 0.000000 | 0.000000 | 0.000000
 0.101468 | 0.000000 | 0.000000 | 0.000000 | 0.138594 | 0.377215
 0.135290 | 0.000000 | 0.277187 | 0.141456 | 0.000000 | 0.000000

Can someone help me out on how to do this?

Original DF (it doesn't format correctly with pd.read_clipboard for me, however):

    ex_id   labels  features
0   0   446,521,1149,1249,1265,1482 0:0.084556 1:0.138594 2:0.094304 3:0.195764 4:...
1   1   78,80,85,86 0:0.050734 1:0.762265 2:0.754431 3:0.065255 4:...
2   2   457,577,579,640,939,1158    0:0.101468 1:0.138594 2:0.377215 3:0.130509 4:...
3   3   172,654,693,1704    0:0.186024 1:0.346484 2:0.141456 3:0.195764 4:...
4   4   403,508,1017,1052,1731,3183 0:0.135290 1:0.277187 2:0.141456 3:0.065255 4:...

I think the simplest approach is still plain for loops.

  1. First, collect all the keys from the given features.

    1. For each element, we use str.split and take the first part.
    2. Then, because we only want unique keys, we use set. Finally, we sort the keys using sorted, which also turns the set back into a list.

This first step is summed up in:

keys = sorted(set(elt.split(':')[0] for l in features for elt in l), key=int)  # key=int: numeric order, so '10' comes after '2'
  2. Create a dict from the keys above, initializing every key with an empty list:
data = {k:[] for k in keys}
  3. Iterate over all the features:

    1. Record the keys visited in the current row in a seen variable
    2. Append each value from the current row to the list of its key
    3. Append 0 for every key that is missing from the current row
  4. Finally, create the dataframe from our dict using the default constructor [ pd.DataFrame() ] ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html ).

  5. Format the column names using .columns and string formatting ( format ).
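
As a small illustration of the column-name formatting used in the last step, here is a minimal sketch. The list raw_columns is a made-up example (not taken from the question), chosen to show how the zero padding behaves for one- and two-digit keys:

```python
# Hypothetical column keys, as strings, the way str.split returns them
raw_columns = ["0", "1", "12"]

# "{:04d}" pads an integer with leading zeros to a total width of 4
formatted = ["f_{:04d}".format(int(c)) for c in raw_columns]
print(formatted)
# ['f_0000', 'f_0001', 'f_0012']
```

The int() conversion is needed because the d format code only accepts integers, not strings.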


Enough talk, here is the full code with the output of each step:

import pandas as pd

features = [["0:0.084556", "1:0.138594", "2:0.094304"],
    ["0:0.101468", "4:0.138594", "5:0.377215"],
    ["0:0.135290", "2:0.277187", "3:0.141456"]
    ]

# Step 1
# key=int sorts the keys numerically; a plain string sort would put '10' before '2'
keys = sorted(set(elt.split(':')[0] for l in features for elt in l), key=int)
print(keys)
# ['0', '1', '2', '3', '4', '5']

# Step 2
data = {k:[] for k in keys}
print(data)
# {'0': [], '1': [], '2': [], '3': [], '4': [], '5': []}

# Step 3
for sub in features:
    # Step 3.1
    seen = []
    # Step 3.2
    for l in sub:
        k2, v = l.split(":")        # Get key and value
        data[k2].append(float(v))   # Append current value to data
        seen.append(k2)             # Set the key as seen

    # Step 3.3
    for k in keys:                  # For all data keys
        if k not in seen:           # If not seen
            data[k].append(0)       # Add 0

print(data)
# {'0': [0.084556, 0.101468, 0.13529], 
#     '1': [0.138594, 0, 0], 
#     '2': [0.094304, 0,0.277187],
#     '3': [0, 0, 0.141456],
#     '4': [0, 0.138594, 0],
#     '5': [0, 0.377215, 0]
# }

# Step 4
df = pd.DataFrame(data)
print(df)
#           0         1         2         3         4         5
# 0  0.084556  0.138594  0.094304  0.000000  0.000000  0.000000
# 1  0.101468  0.000000  0.000000  0.000000  0.138594  0.377215
# 2  0.135290  0.000000  0.277187  0.141456  0.000000  0.000000

# Step 5
df.columns = ["f_{:04d}".format(int(val)) for val in df.columns]
print(df)
#      f_0000    f_0001    f_0002    f_0003    f_0004    f_0005
# 0  0.084556  0.138594  0.094304  0.000000  0.000000  0.000000
# 1  0.101468  0.000000  0.000000  0.000000  0.138594  0.377215
# 2  0.135290  0.000000  0.277187  0.141456  0.000000  0.000000
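
For comparison, the same result can probably be reached more compactly by letting pandas align the columns itself. This is only a sketch of an alternative, not the method described above: it builds one dict per row, passes the list of dicts to pd.DataFrame, and fills the gaps with 0. The variable name rows is my own choice.

```python
import pandas as pd

features = [["0:0.084556", "1:0.138594", "2:0.094304"],
            ["0:0.101468", "4:0.138594", "5:0.377215"],
            ["0:0.135290", "2:0.277187", "3:0.141456"]]

# One dict per row: integer column key -> float value
rows = [{int(k): float(v) for k, v in (elt.split(":") for elt in row)}
        for row in features]

# pandas takes the union of all keys as columns; missing entries become NaN
df = pd.DataFrame(rows)

# Put the columns in numeric order and replace the NaN gaps with 0
df = df.reindex(columns=sorted(df.columns)).fillna(0.0)

# Same column naming as in Step 5
df.columns = ["f_{:04d}".format(c) for c in df.columns]
print(df)
```

Converting the keys to int up front keeps the column order numeric even when there are ten or more features.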

Try this:

df = pd.DataFrame(data, columns=['Column name 1', 'column name 2'])
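
One caveat, as far as I understand the constructor: when data is a dict, the columns argument selects and orders existing keys rather than renaming them. The sketch below uses a small made-up dict in the same shape as the data dict from the other answer; the names subset and renamed are my own.

```python
import pandas as pd

# Small example dict, keyed by the raw column names
data = {"0": [0.084556, 0.101468],
        "1": [0.138594, 0.0],
        "2": [0.094304, 0.0]}

# With dict input, `columns` picks keys from the dict, in the order given
subset = pd.DataFrame(data, columns=["2", "0"])
print(list(subset.columns))
# ['2', '0']

# To give the columns new names, build the frame first and then rename
renamed = pd.DataFrame(data).rename(
    columns={"0": "f_0000", "1": "f_0001", "2": "f_0002"})
print(list(renamed.columns))
# ['f_0000', 'f_0001', 'f_0002']
```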
