Converting nested list to pandas dataframe with column names

I have a nested list that looks something like this:

features = [
    ['0:0.084556', '1:0.138594', '2:0.094304\n'],
    ['0:0.101468', '4:0.138594', '5:0.377215\n'],
    ['0:0.135290', '2:0.277187', '3:0.141456\n'],
]

Each list within the nested list is one row, and its items are the comma-separated entries of that row. In each item, the left side of the ":" is the column name and the right side is the value for that row.

I want to transform this into a pandas data frame that looks like this:

  f_0000  |  f_0001  |  f_0002  |  f_0003  |  f_0004  | f_0005
---------------------------------------------------------------
 0.084556 | 0.138594 | 0.094304 | 0.000000 | 0.000000 | 0.000000
 0.101468 | 0.000000 | 0.000000 | 0.000000 | 0.138594 | 0.377215
 0.135290 | 0.000000 | 0.277187 | 0.141456 | 0.000000 | 0.000000

Can someone help me out on how to do this?

Original DF (it doesn't format correctly with pd.read_clipboard for me, however):

    ex_id   labels  features
0   0   446,521,1149,1249,1265,1482 0:0.084556 1:0.138594 2:0.094304 3:0.195764 4:...
1   1   78,80,85,86 0:0.050734 1:0.762265 2:0.754431 3:0.065255 4:...
2   2   457,577,579,640,939,1158    0:0.101468 1:0.138594 2:0.377215 3:0.130509 4:...
3   3   172,654,693,1704    0:0.186024 1:0.346484 2:0.141456 3:0.195764 4:...
4   4   403,508,1017,1052,1731,3183 0:0.135290 1:0.277187 2:0.141456 3:0.065255 4:...

I think the simplest approach is still plain for loops.

  1. First, collect all keys from the given features.

    1. For every element, we use str.split and take the first part.
    2. Because we only want unique keys, we use set. Then we sort the keys with sorted, which also returns a list. Passing key=int keeps the order numeric, so that '10' comes after '9' rather than after '1'.

The first step sums up to:

keys = sorted({elt.split(':')[0] for l in features for elt in l}, key=int)

  2. Create a dict from the above keys and initialize every key with an empty list:

data = {k: [] for k in keys}

  3. Iterate over all the features:

    1. Record the keys visited in the current row in a seen variable.
    2. Append each value of the row to the list of its key.
    3. Append 0 for every key that does not appear in the current row.

  4. Finally, create the dataframe from our dict using the default constructor pd.DataFrame() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

  5. Format the column names using .columns and string formatting (format).


Enough talk; here is the full code with the intermediate output:

import pandas as pd

features = [["0:0.084556", "1:0.138594", "2:0.094304"],
    ["0:0.101468", "4:0.138594", "5:0.377215"],
    ["0:0.135290", "2:0.277187", "3:0.141456"]
    ]

# Step 1: unique keys, sorted numerically (key=int avoids '10' sorting before '2')
keys = sorted({elt.split(':')[0] for l in features for elt in l}, key=int)
print(keys)
# ['0', '1', '2', '3', '4', '5']

# Step 2
data = {k:[] for k in keys}
print(data)
# {'0': [], '1': [], '2': [], '3': [], '4': [], '5': []}

# Step 3
for sub in features:
    # Step 3.1
    seen = []
    # Step 3.2
    for l in sub:
        k2, v = l.split(":")        # Get key and value
        data[k2].append(float(v))   # Append current value to data
        seen.append(k2)             # Set the key as seen

    # Step 3.3
    for k in keys:                  # For all data keys
        if k not in seen:           # If not seen
            data[k].append(0)       # Add 0

print(data)
# {'0': [0.084556, 0.101468, 0.13529], 
#     '1': [0.138594, 0, 0], 
#     '2': [0.094304, 0,0.277187],
#     '3': [0, 0, 0.141456],
#     '4': [0, 0.138594, 0],
#     '5': [0, 0.377215, 0]
# }

# Step 4
df = pd.DataFrame(data)
print(df)
#           0         1         2         3         4         5
# 0  0.084556  0.138594  0.094304  0.000000  0.000000  0.000000
# 1  0.101468  0.000000  0.000000  0.000000  0.138594  0.377215
# 2  0.135290  0.000000  0.277187  0.141456  0.000000  0.000000

# Step 5
df.columns = ["f_{:04d}".format(int(val)) for val in df.columns]
print(df)
#      f_0000    f_0001    f_0002    f_0003    f_0004    f_0005
# 0  0.084556  0.138594  0.094304  0.000000  0.000000  0.000000
# 1  0.101468  0.000000  0.000000  0.000000  0.138594  0.377215
# 2  0.135290  0.000000  0.277187  0.141456  0.000000  0.000000
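
As a possible shortcut, the same steps can be sketched more compactly by building one dict per row and letting pandas align the columns. This is only an alternative sketch, not part of the code above; the names records and wide are made up for illustration, and it assumes every item has the form "index:value" with an integer index.

```python
import pandas as pd

features = [
    ["0:0.084556", "1:0.138594", "2:0.094304\n"],
    ["0:0.101468", "4:0.138594", "5:0.377215\n"],
    ["0:0.135290", "2:0.277187", "3:0.141456\n"],
]

# One dict per row: {column index: value}. float() tolerates the trailing "\n".
records = [
    {int(k): float(v) for k, v in (item.split(":") for item in row)}
    for row in features
]

# Keys missing from a row become NaN; sort the columns numerically, then fill with 0.
wide = pd.DataFrame(records).sort_index(axis=1).fillna(0.0)

# Same column naming as in Step 5.
wide.columns = ["f_{:04d}".format(col) for col in wide.columns]
print(wide)
```

Because the keys are converted to int before the frame is built, the column order stays numeric even when there are more than ten features.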

Try this:

df = pd.DataFrame(data, columns=['Column name 1', 'Column name 2'])
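
For context, here is a small sketch of what the columns argument of pd.DataFrame does. The sample rows and the names named and by_key are assumptions made up for illustration. With a list of rows, columns labels the positions; with a dict, it selects and orders existing keys rather than renaming them.

```python
import pandas as pd

# Hypothetical rows that already hold one plain value per column.
rows = [[0.084556, 0.138594], [0.101468, 0.0]]
named = pd.DataFrame(rows, columns=["f_0000", "f_0001"])

# With a dict, `columns` picks existing keys (here in reversed order).
by_key = pd.DataFrame(
    {"0": [0.084556, 0.101468], "1": [0.138594, 0.0]},
    columns=["1", "0"],
)

print(named)
print(by_key)
```

Note that the constructor does not parse "key:value" strings on its own, so the nested list from the question still has to be split into keys and values first.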
