Convert a list of strings to a pandas DataFrame based on the values in the strings

Question

I have a list of strings like the following:

input = ["number__128_alg__hello_min_n__7_max_n__9_full_seq__True_random_color__False_shuffle_shapes__False.pkl", "k__9_window__10_number__128_overlap__True_alg__hi_min_n__7_max_n__9_full_seq_embedding__False_random_color__False_shuffle_shapes__False.pkl", "k__9_window__10_number__128_overlap__True_alg__what_random_color__False_shuffle_shapes__False.pkl"]

The format of these strings is a parameter name followed by "__", then the parameter value. After the parameter value, there is a single _ before the next parameter name. Note that some parameter names themselves contain _ (such as "random_color"). Each string has a different set of parameters, but the sets overlap. I would therefore like to build a data frame with one column per parameter name and one row per element of the input list, holding the corresponding values. If a given string does not have a parameter, the data frame should contain NA, NaN, or any other placeholder in that cell.

How can this be done?

Thanks!

EDIT: If it cannot be done for the original list, what about:

input = ["number__128_alg__hello_min.n__7_max.n__9_full.seq__True_random.color__False_shuffle.shapes__False.pkl", "k__9_window__10_number__128_overlap__True_alg__hi_min.n__7_max.n__9_full.seq__False_random.color__False_shuffle.shapes__False.pkl", "k__9_window__10_number__128_overlap__True_alg__what_random.color__False_shuffle.shapes__False.pkl"]

Answer 1

This is possible if you assume that values cannot contain the _ character (and that you want to discard the trailing .pkl).

input = [
    "number__128_alg__hello_min_n__7_max_n__9_full_seq_embedding__True_random_color__False_shuffle_shapes__False.pkl", 
    "k__9_window__10_number__128_overlap__True_alg__hi_min_n__7_max_n__9_full_seq_embedding__False_random_color__False_shuffle_shapes__False.pkl", 
    "k__9_window__10_number__128_overlap__True_alg__what_random_color__False_shuffle_shapes__False.pkl"
]

A simple regular expression should do the trick:

import re
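# Strip the trailing ".pkl", then collect (name, value) pairs:
# a name starts with a non-underscore character, "__" separates it
# from its value, and the value runs until the next "_".
data = [dict(re.findall(r"([^_].*?)__([^_]+)", name[:-4])) for name in input]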
print(data)

Result:

[{'number': '128',
  'alg': 'hello',
  'min_n': '7',
  'max_n': '9',
  'full_seq_embedding': 'True',
  'random_color': 'False',
  'shuffle_shapes': 'False'},
 {'k': '9',
  'window': '10',
  'number': '128',
  'overlap': 'True',
  'alg': 'hi',
  'min_n': '7',
  'max_n': '9',
  'full_seq_embedding': 'False',
  'random_color': 'False',
  'shuffle_shapes': 'False'},
 {'k': '9',
  'window': '10',
  'number': '128',
  'overlap': 'True',
  'alg': 'what',
  'random_color': 'False',
  'shuffle_shapes': 'False'}]
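
To make the "no _ in values" assumption concrete, here is a minimal sketch that wraps the same pattern in a small helper and runs it on three shortened strings. The helper name parse_name and the shortened file names are made up for illustration; they are not part of the original post. The second call uses the dotted naming from the question's EDIT, and the third shows what goes wrong when a value does contain _.

```python
import re

# Same pattern as above: a name that starts with a non-underscore,
# then "__", then a value made only of non-underscore characters.
PATTERN = re.compile(r"([^_].*?)__([^_]+)")

def parse_name(filename):
    """Parse 'name__value_name__value.pkl' into a dict (hypothetical helper)."""
    stem = filename[:-4] if filename.endswith(".pkl") else filename
    return dict(PATTERN.findall(stem))

# Underscores inside parameter names are handled.
ok = parse_name("alg__hello_min_n__7.pkl")

# The dotted names from the question's EDIT parse the same way.
dotted = parse_name("alg__hi_min.n__7_full.seq__False.pkl")

# A value containing "_" breaks the assumption: the value is cut at the
# first "_" and the remainder leaks into the next parameter name.
bad = parse_name("alg__hello_world_min_n__7.pkl")

print(ok)
print(dotted)
print(bad)
```

In the last case the value is truncated to 'hello' and the next key becomes 'world_min_n', so the pattern is only safe when values never contain _.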

As a DataFrame:

import pandas as pd
pd.DataFrame(data)
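
The question also asks that missing parameters show up as NA or NaN. A small sketch of that behaviour, using two shortened file names invented for illustration (not the original list): pd.DataFrame fills in NaN wherever a row's dict lacks a key. All parsed values are strings, so the last step, converting one column with pd.to_numeric, is an optional extra and not part of the original answer.

```python
import re
import pandas as pd

# Shortened, made-up file names in the same format as the question.
names = [
    "number__128_alg__hello_min_n__7.pkl",
    "k__9_number__128_alg__what.pkl",
]

# One dict per file name, using the same regular expression as above.
rows = [dict(re.findall(r"([^_].*?)__([^_]+)", name[:-4])) for name in names]

# Each distinct parameter name becomes a column; a parameter that is
# absent from a string becomes NaN in that row.
df = pd.DataFrame(rows)

# Optional: parsed values are strings, so convert numeric columns if needed.
df["number"] = pd.to_numeric(df["number"])

print(df)
print(df.isna())
```

Here the first row has NaN under k and the second row has NaN under min_n, which is the layout the question asks for.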

Answer 1
2 · Accepted · 2018-05-24 17:21:08