List of lists to dictionary to pandas DataFrame
I am trying to fit this data:
[['Manufacturer: Hyundai',
'Model: Tucson',
'Mileage: 258000 km',
'Registered: 07/2019'],
['Manufacturer: Mazda',
'Model: 6',
'Year: 2014',
'Registered: 07/2019']]
into a pandas DataFrame.
Not all labels are present in each record; for example, some records have 'Mileage' and others don't. I have a total of 26 features, and very few items have all of them.
I would like to construct a pandas DataFrame that holds the features as columns; if a feature doesn't exist for a record, its value should be NaN.
I have:
colnames=['Manufacturer', 'Model', 'Mileage', 'Registered', 'Year'...(all 26 features here)]
df = pd.read_csv("./data/output.csv", sep=",", names=colnames, header=None)
The first few mandatory columns give the expected output, but once optional features are involved, the missing data causes every feature after the gap to land under the wrong column. Records are mapped correctly only if all features are present.
I forgot to mention that some features with a missing value also lack the ":" but are still present in the list. In both cases (a label with no value, and a label without ":"), the assigned value should be NaN.
Use a nested list comprehension to build a list of dictionaries and pass it to the DataFrame constructor; if a key is missing from a record, NaN is added:
import pandas as pd

L = [['Manufacturer: Hyundai',
      'Model: Tucson',
      'Mileage: 258000 km',
      'Registered: 07/2019'],
     ['Manufacturer: Mazda',
      'Model: 6',
      'Year: 2014',
      'Registered: 07/2019']]

# split on the first ': ' only, so values carry no leading space
df = pd.DataFrame([dict(y.split(': ', 1) for y in x) for x in L])
print (df)
  Manufacturer    Mileage   Model Registered  Year
0      Hyundai  258000 km  Tucson    07/2019   NaN
1        Mazda        NaN       6    07/2019  2014
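The question also mentions items that have no value or lack the ":" altogether. Passing a one-element split result to dict() raises a ValueError, so those items need separate handling. Below is a minimal sketch, assuming such items should simply become NaN; the helper name parse_record and the sample items 'Color' and 'Year:' are made up for illustration.

```python
import numpy as np
import pandas as pd


def parse_record(items):
    """Turn ['Label: value', ...] into a dict; missing values become NaN."""
    record = {}
    for item in items:
        # partition never raises: sep and value are '' when ':' is absent
        label, sep, value = item.partition(':')
        value = value.strip()
        record[label.strip()] = value if sep and value else np.nan
    return record


L = [['Manufacturer: Hyundai', 'Model: Tucson', 'Color'],
     ['Manufacturer: Mazda', 'Model: 6', 'Year:']]

df = pd.DataFrame([parse_record(x) for x in L])
print(df)
```

Because partition splits on the first ':' only, a value that itself contains a colon is kept whole.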
EDIT: You can use .split(maxsplit=1) to split on the first whitespace only:
L = [['Manufacturer Hyundai',
'Model Tucson',
'Mileage 258000 km',
'Registered 07/2019'],
['Manufacturer Mazda',
'Model 6',
'Year 2014',
'Registered 07/2019']]
df = pd.DataFrame([dict(y.split(maxsplit=1) for y in x) for x in L])
print (df)
  Manufacturer    Mileage   Model Registered  Year
0      Hyundai  258000 km  Tucson    07/2019   NaN
1        Mazda        NaN       6    07/2019  2014
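The DataFrame constructor only creates columns for labels that occur in at least one record. If every expected feature should get a column in a fixed order, as with colnames in the question, the result can be reindexed afterwards. This is a sketch with a shortened colnames list; the 'Color' entry is a hypothetical feature standing in for labels that never appear in the data.

```python
import pandas as pd

# Shortened stand-in for the full list of 26 features; 'Color' is hypothetical.
colnames = ['Manufacturer', 'Model', 'Mileage', 'Registered', 'Year', 'Color']

L = [['Manufacturer Hyundai', 'Model Tucson', 'Mileage 258000 km', 'Registered 07/2019'],
     ['Manufacturer Mazda', 'Model 6', 'Year 2014', 'Registered 07/2019']]

df = pd.DataFrame([dict(y.split(maxsplit=1) for y in x) for x in L])

# reindex adds absent columns filled with NaN and enforces the column order
df = df.reindex(columns=colnames)
print(df)
```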
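EDIT: For labels made of two words, such as 'Additional equipment', list them in words2 and join the two words with an underscore: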
L = [['Manufacturer Hyundai',
'Model Tucson',
'Mileage 258000 km',
'Registered 07/2019'],
['Manufacturer Mazda',
'Model 6',
'Year 2014',
'Registered 07/2019',
'Additional equipment aaa']]
words2 = ['Additional equipment']   # labels made of two words

L1 = []
for x in L:
    di = {}
    for y in x:
        for word in words2:
            # both words of the label occur in the item: join them with '_'
            if set(word.split(maxsplit=2)[:2]) < set(y.split()):
                i, j, k = y.split(maxsplit=2)
                di['_'.join([i, j])] = k
            # otherwise split on the first whitespace only
            else:
                i, j = y.split(maxsplit=1)
                di[i] = j
    L1.append(di)

df = pd.DataFrame(L1)
print (df)
  Additional_equipment Manufacturer    Mileage   Model Registered  Year
0                  NaN      Hyundai  258000 km  Tucson    07/2019   NaN
1                  aaa        Mazda        NaN       6    07/2019  2014
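If the full set of labels is known in advance, as the 26 features in the question suggest, matching each item against that list may be simpler than comparing word sets. The sketch below is one possible variant, not the method used above: the labels list is shortened, split_item is a made-up helper name, items are assumed to be non-empty strings, and column names keep their space instead of getting an underscore.

```python
import numpy as np
import pandas as pd

# Assumed: all labels are known up front (shortened here).
labels = ['Manufacturer', 'Model', 'Mileage', 'Registered', 'Year',
          'Additional equipment']
# Try longer labels first so a multi-word label wins over a shorter prefix.
labels_by_length = sorted(labels, key=len, reverse=True)


def split_item(item):
    """Return (label, value); value is NaN when nothing follows the label."""
    for label in labels_by_length:
        if item == label or item.startswith(label + ' '):
            value = item[len(label):].strip()
            return label, value if value else np.nan
    # Unknown label: fall back to splitting on the first whitespace.
    parts = item.split(maxsplit=1)
    return parts[0], parts[1] if len(parts) > 1 else np.nan


L = [['Manufacturer Hyundai', 'Model Tucson', 'Mileage 258000 km'],
     ['Manufacturer Mazda', 'Model 6', 'Year 2014', 'Additional equipment aaa']]

df = pd.DataFrame([dict(split_item(y) for y in x) for x in L])
print(df)
```

An item that consists of a label alone, such as 'Additional equipment', yields NaN here instead of being split into 'Additional' and 'equipment'.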