List of lists to dictionary to pandas DataFrame

I am trying to fit this data:

[['Manufacturer: Hyundai',
  'Model: Tucson',
  'Mileage: 258000 km',
  'Registered: 07/2019'],
 ['Manufacturer: Mazda',
  'Model: 6',
  'Year: 2014',
  'Registered: 07/2019']]

into a pandas DataFrame.

Not all labels are present in each record; for example, some records have 'Mileage' and others don't, and vice versa. I have a total of 26 features, and very few items have all of them.

I would like to construct a pandas DataFrame that holds the features in columns; if a feature doesn't exist for a record, the content should be 'NaN'.

I have:

colnames=['Manufacturer', 'Model', 'Mileage', 'Registered', 'Year'...(all 26 features here)] 
df = pd.read_csv("./data/output.csv", sep=",", names=colnames, header=None)

The first few mandatory columns give the expected output, but once optional features are involved, missing data causes every feature after the gap to end up under the wrong column. Records are mapped correctly only if all features are present.
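To illustrate the misalignment, here is a minimal sketch. It assumes the CSV stores one 'label: value' string per field (the real layout of output.csv is not shown here), and it uses an in-memory string and a shortened column list in place of the actual file.

```python
import io

import pandas as pd

# Assumed file content: one 'label: value' string per comma-separated field.
csv_text = (
    "Manufacturer: Hyundai,Model: Tucson,Mileage: 258000 km,Registered: 07/2019\n"
    "Manufacturer: Mazda,Model: 6,Year: 2014,Registered: 07/2019\n"
)

# Shortened stand-in for the full list of 26 feature names.
colnames = ['Manufacturer', 'Model', 'Mileage', 'Registered', 'Year']

# read_csv assigns fields to names purely by position, not by label.
df = pd.read_csv(io.StringIO(csv_text), sep=",", names=colnames, header=None)

# The second record has no mileage, so its 'Year: 2014' field lands in 'Mileage'.
print(df.loc[1, 'Mileage'])
```

Because the mapping is positional, any record that skips an optional feature shifts all later fields one column to the left.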

I forgot to mention that some features that are missing a value also lack the ':' but are still present in the list. So in these 2 cases:

  • 'Mileage', (value missing, and ':' is missing too)
  • 'Mileage' missing from the record altogether

the assignment for both cases should be 'NaN'.

Use a nested list comprehension to build a list of dictionaries and pass it to the DataFrame constructor; where a key is missing from a record, NaN is added:

import pandas as pd

L = [['Manufacturer: Hyundai',
  'Model: Tucson',
  'Mileage: 258000 km',
  'Registered: 07/2019'],
 ['Manufacturer: Mazda',
  'Model: 6',
  'Year: 2014',
  'Registered: 07/2019']]

df = pd.DataFrame([dict(y.split(':') for y in x) for x in L])
print (df)
  Manufacturer     Mileage    Model Registered   Year
0      Hyundai   258000 km   Tucson    07/2019    NaN
1        Mazda         NaN        6    07/2019   2014
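Note that splitting on ':' alone keeps the space after the colon in each value, and dict() raises a ValueError for an entry such as 'Mileage' that has no ':' at all. A possible variant that covers both cases from the question is sketched below. The helper name parse_record is hypothetical, and the sample data is modified so that the first record carries a bare 'Mileage' label.

```python
import numpy as np
import pandas as pd

# Sample data; the first record has a 'Mileage' label with no ':' and no value.
L = [['Manufacturer: Hyundai', 'Model: Tucson', 'Mileage', 'Registered: 07/2019'],
     ['Manufacturer: Mazda', 'Model: 6', 'Year: 2014', 'Registered: 07/2019']]


def parse_record(items):
    """Hypothetical helper: turn a list of 'label: value' strings into a dict."""
    record = {}
    for item in items:
        # partition() splits on the first ':' only and never raises;
        # sep is '' when the item contains no ':' at all.
        key, sep, value = item.partition(':')
        value = value.strip()
        # A missing ':' or an empty value becomes NaN.
        record[key.strip()] = value if sep and value else np.nan
    return record


df = pd.DataFrame([parse_record(x) for x in L])
print(df)
```

Labels that are absent from a record altogether are still filled with NaN by the DataFrame constructor, so both cases end up as NaN in the same column.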

EDIT: You can use .split(maxsplit=1) to split on the first whitespace:

L = [['Manufacturer Hyundai',
  'Model Tucson',
  'Mileage 258000 km',
  'Registered 07/2019'],
 ['Manufacturer Mazda',
  'Model 6',
  'Year 2014',
  'Registered 07/2019']]


df = pd.DataFrame([dict(y.split(maxsplit=1) for y in x) for x in L])
print (df)

  Manufacturer    Mileage   Model Registered  Year
0      Hyundai  258000 km  Tucson    07/2019   NaN
1        Mazda        NaN       6    07/2019  2014
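With whitespace-separated data, an entry that consists of the label alone (for example 'Mileage') splits into a single element, which dict() cannot consume. One way to tolerate that is sketched here. The helper split_label is a hypothetical name, and the sketch assumes that no entry is an empty string.

```python
import numpy as np
import pandas as pd

# Sample data; 'Mileage' in the first record has no value after the label.
L = [['Manufacturer Hyundai', 'Model Tucson', 'Mileage', 'Registered 07/2019'],
     ['Manufacturer Mazda', 'Model 6', 'Year 2014', 'Registered 07/2019']]


def split_label(item):
    """Hypothetical helper: return (label, value), using NaN when no value follows."""
    parts = item.split(maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return parts[0], np.nan


df = pd.DataFrame([dict(split_label(y) for y in x) for x in L])
print(df)
```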

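EDIT: For a label made of two words, such as 'Additional equipment', list it in words2 and split those entries on the second whitespace: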

L = [['Manufacturer  Hyundai',
  'Model  Tucson',
  'Mileage  258000 km',
  'Registered  07/2019'],
 ['Manufacturer  Mazda',
  'Model  6',
  'Year  2014',
  'Registered  07/2019',
  'Additional equipment aaa']]

words2 = ['Additional equipment']

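L1 = []
for x in L:
    di = {}
    for y in x:
        for word in words2:
            # two-word label: join its parts with '_' and keep the rest as the value
            if set(word.split(maxsplit=2)[:2]) < set(y.split()):
                i, j, k = y.split(maxsplit=2)
                di['_'.join([i, j])] = k
                break
        else:
            # no two-word label matched: split on the first whitespace only
            i, j = y.split(maxsplit=1)
            di[i] = j
    L1.append(di)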

df = pd.DataFrame(L1)
print (df)
  Additional_equipment Manufacturer    Mileage   Model Registered  Year
0                  NaN      Hyundai  258000 km  Tucson    07/2019   NaN
1                  aaa        Mazda        NaN       6    07/2019  2014
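If the DataFrame should always contain the full set of feature columns, including features that occur in no record, the result can be reindexed against the known column list. The sketch below assumes a shortened, hypothetical stand-in for the 26 names and starts from already-parsed dictionaries.

```python
import pandas as pd

# Shortened stand-in for the full list of 26 feature names.
colnames = ['Manufacturer', 'Model', 'Mileage', 'Registered', 'Year']

# Already-parsed records; neither of them has 'Mileage' or 'Year'.
records = [{'Manufacturer': 'Hyundai', 'Model': 'Tucson', 'Registered': '07/2019'},
           {'Manufacturer': 'Mazda', 'Model': '6', 'Registered': '07/2019'}]

# reindex() adds the missing columns filled with NaN and fixes the column order.
df = pd.DataFrame(records).reindex(columns=colnames)
print(df.columns.tolist())
```

This also makes the column order independent of the order in which labels happen to appear in the data.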
