Need to create a Pandas DataFrame by reading a csv file with random columns

Question

I have the following csv file with records:

A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8

My column headers/names are: A, B, C, D, E, F, G

So my initial dataframe after using "read_csv" becomes:

A     B     C      D       E      F      G   
A 1   B 2   C 10   D 15   NaN    NaN    NaN
A 5   D 10  G 2    NaN    NaN    NaN    NaN
D 6   E 7   NaN    NaN    NaN    NaN    NaN
H 7   G 8   NaN    NaN    NaN    NaN    NaN
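
For reference, a frame like the one above can be reproduced with a call along these lines. The question does not show the actual read_csv call, so the arguments here are assumptions: names= supplies the seven headers, and skipinitialspace=True is added so that cells do not keep the blank that follows each comma.

```python
from io import StringIO

import pandas as pd

data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""

# Assumed call: the rows are shorter than the header list, so the
# trailing columns (E, F, G here) are filled with NaN.
df = pd.read_csv(
    StringIO(data),
    header=None,
    names=list("ABCDEFG"),
    skipinitialspace=True,  # "B 2" instead of " B 2"
)
print(df)
```

Each cell still holds the raw "name value" text under whichever header it happened to land in, which is the problem described next.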

Each value can be separated into [column name][column value], so A 1 means col=A and value=1, D 15 means col=D and value=15, and so on.

What I want is to assign each numeric value to the appropriate column based on the column name in the cell, and end up with a dataframe that looks like this:

A     B     C      D       E      F      G   
A 1   B 2   C 10   D 15   NaN    NaN    NaN
A 5   NaN   NaN    D 10   NaN    NaN    G 2
NaN   NaN   NaN    D 6    E 7    NaN    NaN
NaN   NaN   NaN    NaN    NaN    NaN    G 8

Even better would be just the values alone:

A     B     C      D       E      F      G   
1     2     10     15      NaN    NaN    NaN
5     NaN   NaN    10      NaN    NaN    2
NaN   NaN   NaN    6       7      NaN    NaN
NaN   NaN   NaN    NaN     NaN    NaN    8

Answer 1

You can loop through the rows with apply (axis=1) and, after splitting each cell, construct a pandas Series for each row from the resulting key/value pairs. The newly constructed Series are aligned automatically by their index. Note that the result has no F column but an extra H column; I am not sure whether that is what you need, but removing H and adding an all-NaN F column should be straightforward:

df.apply(lambda r: pd.Series({x[0]: x[1] for x in r.str.split(' ') 
                                    if isinstance(x, list) and len(x) == 2}), axis = 1)


#     A   B   C   D   E   G   H
#0    1   2  10  15 NaN NaN NaN
#1    5 NaN NaN  10 NaN   2 NaN
#2  NaN NaN NaN   6   7 NaN NaN
#3  NaN NaN NaN NaN NaN   8   7
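
The "remove H, add F" step mentioned above can be sketched with reindex. This is a minimal sketch, not the answerer's code: the helper name row_to_series is hypothetical, the input is read with header=None and skipinitialspace=True (an assumption), and cells are split on any whitespace rather than on a single space.

```python
from io import StringIO

import pandas as pd

data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""

df = pd.read_csv(StringIO(data), header=None, skipinitialspace=True)


def row_to_series(row):
    # Hypothetical helper: turn the non-missing "name value" cells of one
    # row into a Series keyed by the name part.
    pairs = (cell.split() for cell in row.dropna())
    return pd.Series({key: value for key, value in pairs})


wide = df.apply(row_to_series, axis=1)

# reindex keeps only the listed columns (dropping H) and creates any
# missing ones (F) filled with NaN; astype(float) converts the strings.
result = wide.reindex(columns=list("ABCDEFG")).astype(float)
print(result)
```

Because reindex both drops and adds columns in one call, the column list is the only thing that needs to change if the expected headers differ.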

Answer 2

Apply solution:

Split on whitespace, remove the NaN rows with dropna, call set_index, and convert the one-column DataFrame to a Series with DataFrame.squeeze. Finally, reindex by the new column names:

print (df.apply(lambda x: x.str.split(expand=True)
                               .dropna()
                               .set_index(0)
                               .squeeze(), axis=1)
         .reindex(columns=list('ABCDEFGH')))

     A    B    C    D    E   F    G    H
0    1    2   10   15  NaN NaN  NaN  NaN
1    5  NaN  NaN   10  NaN NaN    2  NaN
2  NaN  NaN  NaN    6    7 NaN  NaN  NaN
3  NaN  NaN  NaN  NaN  NaN NaN    8    7

Stack solution:

Use stack to create a Series, split on whitespace to create new columns, append the column holding the new column names (A, B, ...) to the index with set_index, convert the one-column DataFrame to a Series with DataFrame.squeeze, remove the index level holding the old column names with reset_index, unstack, reindex by the new column names (this adds the missing columns filled with NaN), convert the values to float with astype, and finally remove the columns name with rename_axis (new in pandas 0.18.0):

print (df.stack()
         .str.split(expand=True)
         .set_index(0, append=True)
         .squeeze()
         .reset_index(level=1, drop=True)
         .unstack()
         .reindex(columns=list('ABCDEFGH'))
         .astype(float)
         .rename_axis(None, axis=1))

     A    B     C     D    E   F    G    H
0  1.0  2.0  10.0  15.0  NaN NaN  NaN  NaN
1  5.0  NaN   NaN  10.0  NaN NaN  2.0  NaN
2  NaN  NaN   NaN   6.0  7.0 NaN  NaN  NaN
3  NaN  NaN   NaN   NaN  NaN NaN  8.0  7.0
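
To make the chain easier to follow, the same steps can be written one at a time. This is a sketch under a few assumptions: the intermediate names (stacked, parts, long, wide) and the "key"/"value" labels are illustrative, the input is read with header=None, and an explicit dropna() is added after stack in case the installed pandas version keeps missing cells when stacking.

```python
from io import StringIO

import pandas as pd

data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""

df = pd.read_csv(StringIO(data), header=None, skipinitialspace=True)

# 1. Long format: one entry per cell, indexed by (row, old column).
stacked = df.stack().dropna()

# 2. Split "A 1" into two columns: the name and the value.
parts = stacked.str.split(expand=True)
parts.columns = ["key", "value"]

# 3. Move the name into the index, keep the value as a Series.
long = parts.set_index("key", append=True)["value"]

# 4. Drop the old positional column level, leaving (row, key).
long = long.reset_index(level=1, drop=True)

# 5. Pivot the key level back into columns and tidy up.
wide = (long.unstack()
            .reindex(columns=list("ABCDEFGH"))
            .astype(float)
            .rename_axis(None, axis=1))
print(wide)
```

Naming the two split columns avoids the squeeze step, since selecting "value" already yields a Series.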

Answer 3

Here is the code:

res = pd.DataFrame(index=df.index, columns=list('ABCDEFGH'))

def classifier(row):
    cols = row.str.split().str[0].dropna().tolist()
    vals = row.str.split().str[1].dropna().tolist()
    res.loc[row.name, cols] = vals

df.apply(classifier, axis=1)

Input:

from io import StringIO
import pandas as pd
import numpy as np

data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""

df = pd.read_csv(StringIO(data), header=None)
print("df:\n", df)

res = pd.DataFrame(index=df.index, columns=list('ABCDEFGH'))

def classifier(row):
    cols = row.str.split().str[0].dropna().tolist()
    vals = row.str.split().str[1].dropna().tolist()
    res.loc[row.name, cols] = vals
df.apply(classifier, axis=1)

print("\nres:\n", res)

Output:

df:
    0    1     2     3
0   A 1  B 2   C 10  D 15
1   A 5  D 10  G 2   NaN
2   D 6  E 7   NaN   NaN
3   H 7  G 8   NaN   NaN

res:
    A   B   C   D   E   F   G   H
0   1   2   10  15  NaN NaN NaN NaN
1   5   NaN NaN 10  NaN NaN 2   NaN
2   NaN NaN NaN 6   7   NaN NaN NaN
3   NaN NaN NaN NaN NaN NaN 8   7
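
The values in res above are still strings. A possible variant, sketched here rather than taken from the answer, builds one dict per row instead of writing into an outer frame from inside apply, and then converts every column with pd.to_numeric. The name records is illustrative.

```python
from io import StringIO

import pandas as pd

data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""

df = pd.read_csv(StringIO(data), header=None)

# One dict per row: each non-missing cell such as " B 2" splits into a
# (name, value) pair; split() with no argument ignores the leading blank.
records = [dict(cell.split() for cell in row.dropna())
           for _, row in df.iterrows()]

res = (pd.DataFrame(records, index=df.index)
         .reindex(columns=list("ABCDEFGH"))  # fixed column order, adds F
         .apply(pd.to_numeric))              # strings -> numbers, NaN kept
print(res)
```

Building the records first keeps the function free of side effects, at the cost of an explicit Python loop over the rows.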

3 answers

Answer 1
2 2016-09-22 21:35:26

Answer 2
2 2016-09-23 08:22:23

Answer 3
0 2016-09-23 07:26:28
