Need to create a Pandas dataframe by reading a csv file with random columns
I have the following csv file with records:
My column headers/names are: A, B, C, D, E, F, G
So my initial dataframe after using "read_csv" becomes:
A      B      C      D      E      F      G
A 1    B 2    C 10   D 15   NaN    NaN    NaN
A 5    D 10   G 2    NaN    NaN    NaN    NaN
D 6    E 7    NaN    NaN    NaN    NaN    NaN
H 7    G 8    NaN    NaN    NaN    NaN    NaN
Each value can be separated into [column name][column value], so A 1 means col=A and value=1, D 15 means col=D and value=15, etc.
What I want is to assign each numeric value to the appropriate column based on the column name, and end up with a dataframe that looks like this:
A      B      C      D      E      F      G
A 1    B 2    C 10   D 15   NaN    NaN    NaN
A 5    NaN    NaN    D 10   NaN    NaN    G 2
NaN    NaN    NaN    D 6    E 7    NaN    NaN
NaN    NaN    NaN    NaN    NaN    NaN    G 8
And even better, just the values alone:
A      B      C      D      E      F      G
1      2      10     15     NaN    NaN    NaN
5      NaN    NaN    10     NaN    NaN    2
NaN    NaN    NaN    6      7      NaN    NaN
NaN    NaN    NaN    NaN    NaN    NaN    8
You can loop through the rows with apply (axis=1) and construct a pandas Series for each row from the key-value pairs you get after splitting each cell. The newly constructed Series are automatically aligned by their index. Note that the result has no F column but an extra H column; I am not sure whether that is what you need, but removing H and adding an all-NaN F column should be straightforward:
df.apply(lambda r: pd.Series({x[0]: x[1] for x in r.str.split(' ')
if isinstance(x, list) and len(x) == 2}), axis = 1)
# A B C D E G H
#0 1 2 10 15 NaN NaN NaN
#1 5 NaN NaN 10 NaN 2 NaN
#2 NaN NaN NaN 6 7 NaN NaN
#3 NaN NaN NaN NaN NaN 8 7
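The "remove H, add F" step mentioned above can be done with a single reindex. This is only a sketch and not part of the original answer: the hand-built df stands in for the question's frame, and the helper name row_to_series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the frame produced by read_csv in the question.
df = pd.DataFrame(
    [
        ["A 1", "B 2", "C 10", "D 15"],
        ["A 5", "D 10", "G 2", np.nan],
        ["D 6", "E 7", np.nan, np.nan],
        ["H 7", "G 8", np.nan, np.nan],
    ]
)

def row_to_series(r):
    # Hypothetical helper: split every non-missing cell into (name, value).
    pairs = [cell.split() for cell in r.dropna()]
    return pd.Series({name: value for name, value in pairs})

wide = df.apply(row_to_series, axis=1)

# reindex keeps only the listed columns (dropping H) and creates any
# missing ones (F) filled with NaN; astype(float) turns the value
# strings into numbers.
result = wide.reindex(columns=list("ABCDEFG")).astype(float)
print(result)
```

Because reindex both selects and creates columns, the wanted column list is the only thing that needs to change if the headers differ.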
Apply solution:
For each row, use split by whitespace, remove the NaN rows with dropna, call set_index on the name column, and convert the resulting one-column DataFrame to a Series with DataFrame.squeeze. Finally, reindex by the new column names:
print (df.apply(lambda x: x.str.split(expand=True)
.dropna()
.set_index(0)
.squeeze(), axis=1)
.reindex(columns=list('ABCDEFGH')))
A B C D E F G H
0 1 2 10 15 NaN NaN NaN NaN
1 5 NaN NaN 10 NaN NaN 2 NaN
2 NaN NaN NaN 6 7 NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN 8 7
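To see what the lambda does to one row, here are the same steps unrolled on a single hand-written row. This is just an illustrative sketch; the variable names are mine and the row is assumed to look like the second record of the question.

```python
import numpy as np
import pandas as pd

# One row of the question's frame, written out by hand (assumption).
row = pd.Series(["A 5", "D 10", "G 2", np.nan])

parts = row.str.split(expand=True)  # column 0 = name, column 1 = value
parts = parts.dropna()              # drop the entry that came from the empty cell
named = parts.set_index(0)          # the names become the index
values = named.squeeze()            # one-column DataFrame -> Series

print(values)
```

One thing to be aware of: squeeze on a 1x1 DataFrame returns a scalar rather than a Series, so a row holding a single pair would behave differently. Selecting the column explicitly (named[1]) or calling squeeze("columns") should avoid that.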
Stack solution:
Use stack to create a Series, split by whitespace to create new columns, append the column holding the new column names (A, B, ...) to the index with set_index, and convert the one-column DataFrame to a Series with DataFrame.squeeze. Then remove the index level holding the old column names with reset_index, unstack, reindex by the new column names (this adds the missing columns filled with NaN), convert the values to float with astype, and finally remove the columns name with rename_axis (new in pandas 0.18.0):
print (df.stack()
.str.split(expand=True)
.set_index(0, append=True)
.squeeze()
.reset_index(level=1, drop=True)
.unstack()
.reindex(columns=list('ABCDEFGH'))
.astype(float)
.rename_axis(None, axis=1))
A B C D E F G H
0 1.0 2.0 10.0 15.0 NaN NaN NaN NaN
1 5.0 NaN NaN 10.0 NaN NaN 2.0 NaN
2 NaN NaN NaN 6.0 7.0 NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN 8.0 7.0
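The chain is easier to follow with the intermediate results named. This is a sketch under a few assumptions: the df is rebuilt by hand, the variable names are mine, an explicit dropna is added after stack so the result does not depend on whether stack itself drops missing cells, and the value column is picked with [1] instead of squeeze.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the frame produced by read_csv in the question.
df = pd.DataFrame(
    [
        ["A 1", "B 2", "C 10", "D 15"],
        ["A 5", "D 10", "G 2", np.nan],
        ["D 6", "E 7", np.nan, np.nan],
        ["H 7", "G 8", np.nan, np.nan],
    ]
)

stacked = df.stack().dropna()                    # Series indexed by (row, old column)
pairs = stacked.str.split(expand=True)           # column 0 = name, column 1 = value
keyed = pairs.set_index(0, append=True)          # index levels: row, old column, name
series = keyed[1]                                # the value column as a Series
series = series.reset_index(level=1, drop=True)  # drop the old column level
wide = series.unstack()                          # names become the columns

result = (wide.reindex(columns=list("ABCDEFGH"))
              .astype(float)
              .rename_axis(None, axis=1))
print(result)
```

Printing each intermediate variable is a quick way to check which index level is which before calling unstack.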
Here is the code:
res = pd.DataFrame(index=df.index, columns=list('ABCDEFGH'))
def classifier(row):
    cols = row.str.split().str[0].dropna().tolist()
    vals = row.str.split().str[1].dropna().tolist()
    res.loc[row.name, cols] = vals
df.apply(classifier, axis=1)
Input:
from io import StringIO
import pandas as pd
import numpy as np
data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""
df = pd.read_csv(StringIO(data), header=None)
print("df:\n", df)
res = pd.DataFrame(index=df.index, columns=list('ABCDEFGH'))
def classifier(row):
    cols = row.str.split().str[0].dropna().tolist()
    vals = row.str.split().str[1].dropna().tolist()
    res.loc[row.name, cols] = vals
df.apply(classifier, axis=1)
print("\nres:\n", res)
Output:
df:
0 1 2 3
0 A 1 B 2 C 10 D 15
1 A 5 D 10 G 2 NaN
2 D 6 E 7 NaN NaN
3 H 7 G 8 NaN NaN
res:
A B C D E F G H
0 1 2 10 15 NaN NaN NaN NaN
1 5 NaN NaN 10 NaN NaN 2 NaN
2 NaN NaN NaN 6 7 NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN 8 7
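Note that the cells in res are still strings at this point, even though they print like numbers. If numeric values are wanted, one possible follow-up is astype(float). The sketch below is a variation rather than the answer's exact code: it fills res with a plain loop over iterrows instead of apply, and the name numeric is my own.

```python
from io import StringIO
import pandas as pd

data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""
df = pd.read_csv(StringIO(data), header=None)

res = pd.DataFrame(index=df.index, columns=list("ABCDEFGH"))

# Plain loop instead of apply (a swapped-in variation): split each
# non-missing cell and write the value into the column it names.
for idx, row in df.iterrows():
    for cell in row.dropna():
        name, value = cell.split()
        res.loc[idx, name] = value

numeric = res.astype(float)  # value strings -> floats, NaN stays NaN
print(numeric.dtypes)
```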