Pandas/Python Converting two columns to matrix. Column names in matrix
I can successfully convert the two columns to a matrix using the following commands.
dfb = datab.parse("a")
dfb
     Name        Product
0    Mike     Apple,pear
1    John  Orange,Banana
2     Bob         Banana
3  Connie           Pear
pd.get_dummies(dfb.Product).groupby(dfb.Name).apply(max)
        Apple,pear  Banana  Orange,Banana  Pear
Name
Bob              0       1              0     0
Connie           0       0              0     1
John             0       0              1     0
Mike             1       0              0     0
However, the matrix that I want is the following.
        Apple  Banana  Orange  Pear
Name
Bob         0       1       0     0
Connie      0       0       0     1
John        0       1       1     0
Mike        1       0       0     1
1.
You need set_index with str.get_dummies:
df = dfb.set_index('Name').Product.str.get_dummies(',')
print(df)
        Apple  Banana  Orange  Pear
Name
Mike        1       0       0     1
John        0       1       1     0
Bob         0       1       0     0
Connie      0       0       0     1
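One caveat, assuming the Product values really are spelled as printed in the question ('Apple,pear' with a lower-case p): str.get_dummies is case-sensitive, so 'pear' and 'Pear' would end up in separate columns. A minimal sketch that normalises the case with str.title() first; the sample frame is rebuilt by hand here, which is an assumption since the original was parsed from a spreadsheet.

```python
import pandas as pd

# Sample data retyped from the question (assumption: the real data came from datab.parse("a")).
dfb = pd.DataFrame({
    "Name": ["Mike", "John", "Bob", "Connie"],
    "Product": ["Apple,pear", "Orange,Banana", "Banana", "Pear"],
})

# Without normalisation, 'pear' and 'Pear' are treated as two different labels.
raw = dfb.set_index("Name").Product.str.get_dummies(",")

# Title-casing each string first merges them into a single 'Pear' column.
df = dfb.set_index("Name").Product.str.title().str.get_dummies(",")
print(df)
```

If the real data is already consistently capitalised, the str.title() step is unnecessary.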
2.
Solution with pandas.get_dummies and split to build a new DataFrame, then groupby by columns (so axis=1 and level=0) and aggregate with max:
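dfb = dfb.set_index('Name')
# Parentheses let the method chain continue on the next line.
# Note: groupby(axis=1) is deprecated since pandas 2.1; there, use .T.groupby(level=0).max().T
df = (pd.get_dummies(dfb.Product.str.split(',', expand=True), prefix='', prefix_sep='')
        .groupby(axis=1, level=0).max())
print(df)
        Apple  Banana  Orange  Pear
Name
Mike        1       0       0     1
John        0       1       1     0
Bob         0       1       0     0
Connie      0       0       0     1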
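To see why the column-wise groupby is needed, it may help to look at the intermediate frame: splitting with expand=True yields one positional column per product slot, and get_dummies with an empty prefix then produces the same product name more than once. The sketch below groups the transposed frame instead of passing axis=1, which is one way to get the same result on newer pandas versions. The sample data is retyped by hand and assumed to be consistently capitalised.

```python
import pandas as pd

# Hand-built sample (assumption: product names are consistently capitalised).
dfb = pd.DataFrame({
    "Name": ["Mike", "John", "Bob", "Connie"],
    "Product": ["Apple,Pear", "Orange,Banana", "Banana", "Pear"],
}).set_index("Name")

# One column per position in the comma-separated list (0, 1, ...).
parts = dfb.Product.str.split(",", expand=True)

# Empty prefix: dummy columns are named after the product only,
# so 'Banana' and 'Pear' each appear twice (once per position).
dummies = pd.get_dummies(parts, prefix="", prefix_sep="")

# Collapse duplicate column names by grouping the transposed frame on its index.
df = dummies.T.groupby(level=0).max().T.astype(int)
print(df)
```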
3.
Solution with split and MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(dfb.Product.str.split(',')),
                  columns=mlb.classes_,
                  index=dfb.Name)
print(df)
        Apple  Banana  Orange  Pear
Name
Mike        1       0       0     1
John        0       1       1     0
Bob         0       1       0     0
Connie      0       0       0     1
If there are duplicates in column Name:
df = df.groupby('Name').max()
print(df)
        Apple  Banana  Orange  Pear
Name
Bob         0       1       0     0
Connie      0       0       0     1
John        0       1       1     0
Mike        1       0       0     1
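The sample data has no repeated names, so here is a small sketch of the duplicate case with a made-up frame (the rows are hypothetical, not from the question). Using max rather than sum keeps the result as 0/1 flags even when the same person lists the same product twice.

```python
import pandas as pd

# Hypothetical data: 'Mike' appears on two rows and lists 'Apple' on both.
dup = pd.DataFrame({
    "Name": ["Mike", "Mike", "Bob"],
    "Product": ["Apple", "Pear,Apple", "Banana"],
})

# One indicator row per original row; 'Mike' is duplicated in the index.
wide = dup.set_index("Name").Product.str.get_dummies(",")

# Merge rows that share a name; max keeps each cell at 0 or 1.
df = wide.groupby(level="Name").max()
print(df)
```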
See Timing Below
Option 1
pir0 = lambda dfb: pd.get_dummies(dfb.Name).T.dot(
    dfb.Product.str.title().str.get_dummies(','))
pir0(dfb)
        Apple  Banana  Orange  Pear
Bob         0       1       0     0
Connie      0       0       0     1
John        0       1       1     0
Mike        1       0       0     1
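Option 1 is a matrix product: the transposed name indicators (names x rows) times the product indicators (rows x products) gives, for each name and product, the number of rows where both occur. A rough sketch with hypothetical duplicate rows shows that the product yields counts, which is presumably why the timed version of pir0 further down appends .astype(bool).astype(int).

```python
import pandas as pd

# Hypothetical data: 'Mike' lists 'Apple' on two separate rows.
dfb = pd.DataFrame({
    "Name": ["Mike", "Mike", "Bob"],
    "Product": ["Apple,pear", "Apple", "Banana"],
})

# Row-by-name and row-by-product indicator matrices (cast to int for the dot product).
names = pd.get_dummies(dfb.Name).astype(int)
products = dfb.Product.str.title().str.get_dummies(",")

# (names x rows) . (rows x products) -> co-occurrence counts per name and product.
counts = names.T.dot(products)

# Clamp counts back to 0/1 flags.
flags = counts.astype(bool).astype(int)
print(flags)
```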
Option 2
import numpy as np
from cytoolz import concat
def pir1(dfb):
    # integer codes and unique values for the names (rows)
    f0, u0 = pd.factorize(dfb.Name.values)
    p = [x.title().split(',') for x in dfb.Product.values.tolist()]
    l = [len(y) for y in p]
    # integer codes and unique values for the products (columns)
    f1, u1 = pd.factorize(list(concat(p)))
    n, m = u0.size, u1.size
    # count occurrences of each flattened (row, column) position
    return pd.DataFrame(
        np.bincount(f0.repeat(l) * m + f1, minlength=n * m).reshape(n, m),
        u0, u1)
pir1(dfb)
        Apple  Pear  Orange  Banana
Mike        1     1       0       0
John        0     0       1       1
Bob         0     0       0       1
Connie      0     1       0       0
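The index arithmetic in Option 2 can be unpacked step by step. This sketch assumes the question's sample values and swaps cytoolz.concat for itertools.chain.from_iterable from the standard library, so it is a variant of the approach rather than the exact code above. Each (row, column) pair is mapped to the flat position row * m + column, and np.bincount counts how often each position occurs.

```python
from itertools import chain

import numpy as np
import pandas as pd

# Sample values retyped from the question (assumption).
names = np.array(["Mike", "John", "Bob", "Connie"], dtype=object)
products = ["Apple,pear", "Orange,Banana", "Banana", "Pear"]

f0, u0 = pd.factorize(names)                 # row code for every record
p = [x.title().split(",") for x in products]  # normalise case, then split
lengths = [len(y) for y in p]                # number of products per record
flat_products = np.array(list(chain.from_iterable(p)), dtype=object)
f1, u1 = pd.factorize(flat_products)         # column code for every product mention
n, m = u0.size, u1.size

# Repeat each row code once per product, then flatten (row, col) -> row * m + col.
flat = f0.repeat(lengths) * m + f1
counts = np.bincount(flat, minlength=n * m).reshape(n, m)

df = pd.DataFrame(counts, u0, u1)
print(df)
```

Columns come out in order of first appearance (Apple, Pear, Orange, Banana) because factorize does not sort by default.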
Option 3
def pir2(dfb):
    f0, u0 = pd.factorize(dfb.Name.values)
    p = [x.title().split(',') for x in dfb.Product.values.tolist()]
    l = [len(y) for y in p]
    f1, u1 = pd.factorize(list(concat(p)))
    n, m = u0.size, u1.size
    # start from all zeros and set the (name, product) cells directly
    a = np.zeros((n, m), dtype=int)
    a[f0.repeat(l), f1] = 1
    return pd.DataFrame(a, u0, u1)
pir2(dfb)
        Apple  Pear  Orange  Banana
Mike        1     1       0       0
John        0     0       1       1
Bob         0     0       0       1
Connie      0     1       0       0
Timing
Code Below
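from timeit import timeit

results = pd.DataFrame(
    index=pd.Index([10, 30, 100, 300, 1000, 3000, 10000, 30000]),
    columns='pir0 pir1 pir2 jez0 jez1 jez2'.split()
)
for i in results.index:
    # repeat the sample frame i times to get larger inputs
    d = pd.concat([dfb] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        # DataFrame.set_value was removed in pandas 1.0; .at replaces it
        results.at[i, j] = timeit(stmt, setp, number=20)
# cast to float so that plot() sees numeric columns
ax = results.astype(float).plot(loglog=True)
ax.legend(ncol=2)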
pir0 = lambda dfb: pd.get_dummies(dfb.Name).T.dot(dfb.Product.str.title().str.get_dummies(',')).astype(bool).astype(int)
from cytoolz import concat
def pir1(dfb):
    f0, u0 = pd.factorize(dfb.Name.values)
    p = [x.title().split(',') for x in dfb.Product.values.tolist()]
    l = [len(y) for y in p]
    f1, u1 = pd.factorize(list(concat(p)))
    n, m = u0.size, u1.size
    return pd.DataFrame(
        np.bincount(f0.repeat(l) * m + f1, minlength=n * m).reshape(n, m).astype(bool).astype(int),
        u0, u1)
def pir2(dfb):
    f0, u0 = pd.factorize(dfb.Name.values)
    p = [x.title().split(',') for x in dfb.Product.values.tolist()]
    l = [len(y) for y in p]
    f1, u1 = pd.factorize(list(concat(p)))
    n, m = u0.size, u1.size
    a = np.zeros((n, m), dtype=int)
    a[f0.repeat(l), f1] = 1
    return pd.DataFrame(a, u0, u1)
jez0 = lambda dfb: dfb.set_index('Name').Product.str.get_dummies(',')
jez1 = lambda dfb: pd.get_dummies(
    dfb.set_index('Name').Product.str.split(',', expand=True),
    prefix='', prefix_sep='').groupby(axis=1, level=0).sum()
def jez2(dfb):
    mlb = MultiLabelBinarizer()
    return pd.DataFrame(
        mlb.fit_transform(dfb.Product.str.split(',')),
        dfb.Name, mlb.classes_
    )
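For readers who want to rerun the comparison without the `from __main__ import` setup string, timeit also accepts a callable. The following is a scaled-down sketch, not the benchmark above: it times only jez0, uses two small sizes, and rebuilds the sample frame by hand (all assumptions made to keep it quick and self-contained).

```python
from timeit import timeit

import pandas as pd

# Sample frame retyped from the question (assumption).
dfb = pd.DataFrame({
    "Name": ["Mike", "John", "Bob", "Connie"],
    "Product": ["Apple,pear", "Orange,Banana", "Banana", "Pear"],
})

def jez0(d):
    return d.set_index("Name").Product.str.get_dummies(",")

sizes = [10, 30]  # far smaller than the original 10..30000 range
results = pd.DataFrame(index=sizes, columns=["jez0"], dtype=float)

for i in sizes:
    d = pd.concat([dfb] * i, ignore_index=True)
    # Passing a callable avoids the string statement and setup arguments.
    results.at[i, "jez0"] = timeit(lambda: jez0(d), number=5)

print(results)
```

The other functions can be added as extra columns in the same way; absolute timings will depend on the machine and the pandas version.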