Reproduce R's summarise/reshape result in Python
I would like to reproduce the behaviour of R's aggregate function in Python, using the melt function.
The data in R looks like this:
library("dplyr")
data <- summarise(group_by(table, project, resourcetype),
count = n_distinct(resource_id))
project resourcetype count
<fctr> <fctr> <int>
1 1000001 O 7
2 1000002 O 6
3 1000003 O 18
4 1000004 C 1
5 1000004 I 1
6 1000004 O 19
7 1000005 I 2
8 1000005 O 11
9 1000006 O 4
reshape(as.data.frame(data),
timevar = "resourcetype",
idvar = "project",
direction = "wide",
sep = "_")
project count_O count_C count_I
1 1000001 7 NA NA
2 1000002 6 NA NA
3 1000003 18 NA NA
4 1000004 19 1 1
7 1000005 11 NA 2
9 1000006 4 NA NA
Now, in Python I get:
import pandas as pd
data = table.groupby(['project', 'resourcetype'], as_index=False)\
.agg({'resource_id': {'count': 'nunique'}})
project resourcetype resource_id
count
0 1000001 O 7
1 1000002 O 6
2 1000003 O 18
3 1000004 C 1
4 1000005 I 1
5 1000006 O 19
6 1000007 I 2
7 1000008 O 11
8 1000009 O 4
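I end up with a multi-index, which I had hoped to eliminate with as_index=False. The last column carries both resource_id and count in its header, and I want just count, as in R.
I tried to use the melt function in Python, but to no avail.
Edit: the original data is a table with 2000 rows and 19 columns.
Edit 2: regarding the multi-index problem.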
table.groupby(['project', 'resourcetype'])\
.agg({'resource_id': {'count': 'nunique'}}).reset_index()
project resourcetype resource_id
count
0 1000001 O 7
table.groupby(['project', 'resourcetype'])\
.agg({'resource_id': {'count': 'nunique'}})
resource_id
count
project resourcetype
1000001 O 7
What I want to get is:
project resourcetype count
0 1000001 O 7
Consider a pandas pivot, followed by updating the column names:
from io import StringIO
import pandas as pd
# REPRODUCIBLE EXAMPLE
text ="""
project resourcetype count
1000001 O 7
1000002 O 6
1000003 O 18
1000004 C 1
1000004 I 1
1000004 O 19
1000005 I 2
1000005 O 11
1000006 O 4
"""
df = pd.read_table(StringIO(text), sep=r"\s+")
# PIVOTED DATA
pvtdf = df.pivot(index='project', columns='resourcetype', values='count')
# RENAME COLUMNS WITH RESET_INDEX
pvtdf.columns = ['count_'+str(i) for i in pvtdf.columns.values]
pvtdf = pvtdf.reset_index()
print(pvtdf)
# project count_C count_I count_O
# 0 1000001 NaN NaN 7.0
# 1 1000002 NaN NaN 6.0
# 2 1000003 NaN NaN 18.0
# 3 1000004 1.0 1.0 19.0
# 4 1000005 NaN 2.0 11.0
# 5 1000006 NaN NaN 4.0
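The pivot above starts from the already-aggregated table. To get there from raw rows without the two-level resource_id/count header shown in the question, one option is named aggregation, available in pandas 0.25 and later. The following is only a sketch: the small table built here is made-up sample data standing in for the question's 2000-row table, and the resource_id values are placeholders.

```python
import pandas as pd

# Hypothetical raw data standing in for the question's `table`
table = pd.DataFrame({
    "project": [1000001, 1000001, 1000001, 1000004, 1000004, 1000004],
    "resourcetype": ["O", "O", "O", "C", "O", "O"],
    "resource_id": ["a", "b", "a", "c", "d", "e"],
})

# Named aggregation gives a single-level column called "count"
data = (
    table.groupby(["project", "resourcetype"], as_index=False)
         .agg(count=("resource_id", "nunique"))
)

# Spread resourcetype into columns and prefix the names, as in the pivot above
wide = (
    data.pivot(index="project", columns="resourcetype", values="count")
        .add_prefix("count_")
        .reset_index()
)
wide.columns.name = None  # drop the leftover "resourcetype" axis label

print(data)
print(wide)
```

Here data has the flat columns project, resourcetype and count that the question asks for, and wide is the reshaped form.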
The obvious solution :)
import pandas
import rpy2
from rpy2 import robjects
from rpy2.robjects import pandas2ri
rdf = robjects.r('''
library(dplyr)
# `table` must already exist in the embedded R session
data <- summarise(group_by(table, project, resourcetype),
count = n_distinct(resource_id))
# keep the reshaped result; without the assignment it is discarded
data <- reshape(as.data.frame(data),
timevar = "resourcetype",
idvar = "project",
direction = "wide",
sep = "_")
data[is.na(data)] <- NaN
data
''')
# ri2py_dataframe belongs to older rpy2 releases; newer ones convert through a conversion context
pd_df = pandas2ri.ri2py_dataframe(rdf)
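If R is not available at runtime, the summarise and reshape steps that the R snippet performs can probably be collapsed into a single pandas call. This is a sketch under the assumption that pivot_table with aggfunc="nunique" matches n_distinct for your data; the table below is made-up sample data, not the original.

```python
import pandas as pd

# Hypothetical raw data standing in for the question's `table`
table = pd.DataFrame({
    "project": [1000001, 1000001, 1000004, 1000004, 1000004, 1000005],
    "resourcetype": ["O", "O", "C", "O", "O", "I"],
    "resource_id": ["a", "b", "c", "d", "d", "e"],
})

# Count distinct resource_id per (project, resourcetype) and go wide in one step
wide = (
    table.pivot_table(index="project", columns="resourcetype",
                      values="resource_id", aggfunc="nunique")
         .add_prefix("count_")
         .reset_index()
)
wide.columns.name = None  # drop the leftover "resourcetype" axis label

print(wide)
```

Combinations that never occur come out as NaN, which plays the role of NA in the R output.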
Besides reshape, we can also use tidyr's pivot_wider:
r$> library(tidyr)
r$> library(dplyr)
r$> data = tribble(
~project, ~resourcetype, ~count,
1000001, "O", 7,
1000002, "O", 6,
1000003, "O", 18,
1000004, "C", 1,
1000004, "I", 1,
1000004, "O", 19,
1000005, "I", 2,
1000005, "O", 11,
1000006, "O", 4
)
r$> pivot_wider(
data,
names_from=resourcetype,
values_from=count,
names_glue="count_{resourcetype}"
)
# A tibble: 6 x 4
project count_O count_C count_I
<dbl> <dbl> <dbl> <dbl>
1 1000001 7 NA NA
2 1000002 6 NA NA
3 1000003 18 NA NA
4 1000004 19 1 1
5 1000005 11 NA 2
6 1000006 4 NA NA
In Python, you can replicate it with datar:
>>> from datar.all import f, tribble, pivot_wider
>>>
>>> df = tribble(
... f.project, f.resourcetype, f.count,
... 1000001, "O", 7,
... 1000002, "O", 6,
... 1000003, "O", 18,
... 1000004, "C", 1,
... 1000004, "I", 1,
... 1000004, "O", 19,
... 1000005, "I", 2,
... 1000005, "O", 11,
... 1000006, "O", 4,
... )
>>> df >> pivot_wider(
... names_from=f.resourcetype,
... names_glue="count_{resourcetype}",
... values_from=f.count,
... )
project count_C count_I count_O
<int64> <float64> <float64> <float64>
0 1000001 NaN NaN 7.0
1 1000002 NaN NaN 6.0
2 1000003 NaN NaN 18.0
3 1000004 1.0 1.0 19.0
4 1000005 NaN 2.0 11.0
5 1000006 NaN NaN 4.0
Disclaimer: I am the author of the datar package.
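Two small differences remain between the Python outputs in this thread and the R output: the columns come out sorted (count_C, count_I, count_O) rather than in order of first appearance (count_O, count_C, count_I), and the counts become floats because of the missing values. If that matters, a plain-pandas sketch along the following lines may help. It assumes the nullable Int64 dtype is acceptable for your downstream code; the data is the aggregated table from the question.

```python
import pandas as pd

# Aggregated data as shown in the question's R output
df = pd.DataFrame({
    "project": [1000001, 1000002, 1000003, 1000004, 1000004,
                1000004, 1000005, 1000005, 1000006],
    "resourcetype": ["O", "O", "O", "C", "I", "O", "I", "O", "O"],
    "count": [7, 6, 18, 1, 1, 19, 2, 11, 4],
})

# Order of first appearance, which is the column order R's reshape produced
order = df["resourcetype"].unique()

wide = (
    df.pivot(index="project", columns="resourcetype", values="count")
      .reindex(columns=order)   # O, C, I instead of the sorted C, I, O
      .astype("Int64")          # nullable integers: missing cells become <NA>
      .add_prefix("count_")
      .reset_index()
)
wide.columns.name = None

print(wide)
```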