From R to Python pandas: create id key series in order on duplicates
I want to create an id key for a series of strings that repeats in one column. The first ten rows should be id #1, the next ten #2, and so on. In R, this is simple and I get the expected result with dplyr.
R:
library(tidyverse)
question = c('q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5')
answer <- c('a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5')
df <- tibble(question, answer)  # data_frame() is deprecated in favor of tibble()
# A tibble: 15 x 2
question answer
<chr> <chr>
1 q1 a1
2 q2 a2
3 q3 a3
4 q4 a4
5 q5 a5
6 q1 a1
7 q2 a2
8 q3 a3
9 q4 a4
10 q5 a5
11 q1 a1
12 q2 a2
13 q3 a3
14 q4 a4
15 q5 a5
If we run just a group_by and a mutate to add a key to the series, we get what I want:
df2 <- df %>%
group_by(question) %>%
mutate(id = row_number())
# A tibble: 15 x 3
# Groups: question [5]
question answer id
<chr> <chr> <int>
1 q1 a1 1
2 q2 a2 1
3 q3 a3 1
4 q4 a4 1
5 q5 a5 1
6 q1 a1 2
7 q2 a2 2
8 q3 a3 2
9 q4 a4 2
10 q5 a5 2
And I finish with:
df2 <- df %>%
group_by(question) %>%
mutate(id = row_number()) %>%
spread(question, answer)
# final table:
# A tibble: 3 x 6
id q1 q2 q3 q4 q5
<int> <chr> <chr> <chr> <chr> <chr>
1 1 a1 a2 a3 a4 a5
2 2 a1 a2 a3 a4 a5
3 3 a1 a2 a3 a4 a5
Python:
Now, I can't figure out how to get the same result in pandas. I have tried groupby and merge, but no luck.
import pandas as pd
data = {'question': ['question one', 'question two',
'question three', 'question four',
'question five', 'question one',
'question two', 'question three',
'question four', 'question five',
'question one', 'question two',
'question three', 'question four', 'question five'],
'answer':['answer one', 'answer two', 'answer three',
'answer four', 'answer five', 'answer one',
'answer two', 'answer three', 'answer four',
'answer five', 'answer one', 'answer two',
'answer three', 'answer four', 'answer five']}
df = pd.DataFrame(data)
Using merge and reset_index() reorders the rows and assigns an id based on the new order, which is not what I want:
df2 = df.merge(df.drop_duplicates('question').reset_index(), on='question')
question answer_x index answer_y
0 question one answer one 0 answer one
1 question one answer one 0 answer one
2 question one answer one 0 answer one
3 question two answer two 1 answer two
4 question two answer two 1 answer two
5 question two answer two 1 answer two
Using groupby I get a mess that is also not what I want:
df['id'] = df.groupby('question').ngroup()
question answer id
0 question one answer one 2
1 question two answer two 4
2 question three answer three 3
3 question four answer four 1
4 question five answer five 0
5 question one answer one 2
6 question two answer two 4
7 question three answer three 3
8 question four answer four 1
9 question five answer five 0
How do I get the same output as with dplyr? Edit: To add more details, I need the output to match what dplyr gives me, as this is part of an automated system.
ngroup is the number of the group, not the number within a group. As the docs explain, the complement of this is given by cumcount.
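To make the difference concrete, here is a minimal sketch on a small made-up frame (the six-row data below is an assumption for illustration, not the asker's table). It puts the two calls side by side: ngroup labels each group, while cumcount numbers the rows inside each group starting at 0.

```python
import pandas as pd

# Small illustrative frame (made-up data mirroring the question's layout).
df = pd.DataFrame({
    "question": ["q1", "q2", "q3", "q1", "q2", "q3"],
    "answer":   ["a1", "a2", "a3", "a1", "a2", "a3"],
})

g = df.groupby("question")

# ngroup: one label per group, identical for every row of that group.
group_number = g.ngroup()

# cumcount: 0-based position of each row inside its own group.
within_group = g.cumcount()

print(df.assign(ngroup=group_number, cumcount=within_group))
```

The cumcount column is the one that plays the role of dplyr's row_number() after group_by, apart from starting at 0 instead of 1.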
Roughly, you can use assign for mutate, groupby/cumcount for row_number, and pivot for your spread:
In [306]: df.assign(id=df.groupby("question").cumcount()).pivot(index="id", columns="question", values="answer")
Out[306]:
question q1 q2 q3 q4 q5
id
0 a1 a2 a3 a4 a5
1 a1 a2 a3 a4 a5
and toss in a reset_index() if you want id to be a column.
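If the goal is to line the result up with the dplyr output, where ids start at 1 and id is an ordinary column, something along these lines may help. This is a sketch, assuming the short q1..q5 data from the R example; the name `wide` and the final rename_axis step (which only removes the leftover "question" label on the columns) are my own additions.

```python
import pandas as pd

# Same shape as the R example: five questions repeated three times.
df = pd.DataFrame({
    "question": ["q1", "q2", "q3", "q4", "q5"] * 3,
    "answer":   ["a1", "a2", "a3", "a4", "a5"] * 3,
})

wide = (
    # cumcount is 0-based, so add 1 to mimic dplyr's row_number().
    df.assign(id=df.groupby("question").cumcount() + 1)
      .pivot(index="id", columns="question", values="answer")
      .reset_index()              # turn the id index into a regular column
      .rename_axis(None, axis=1)  # drop the "question" label on the columns
)

print(wide)
```

The keyword form of pivot is used here because recent pandas versions no longer accept index, columns, and values as positional arguments.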
Unfortunately, to really match the expected output, we have to guarantee the column order, because pivot sorts the column labels. There are several open tickets on GitHub about how inconvenient this automatic sorting is, but we can restore the order manually. Switching back to the frame with the full English question text:
In [327]: d2 = df.assign(id=df.groupby("question").cumcount()).pivot(index="id", columns="question", values="answer")
In [328]: d2.reindex(df.question.drop_duplicates(), axis=1)
Out[328]:
question question one question two question three question four question five
id
0 answer one answer two answer three answer four answer five
1 answer one answer two answer three answer four answer five
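The steps above can be bundled into one small function. The sketch below is only an illustration: the helper name `spread_in_order` and its parameters are hypothetical, and it assumes each key/id pair occurs at most once (pivot raises an error otherwise). It uses unique(), which returns values in order of first appearance, to undo the sorting that pivot applies to the columns.

```python
import pandas as pd

def spread_in_order(df, key="question", value="answer"):
    """Hypothetical helper: widen df, keeping keys in first-appearance order."""
    ids = df.groupby(key).cumcount() + 1  # 1-based, like row_number()
    wide = df.assign(id=ids).pivot(index="id", columns=key, values=value)
    # pivot sorts the column labels; restore the order seen in the data.
    return wide.reindex(columns=df[key].unique())

questions = ["question one", "question two", "question three"]
answers = ["answer one", "answer two", "answer three"]
df = pd.DataFrame({"question": questions * 2, "answer": answers * 2})

result = spread_in_order(df)
print(result)
```

Without the reindex step the columns would come out alphabetically ("question one", "question three", "question two"), which is the reordering the automated system in the question cannot tolerate.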
With datar, you can replicate it easily as you did in R:
>>> from datar.all import c, f, tibble, group_by, mutate, row_number, pivot_wider
>>>
>>> question = c('q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5')
>>> answer = c('a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5')
>>>
>>> df = tibble(question, answer)
>>> df
question answer
0 q1 a1
1 q2 a2
2 q3 a3
3 q4 a4
4 q5 a5
5 q1 a1
6 q2 a2
7 q3 a3
8 q4 a4
9 q5 a5
10 q1 a1
11 q2 a2
12 q3 a3
13 q4 a4
14 q5 a5
>>> df2 = (df >>
... group_by(f.question) >>
... mutate(id = row_number()))
>>>
>>> df2
question answer id
0 q1 a1 1
1 q2 a2 1
2 q3 a3 1
3 q4 a4 1
4 q5 a5 1
5 q1 a1 2
6 q2 a2 2
7 q3 a3 2
8 q4 a4 2
9 q5 a5 2
10 q1 a1 3
11 q2 a2 3
12 q3 a3 3
13 q4 a4 3
14 q5 a5 3
[Groups: ['question'] (n=5)]
>>> df2 = (df >>
... group_by(f.question) >>
... mutate(id = row_number()) >>
... pivot_wider(names_from=f.question, values_from=f.answer))
>>>
>>> df2
id q1 q2 q3 q4 q5
0 1 a1 a2 a3 a4 a5
1 2 a1 a2 a3 a4 a5
2 3 a1 a2 a3 a4 a5
I am the author of the package. Feel free to submit issues if you have any questions.
I know the question is about how to obtain a solution in Python. Still, I will leave this R solution using data.table and reshape2.
library(data.table)
library(reshape2)
setDT(df)[,new := (1:.N), by = question]
dcast(df, new ~ question, value.var = "answer")
new q1 q2 q3 q4 q5
1: 1 a1 a2 a3 a4 a5
2: 2 a1 a2 a3 a4 a5
3: 3 a1 a2 a3 a4 a5