From R to Python pandas: create an id key series in order on duplicates

I want to create an id key for a series of strings that repeats in one column. The first ten rows should be id #1, the next ten #2, and so on (the example below uses blocks of five questions). In R this is simple, and I get the expected result with dplyr.

R:

library(tidyverse)

question <- c('q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5')
answer <- c('a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5')

# tibble() replaces the deprecated data_frame()
df <- tibble(question, answer)

# A tibble: 15 x 2
   question answer
   <chr>    <chr> 
 1 q1       a1    
 2 q2       a2    
 3 q3       a3    
 4 q4       a4    
 5 q5       a5    
 6 q1       a1    
 7 q2       a2    
 8 q3       a3    
 9 q4       a4    
10 q5       a5    
11 q1       a1    
12 q2       a2    
13 q3       a3    
14 q4       a4    
15 q5       a5 

If we run just a group_by and a mutate to add a key to the series, we get what I want:

df2 <- df %>% 
  group_by(question) %>% 
  mutate(id = row_number())

# A tibble: 15 x 3
# Groups:   question [5]
   question answer    id
   <chr>    <chr>  <int>
 1 q1       a1         1
 2 q2       a2         1
 3 q3       a3         1
 4 q4       a4         1
 5 q5       a5         1
 6 q1       a1         2
 7 q2       a2         2
 8 q3       a3         2
 9 q4       a4         2
10 q5       a5         2

And I finish with:

df2 <- df %>% 
  group_by(question) %>% 
  mutate(id = row_number()) %>% 
  spread(question, answer) 

# final table:
# A tibble: 3 x 6
      id    q1    q2    q3    q4    q5   
      <int> <chr> <chr> <chr> <chr> <chr>
    1     1 a1    a2    a3    a4    a5   
    2     2 a1    a2    a3    a4    a5   
    3     3 a1    a2    a3    a4    a5 

Python:

Now, I can't figure out how to get the same result in pandas. I have tried groupby and merge, but no luck.

import pandas as pd

data = {'question': ['question one', 'question two', 
                 'question three', 'question four', 
                 'question five', 'question one', 
                 'question two', 'question three', 
                 'question four', 'question five', 
                 'question one', 'question two', 
                 'question three', 'question four', 'question five'], 
    'answer':['answer one', 'answer two', 'answer three', 
              'answer four', 'answer five', 'answer one', 
              'answer two', 'answer three', 'answer four', 
              'answer five', 'answer one', 'answer two', 
              'answer three', 'answer four', 'answer five']}

df = pd.DataFrame(data)

Using merge and reset_index() reorders the rows and assigns the id based on the new order, which is not what I want:

df2 = df.merge(df.drop_duplicates('question').reset_index(), on='question')

          question      answer_x  index      answer_y
0     question one    answer one      0    answer one
1     question one    answer one      0    answer one
2     question one    answer one      0    answer one
3     question two    answer two      1    answer two
4     question two    answer two      1    answer two
5     question two    answer two      1    answer two

Using groupby I get a mess that is also not what I want:

df['id'] = df.groupby('question').ngroup()

          question        answer  id
0     question one    answer one   2
1     question two    answer two   4
2   question three  answer three   3
3    question four   answer four   1
4    question five   answer five   0
5     question one    answer one   2
6     question two    answer two   4
7   question three  answer three   3
8    question four   answer four   1
9    question five   answer five   0

How do I get the same output as with dplyr?

Edit: To add more detail, I need the output to match what dplyr gives me, as this is part of an automated system.

ngroup is the number of the group, not the number within a group. As the docs explain, the complement of this is given by cumcount.
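A minimal sketch of the difference, using a small made-up frame (the values are illustrative, not the question's data): ngroup gives every row of a group the same label, numbered by sorted key order, while cumcount counts occurrences within each group, starting at 0.

```python
import pandas as pd

# Small illustrative frame: three questions, each appearing twice.
df = pd.DataFrame({"question": ["q1", "q2", "q3", "q1", "q2", "q3"]})

grouped = df.groupby("question")

result = df.assign(
    ngroup=grouped.ngroup(),      # one label per group, same for all its rows
    cumcount=grouped.cumcount(),  # 0-based position of the row within its group
)
print(result)
```

The cumcount column is the one that plays the role of dplyr's row_number() here, apart from starting at 0 instead of 1.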

Roughly, you can use assign for mutate, groupby/cumcount for row_number, and pivot for your spread:

In [306]: df.assign(id=df.groupby("question").cumcount()).pivot(index="id", columns="question", values="answer")
Out[306]: 
question  q1  q2  q3  q4  q5
id                          
0         a1  a2  a3  a4  a5
1         a1  a2  a3  a4  a5

and toss in a reset_index() if you want id to be a column.
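Putting those pieces together, here is a sketch that rebuilds the question's 15-row q/a data and aims for the dplyr layout. Two things here are my own additions rather than part of the recipe above: the `+ 1`, so the ids start at 1 like row_number(), and clearing the column-axis name, which is purely cosmetic. pivot is called with keyword arguments, which recent pandas versions require.

```python
import pandas as pd

# Rebuild the question's example: five questions repeated three times.
df = pd.DataFrame({
    "question": ["q1", "q2", "q3", "q4", "q5"] * 3,
    "answer": ["a1", "a2", "a3", "a4", "a5"] * 3,
})

wide = (
    df.assign(id=df.groupby("question").cumcount() + 1)  # +1 mimics row_number()
      .pivot(index="id", columns="question", values="answer")
      .reset_index()  # turn the id index back into a regular column
)
wide.columns.name = None  # drop the leftover "question" label on the column axis

print(wide)
```

Because the keys q1 through q5 already sort in their order of appearance, the columns come out in the same order as the R result; that is not true for arbitrary keys.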

Unfortunately, I guess that to really match the expected output we'd have to guarantee the column order, since pivot sorts the columns. There are several open tickets on GitHub about how inconvenient the automatic sorting is, but we can do it manually. Switching back to the data with the English-text questions:

In [327]: d2 = df.assign(id=df.groupby("question").cumcount()).pivot(index="id", columns="question", values="answer")

In [328]: d2.reindex(df.question.drop_duplicates(), axis=1)
Out[328]: 
question question one question two question three question four question five
id                                                                           
0          answer one   answer two   answer three   answer four   answer five
1          answer one   answer two   answer three   answer four   answer five
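For an automated pipeline, the manual reordering could be wrapped in a small function. This is only a sketch: `spread_in_order` and its parameter names are hypothetical, not a pandas API, and it assumes the frame has no existing `id` column.

```python
import pandas as pd

def spread_in_order(df, key="question", value="answer"):
    """Hypothetical helper: pivot `value` wide by `key`, keeping first-seen key order."""
    first_seen = df[key].drop_duplicates().tolist()
    return (
        df.assign(id=df.groupby(key).cumcount())      # 0-based counter per key
          .pivot(index="id", columns=key, values=value)
          .reindex(columns=first_seen)                 # undo pivot's alphabetical sort
    )

names = ["one", "two", "three", "four", "five"]
df = pd.DataFrame({
    "question": [f"question {n}" for n in names] * 3,
    "answer": [f"answer {n}" for n in names] * 3,
})

wide = spread_in_order(df)
print(wide.columns.tolist())
```

Without the reindex step the columns would come back sorted alphabetically (five, four, one, three, two), which is the same ordering that made the ngroup ids in the question look scrambled.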

With datar, you can replicate it easily, much as you did in R:

>>> from datar.all import c, f, tibble, group_by, mutate, row_number, pivot_wider
>>> 
>>> question = c('q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5')
>>> answer = c('a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5')
>>> 
>>> df = tibble(question, answer)
>>> df
   question answer
0        q1     a1
1        q2     a2
2        q3     a3
3        q4     a4
4        q5     a5
5        q1     a1
6        q2     a2
7        q3     a3
8        q4     a4
9        q5     a5
10       q1     a1
11       q2     a2
12       q3     a3
13       q4     a4
14       q5     a5

>>> df2 = (df >>
...   group_by(f.question) >>
...   mutate(id = row_number()))
>>> 
>>> df2
   question answer  id
0        q1     a1   1
1        q2     a2   1
2        q3     a3   1
3        q4     a4   1
4        q5     a5   1
5        q1     a1   2
6        q2     a2   2
7        q3     a3   2
8        q4     a4   2
9        q5     a5   2
10       q1     a1   3
11       q2     a2   3
12       q3     a3   3
13       q4     a4   3
14       q5     a5   3
[Groups: ['question'] (n=5)]

>>> df2 = (df >>
...   group_by(f.question) >>
...   mutate(id = row_number()) >>
...   pivot_wider(names_from=f.question, values_from=f.answer))
>>> 
>>> df2
   id  q1  q2  q3  q4  q5
0   1  a1  a2  a3  a4  a5
1   2  a1  a2  a3  a4  a5
2   3  a1  a2  a3  a4  a5

I am the author of the package. Feel free to submit issues if you have any questions.

I know the question is about how to obtain a solution in Python; still, I will leave this R solution using data.table and reshape2.

library(data.table)
library(reshape2)
setDT(df)[,new := (1:.N), by = question]
dcast(df, new ~ question, value.var = "answer")

   new q1 q2 q3 q4 q5
1:   1 a1 a2 a3 a4 a5
2:   2 a1 a2 a3 a4 a5
3:   3 a1 a2 a3 a4 a5
