Pandas equivalent of R/dplyr group_by summarise concatenation

I have an operation I need to translate from dplyr (and stringr) in R to pandas in Python. It's quite simple in R, but I haven't been able to wrap my head around it in pandas. Basically, I need to group by one (or more) columns, then concatenate the remaining columns together and collapse them with a delimiter. R has the nicely vectorized str_c function that does exactly what I want.

Here's the R code:

library(tidyverse)
df <- as_tibble(structure(list(file = c(1, 1, 1, 2, 2, 2), marker = c("coi", "12s", "16s", "coi", "12s", "16s"), start = c(1, 22, 99, 12, 212, 199), end = c(15, 35, 102, 150, 350, 1102)), row.names = c(NA, -6L), class = "data.frame") )

df %>%
  group_by(file) %>%
  summarise(markers = str_c(marker,"[",start,":",end,"]",collapse="|"))
#> # A tibble: 2 × 2
#>    file markers                               
#>   <dbl> <chr>                                 
#> 1     1 coi[1:15]|12s[22:35]|16s[99:102]      
#> 2     2 coi[12:150]|12s[212:350]|16s[199:1102]

Here's the beginning of the Python code. I assume there's some trickery with agg or transform, but I'm not sure how to combine and join the multiple columns:

from io import StringIO
import pandas as pd

s = StringIO("""
file,marker,start,end
1.f,coi,1,15
1.f,12s,22,35
1.f,16s,99,102
2.f,coi,12,150
2.f,12s,212,350
2.f,16s,199,1102
""")

df = pd.read_csv(s)

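# ... now what? ...

# Cast every column to str so that + concatenates text, build the
# per-row label, then join the labels within each file group.
(df.astype(str)
   .assign(markers=lambda d: d.marker + "[" + d.start + ":" + d.end + "]")
   .groupby('file', as_index=False)
   .markers
   .agg("|".join)
)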
 
  file                                 markers
0  1.f        coi[1:15]|12s[22:35]|16s[99:102]
1  2.f  coi[12:150]|12s[212:350]|16s[199:1102]

The idea is to combine the columns first, before grouping, and then aggregate each group with Python's str.join method.
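The combine-then-join idea can also be wrapped in a small helper that accepts one or more grouping columns, as the question asks. This is only a sketch: the function name collapse_markers and its by and sep parameters are hypothetical names chosen for illustration, and it assumes the column names marker, start and end from the question's data.

```python
from io import StringIO

import pandas as pd

CSV = """file,marker,start,end
1.f,coi,1,15
1.f,12s,22,35
1.f,16s,99,102
2.f,coi,12,150
2.f,12s,212,350
2.f,16s,199,1102
"""


def collapse_markers(df, by, sep="|"):
    """Build 'marker[start:end]' per row, then join the labels per group.

    Hypothetical helper: `by` is a column name or list of column names,
    `sep` is the delimiter placed between labels within a group.
    """
    # Cast each piece to str so that + concatenates element-wise.
    labels = (
        df["marker"].astype(str)
        + "[" + df["start"].astype(str)
        + ":" + df["end"].astype(str) + "]"
    )
    return (
        df.assign(markers=labels)
          .groupby(by, as_index=False, sort=False)["markers"]
          .agg(sep.join)  # str.join collapses each group's labels
    )


df = pd.read_csv(StringIO(CSV))
result = collapse_markers(df, by=["file"])
print(result)
```

Passing a longer list to by groups on several columns at once; sort=False keeps the groups in order of first appearance, which is a choice made here rather than something the answer above specifies.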

Create a new column, markers, that concatenates marker with the last two columns joined by a colon (:) and wrapped in square brackets.

Group by file and concatenate the new markers column, separated by |.

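# Join start and end with ':' row by row, then wrap them in brackets after marker.
df['markers'] = df['marker'] + '[' + df.astype(str).iloc[:, 2:].agg(list, axis=1).str.join(':') + ']'
# Collapse the labels within each file into one '|'-separated string.
df.groupby('file')['markers'].apply(lambda x: x.str.cat(sep='|')).to_frame()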

                                 markers
file                                        
1.f         coi[1:15]|12s[22:35]|16s[99:102]
2.f   coi[12:150]|12s[212:350]|16s[199:1102]
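If you would rather not add an intermediate column to df, a variant closer in spirit to dplyr's summarise is to build the string inside a per-group function. The following is a sketch under the same assumptions about the column names; summarise_group is a hypothetical name, not part of pandas.

```python
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO("""file,marker,start,end
1.f,coi,1,15
1.f,12s,22,35
1.f,16s,99,102
2.f,coi,12,150
2.f,12s,212,350
2.f,16s,199,1102
"""))


def summarise_group(g):
    """Return one '|'-separated string for a single group's rows."""
    parts = (
        g["marker"]
        + "[" + g["start"].astype(str)
        + ":" + g["end"].astype(str) + "]"
    )
    return "|".join(parts)


out = (
    df.groupby("file")[["marker", "start", "end"]]  # select only the value columns
      .apply(summarise_group)                       # one string per group
      .rename("markers")
      .reset_index()
)
print(out)
```

Calling apply with a Python function once per group is usually slower than building the column first and aggregating, so this form is mostly a readability trade-off for small data.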

You can do it with datar, much as you would in R:

>>> from datar.all import f, tibble, group_by, summarise, paste0
>>> 
>>> df = tibble(
...     file=[1, 1, 1, 2, 2, 2],
...     marker=["coi", "12s", "16s"] * 2,
...     start=[1, 22, 99, 12, 212, 199],
...     end=[15, 35, 102, 1150, 350, 1102],
... )
>>> (
...     df
...     >> group_by(f.file)
...     >> summarise(
...         markers=paste0(
...             f.marker, "[", f.start, ":", f.end, "]",
...             collapse="|",
...         )
...     )
... )
     file                                  markers
  <int64>                                 <object>
0       1         coi[1:15]|12s[22:35]|16s[99:102]
1       2  coi[12:1150]|12s[212:350]|16s[199:1102]
