dplyr summarise with dynamic columns

I'm trying to use dplyr against my Postgres database and am running a simple summary. Everything works if I pass the column name directly, but I want to do this dynamically (i.e. loop through each column name taken from another data frame).

The problem I'm getting is that only the first two calculations give the right results.

Assume the first dynamic column is called "id":

pull_table %>%
    summarise(
        row_count = n(), 
        distinct_count = n_distinct(var) , 
        distinct_count_minus_blank = n_distinct(ifelse(var=="",NA,var)), 
        maxvalue = max(var), 
        minvalue = min(var), 
        maxlength = max(length(var)), 
        minlen = min(length(var))
    )  %>% 
    show_query()

The wrong result is obvious once you see the SQL: sometimes id has '' around it, so it is treated as a string literal rather than a column:

<SQL>
SELECT 
    COUNT(*) AS "row_count", 
    COUNT(DISTINCT id) AS "distinct_count", 
    COUNT(
        DISTINCT CASE 
            WHEN ('id' = '') THEN (NULL) 
            WHEN NOT('id' = '') THEN ('id') 
        END) AS "distinct_count_minus_blank", 
    MAX('id') AS "maxvalue", 
    MIN('id') AS "minvalue", 
    MAX(LENGTH('id')) AS "maxlength", 
    MIN(LENGTH('id')) AS "minlen"
FROM "table"

You can see from this output that sometimes the calculation happens on the column, but sometimes it happens on the string "id". Why is this, and how can I fix it so that it calculates on the actual column rather than the string?

I think you should look at rlang::sym (which is re-exported by dplyr).

Assuming pull_table is a data frame including id, some_numeric_variable and some_character_variable columns, you could write something like this:

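library(dplyr)

xx <- sym("id")
yy <- sym("some_numeric_variable")
ww <- sym("some_character_variable")
pull_table %>%
    summarise(
        row_count = n(), 
        distinct_count = n_distinct(!!xx), 
        # unquote the column in the comparison as well, not only in the result
        distinct_count_minus_blank = n_distinct(ifelse((!!xx) == "", NA, !!xx)), 
        maxvalue = max(!!yy), 
        minvalue = min(!!yy), 
        # on a local data frame length() is the number of rows; use nchar() for string length
        maxlength = max(length(!!ww)), 
        minlen = min(length(!!ww))
    )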

The sym() function turns a string into a name (a symbol), which can be unquoted inside dplyr functions with the !! operator. If you want more information, please take a look at the quasiquotation doc or this tutorial.
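To make the sym() / !! pattern concrete, here is a minimal sketch that loops over column names held as strings, which is what the question is after. It runs on a small local tibble rather than a Postgres table, so the data, the helper name summarise_column and the use of nchar() for string length are all assumptions made for illustration, not part of the original answer.

```r
library(dplyr)

# Hypothetical local stand-in for pull_table (the real one is a remote table).
pull_table <- tibble(
  id   = c("a", "bb", "", "bb"),
  name = c("x", "yyy", "zz", "x")
)

# Hypothetical helper: summarise one column whose name arrives as a string.
summarise_column <- function(tbl, column_name) {
  col <- sym(column_name)  # string -> symbol
  tbl %>%
    summarise(
      column         = column_name,
      row_count      = n(),
      distinct_count = n_distinct(!!col),
      # blanks become NA and are then ignored, like COUNT(DISTINCT ...) in SQL
      distinct_count_minus_blank = n_distinct(
        if_else((!!col) == "", NA_character_, !!col),
        na.rm = TRUE
      ),
      maxlength = max(nchar(!!col)),
      minlen    = min(nchar(!!col))
    )
}

# One summary row per column name.
results <- bind_rows(lapply(c("id", "name"), summarise_column, tbl = pull_table))
print(results)
```

The parentheses around (!!col) in the comparison are there only to make the operator precedence explicit.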

Unfortunately, since I didn't have any tbl_sql at hand, I couldn't test it with show_query.

Side advice: don't ever name your variables "var", as var is also the variance function. I have pulled my hair out many times just because this messed up some packages or custom functions.

I ended up solving it with dots

i.e.
pull_table %>%
select(var=(dots=column_i)) %>%
    summarise(
        row_count = n(), 
        distinct_count = n_distinct(var) , 
        distinct_count_minus_blank = n_distinct(ifelse(var=="",NA,var)), 
        maxvalue = max(var), 
        minvalue = min(var), 
        maxlength = max(length(var)), 
        minlen = min(length(var))
    )  %>% 
    show_query()
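The idea in this solution appears to be renaming the dynamic column to a fixed alias first, so that the summarise() call can always refer to the same name. A rough sketch of that idea on a local tibble follows; it uses all_of() for the selection and the alias value instead of var, both of which are assumptions of this sketch and not what the poster wrote, and the sample data is made up.

```r
library(dplyr)

# Hypothetical local stand-in for the remote pull_table.
pull_table <- tibble(id = c("a", "bb", "", "bb"))

column_i <- "id"  # column name supplied at run time

renamed_summary <- pull_table %>%
  # select the dynamic column and rename it to a fixed alias
  select(value = all_of(column_i)) %>%
  summarise(
    row_count      = n(),
    distinct_count = n_distinct(value),
    maxlength      = max(nchar(value)),
    minlen         = min(nchar(value))
  )

print(renamed_summary)
```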
