Converting continuous variables to categories using dplyrXdf

Question

I'm doing some initial exploration of a data set. I am analysing one-ways of continuous variables by converting them to factors and calculating frequencies by band.

I would like to do this with dplyrXdf, but it doesn't seem to work the same way as normal dplyr for what I'm attempting.

sample_data <- RxXdfData("./data/test_set.xdf") #sample xdf for testing
as_data_frame <- rxXdfToDataFrame(sample_data) #same data as dataframe

# Calculate freq by Buildings Sum Insured band

When I import my sample data as a data frame, the code below works:

buildings_ad_fr <- as_data_frame %>% 
  mutate(bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000,by = 5000000))) %>% 
  group_by(bd_cut) %>% 
  summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
            ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT)/sum(BENEFIT_EXPOSURE, na.rm = TRUE))

But I can't do the same thing using the xdf version of the data:

buildings_ad_fr_xdf <- sample_data %>% 
      mutate(bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000,by = 5000000))) %>% 
      group_by(bd_cut) %>% 
      summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
                ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT)/sum(BENEFIT_EXPOSURE, na.rm = TRUE))

A workaround I can think of would be to use rxDataStep to create the new column by passing bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000,by = 5000000)) in the transforms argument, but an intermediate step shouldn't be necessary.

I've also tried passing the .rxArgs argument to mutate before the group_by expression, but that doesn't seem to work either:

buildings_ad_fr <- sample_data %>% 
  mutate(sample_data,.rxArgs = list(transforms = list(bd_cut = cut(BD_INSURED_VALUE,
                                                                   seq(150000,
                                                                       10000000,
                                                                       5000000)))))%>%
  group_by(bd_cut) %>% 
    summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
            ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT)/sum(BENEFIT_EXPOSURE, na.rm = TRUE))

Both times, on the xdf file, it gives the error:

Error in summarise.RxFileData(., exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),: with xdf tbls only works with named variables, not expressions

I know this package can factorise variables, but I am not sure how to use it to split up a continuous variable.

Does anyone know how to do this?

Answer 1

The mutate should be fine. The summarise is different for Xdf files:

Internally, summarise runs rxCube or rxSummary by default, and these remove NAs automatically. You don't need na.rm=TRUE.
You can't summarise on an expression. The solution is to run the summarise on named variables and then compute the expression from the results:

xdf %>%
    group_by(*) %>%
    summarise(expos=sum(expos), pd=sum(clms)) %>%
    mutate(pd=pd/expos)
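
To make the pattern concrete, here is a minimal sketch that applies the same two steps with plain dplyr on an in-memory data frame. The toy data and the band edges are assumptions made up for illustration; only the column names come from the question. It is not run against an Xdf file, so treat it as a picture of the shape of the pipeline rather than a tested dplyrXdf call.

```r
library(dplyr)

# Hypothetical toy data standing in for the Xdf contents;
# only the column names are taken from the question.
toy <- data.frame(
  BD_INSURED_VALUE      = c(200000, 400000, 900000, 1200000, 1400000, 2500000),
  BENEFIT_EXPOSURE      = c(1.0, 0.5, 1.0, 0.5, 1.0, 1.0),
  ACT_AD_PD_CLAIM_COUNT = c(0, 1, 0, 1, 1, 0)
)

# Assumed band edges, chosen only so that the toy data spans three bands.
breaks <- seq(from = 0, to = 3000000, by = 1000000)

result <- toy %>%
  mutate(bd_cut = cut(BD_INSURED_VALUE, breaks)) %>%
  group_by(bd_cut) %>%
  # Step 1: summarise named variables only, with no expressions.
  summarise(exposure = sum(BENEFIT_EXPOSURE),
            claims   = sum(ACT_AD_PD_CLAIM_COUNT)) %>%
  # Step 2: derive the frequency from the summarised columns.
  mutate(ad_pd_f = claims / exposure)

print(result)
```

One thing worth checking in the original call: seq(from = 150000, to = 10000000, by = 5000000) yields only two break points (150000 and 5150000), so cut would produce a single band and return NA for values outside it.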

I've also just updated dplyrXdf to 0.10.0 beta, which adds support for HDFS/Spark and dplyr 0.7, along with several nifty utility functions. If you're not using it already, you might want to check it out. The formal release should happen when the next MRS version comes out.
