Converting continuous variable to categorical with dplyrXdf
I'm doing some initial exploration of a data set. I am analysing one-ways of continuous variables by converting them to factors and calculating frequencies by band.
I would like to do this with dplyrXdf, but it doesn't seem to behave the same way as normal dplyr for what I'm attempting.
sample_data <- RxXdfData("./data/test_set.xdf") #sample xdf for testing
as_data_frame <- rxXdfToDataFrame(sample_data) #same data as dataframe
# Calculate freq by Buildings Sum Insured band
With my sample data imported as a data frame, the code below works:
buildings_ad_fr <- as_data_frame %>%
  mutate(bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000, by = 5000000))) %>%
  group_by(bd_cut) %>%
  summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
            ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT) / sum(BENEFIT_EXPOSURE, na.rm = TRUE))
But I can't do the same thing using the xdf version of the data:
buildings_ad_fr_xdf <- sample_data %>%
  mutate(bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000, by = 5000000))) %>%
  group_by(bd_cut) %>%
  summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
            ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT) / sum(BENEFIT_EXPOSURE, na.rm = TRUE))
One workaround I can think of is to use rxDataStep to create the new column, passing bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000, by = 5000000)) in the transforms argument, but an intermediate step shouldn't be necessary.
I've also tried passing .rxArgs to mutate before the group_by expression, but that doesn't seem to work either:
buildings_ad_fr <- sample_data %>%
  mutate(sample_data, .rxArgs = list(transforms = list(bd_cut = cut(BD_INSURED_VALUE,
                                                                    seq(150000,
                                                                        10000000,
                                                                        5000000))))) %>%
  group_by(bd_cut) %>%
  summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
            ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT) / sum(BENEFIT_EXPOSURE, na.rm = TRUE))
Both attempts on the xdf file give the same error:
Error in summarise.RxFileData(., exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),: with xdf tbls only works with named variables, not expressions
I know this package can factorise variables, but I am not sure how to use it to split a continuous variable into bands.
Does anyone know how to do this?
The mutate should be fine. The summarise is different for Xdf files:
Internally, summarise will run rxCube or rxSummary by default, both of which automatically remove NAs. You don't need na.rm=TRUE.
You can't summarise on an expression. The solution is to run the summarise on named variables first and then compute the expression in a following mutate:
xdf %>%
  group_by(grp) %>%  # grp stands for your grouping variable(s)
  summarise(expos=sum(expos), pd=sum(clms)) %>%
  mutate(pd=pd/expos)
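To show that splitting the calculation this way does not change the result, here is a minimal sketch using plain dplyr on an in-memory data frame. The toy values and the breaks vector are made up for illustration; only the column names come from the question. It has not been run against an xdf file, where the second pipeline would be the one to use, without the na.rm arguments.

```r
library(dplyr)

# Toy data standing in for the xdf file (hypothetical values).
toy <- data.frame(
  BD_INSURED_VALUE      = c(200000, 400000, 900000, 1200000, 2500000, 3000000),
  BENEFIT_EXPOSURE      = c(1.0, 0.5, 1.0, 0.5, 1.0, 1.0),
  ACT_AD_PD_CLAIM_COUNT = c(0, 1, 0, 1, 2, 0)
)

# Hypothetical band edges; substitute your own.
breaks <- c(150000, 1000000, 2000000, 5000000)

# One step: the ratio is an expression inside summarise
# (fine for a data frame, rejected for an xdf tbl).
one_step <- toy %>%
  mutate(bd_cut = cut(BD_INSURED_VALUE, breaks)) %>%
  group_by(bd_cut) %>%
  summarise(exposure = sum(BENEFIT_EXPOSURE),
            ad_pd_f  = sum(ACT_AD_PD_CLAIM_COUNT) / sum(BENEFIT_EXPOSURE))

# Two steps: summarise plain sums of named variables, then derive the ratio.
two_step <- toy %>%
  mutate(bd_cut = cut(BD_INSURED_VALUE, breaks)) %>%
  group_by(bd_cut) %>%
  summarise(exposure = sum(BENEFIT_EXPOSURE),
            claims   = sum(ACT_AD_PD_CLAIM_COUNT)) %>%
  mutate(ad_pd_f = claims / exposure)

# The frequency column is the same either way.
print(all.equal(one_step$ad_pd_f, two_step$ad_pd_f))
```

The two-step form keeps every summarise argument a simple sum of a named variable, which is the shape the error message asks for.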
I've also just updated dplyrXdf to 0.10.0 beta, which adds support for HDFS/Spark and dplyr 0.7 along with several nifty utility functions. If you're not using it already, you might want to check it out. The formal release should happen when the next MRS version comes out.