R: checking pairs of rows in a data frame
I have a data frame holding information on options, like this:
> chData
myIdx strike_price date exdate cp_flag strike_price return
1 8355342 605000 1996-04-02 1996-05-18 P 605000 0.002340
2 8355433 605000 1996-04-02 1996-05-18 C 605000 0.002340
3 8356541 605000 1996-04-09 1996-05-18 P 605000 -0.003182
4 8356629 605000 1996-04-09 1996-05-18 C 605000 -0.003182
5 8358033 605000 1996-04-16 1996-05-18 P 605000 0.003907
6 8358119 605000 1996-04-16 1996-05-18 C 605000 0.003907
7 8359391 605000 1996-04-23 1996-05-18 P 605000 0.005695
where cp_flag means that a certain option is either a call (C) or a put (P). What is a way to make sure that, for each date, there is both a call and a put, and to drop the rows for dates where this is not the case?
I can do it with a for loop, but is there a more clever way?
Get the dates that have P's and those that have C's, and use intersect to find the dates that have both (here x is the data frame from the question).
keep_dates <- with(x, intersect(date[cp_flag=='P'], date[cp_flag=='C']) )
# "1996-04-02" "1996-04-09" "1996-04-16"
Keep only the rows that have dates appearing in keep_dates.
x[ x$date %in% keep_dates, ]
# myIdx strike_price date exdate cp_flag strike_price.1
# 8355342 605000 1996-04-02 1996-05-18 P 605000
# 8355433 605000 1996-04-02 1996-05-18 C 605000
# 8356541 605000 1996-04-09 1996-05-18 P 605000
# 8356629 605000 1996-04-09 1996-05-18 C 605000
# 8358033 605000 1996-04-16 1996-05-18 P 605000
# 8358119 605000 1996-04-16 1996-05-18 C 605000
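For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same intersect idea. The data frame below is a reduced copy of the question's sample (only myIdx, date and cp_flag are kept, which is my simplification), so treat it as an illustration rather than the original data.

```r
# Reduced copy of the question's sample data (assumption: only the
# columns needed for the check are kept).
x <- data.frame(
  myIdx   = c(8355342, 8355433, 8356541, 8356629, 8358033, 8358119, 8359391),
  date    = rep(c("1996-04-02", "1996-04-09", "1996-04-16", "1996-04-23"),
                times = c(2, 2, 2, 1)),
  cp_flag = c("P", "C", "P", "C", "P", "C", "P"),
  stringsAsFactors = FALSE
)

# Dates that have at least one put AND at least one call.
keep_dates <- with(x, intersect(date[cp_flag == "P"], date[cp_flag == "C"]))

# Keep only rows whose date is in that set; 1996-04-23 has a put but
# no call, so its row is dropped.
result <- x[x$date %in% keep_dates, ]
print(result)
```

Note that this checks for "at least one of each" per date; it does not verify that puts and calls are matched one to one.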
Using the plyr package:
> library(plyr)
> ddply(chData, "date", function(x) if(all(c("P","C") %in% x$cp_flag)) x)
myIdx strike_price date exdate cp_flag strike_price.1 return
1 8355342 605000 1996-04-02 1996-05-18 P 605000 0.002340
2 8355433 605000 1996-04-02 1996-05-18 C 605000 0.002340
3 8356541 605000 1996-04-09 1996-05-18 P 605000 -0.003182
4 8356629 605000 1996-04-09 1996-05-18 C 605000 -0.003182
5 8358033 605000 1996-04-16 1996-05-18 P 605000 0.003907
6 8358119 605000 1996-04-16 1996-05-18 C 605000 0.003907
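If plyr is not available, the same per-date test can be sketched in base R with ave, which broadcasts a group-level result back to every row. This is a swapped-in technique, not the ddply call above, and the data frame is again a reduced copy of the question's sample (my assumption).

```r
# Reduced copy of the question's sample data (assumption: only the
# columns needed for the check are kept).
chData <- data.frame(
  myIdx   = c(8355342, 8355433, 8356541, 8356629, 8358033, 8358119, 8359391),
  date    = rep(c("1996-04-02", "1996-04-09", "1996-04-16", "1996-04-23"),
                times = c(2, 2, 2, 1)),
  cp_flag = c("P", "C", "P", "C", "P", "C", "P"),
  stringsAsFactors = FALSE
)

# For every row: does its date group contain any put / any call?
# ave() applies FUN within each date and recycles the single
# TRUE/FALSE result over all rows of that date.
has_put  <- ave(chData$cp_flag == "P", chData$date, FUN = any)
has_call <- ave(chData$cp_flag == "C", chData$date, FUN = any)

# Keep rows whose date has both kinds of option.
result <- chData[has_put & has_call, ]
print(result)
```

The logical input to ave is deliberate: passing the character cp_flag column directly would coerce the TRUE/FALSE results to strings.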
Here's a reshape approach.
library(reshape)
#Add a dummy value
df$value <- 1
#myIdx is unique per row, so it is left out of the formula; with it on the left-hand side every row would get an NA under either C or P
check <- cast(df, strike_price + date + exdate + return ~ cp_flag)
#take stock of what just happened
summary(check)
#use only complete cases. If you have NAs elsewhere, this will knock out those obs too
check <- check[complete.cases(check),]
#back to long form (myIdx is not carried through)
df.clean <- melt(check, id = 1:4)
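The widen-then-check idea can also be sketched without reshape, using base R's table to count rows per date and flag. This is a stand-in for the cast step, not the reshape API itself, and the data frame is a reduced copy of the question's sample (my assumption).

```r
# Reduced copy of the question's sample data (assumption: only the
# columns needed for the check are kept).
df <- data.frame(
  myIdx   = c(8355342, 8355433, 8356541, 8356629, 8358033, 8358119, 8359391),
  date    = rep(c("1996-04-02", "1996-04-09", "1996-04-16", "1996-04-23"),
                times = c(2, 2, 2, 1)),
  cp_flag = c("P", "C", "P", "C", "P", "C", "P"),
  stringsAsFactors = FALSE
)

# Wide view: one row per date, one column per flag, cells are row counts.
counts <- table(df$date, df$cp_flag)
print(counts)

# A date is complete when both the "C" and the "P" count are positive.
complete_dates <- rownames(counts)[counts[, "C"] > 0 & counts[, "P"] > 0]

# Filter the original rows, so no column (including myIdx) is lost.
df.clean <- df[df$date %in% complete_dates, ]
print(df.clean)
```

Because the filter is applied to the original rows, there is no need to melt anything back afterwards.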
Here's one way using split and lapply:
> tmp <- lapply(split(d, list(d$date)), function(x) if(all(c('P', 'C') %in% x[, 5])) x)
> do.call(rbind, tmp)
myIdx strike_price date exdate cp_flag strike_price return
1996-04-02.1 8355342 605000 1996-04-02 1996-05-18 P 605000 0.002340
1996-04-02.2 8355433 605000 1996-04-02 1996-05-18 C 605000 0.002340
1996-04-09.3 8356541 605000 1996-04-09 1996-05-18 P 605000 -0.003182
1996-04-09.4 8356629 605000 1996-04-09 1996-05-18 C 605000 -0.003182
1996-04-16.5 8358033 605000 1996-04-16 1996-05-18 P 605000 0.003907
1996-04-16.6 8358119 605000 1996-04-16 1996-05-18 C 605000 0.003907
Edit: Here's the full version implied by my last answer. I tend to think in base functions rather than plyr or reshape... but these answers seem good too.
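A slightly tidier variant of the split idea, as a sketch: Filter keeps only the groups that pass the test, and the flag column is referenced by name rather than by position. The helper name has_both is mine, and the data frame is a reduced copy of the question's sample (my assumption).

```r
# Reduced copy of the question's sample data (assumption: only the
# columns needed for the check are kept).
d <- data.frame(
  myIdx   = c(8355342, 8355433, 8356541, 8356629, 8358033, 8358119, 8359391),
  date    = rep(c("1996-04-02", "1996-04-09", "1996-04-16", "1996-04-23"),
                times = c(2, 2, 2, 1)),
  cp_flag = c("P", "C", "P", "C", "P", "C", "P"),
  stringsAsFactors = FALSE
)

# One data frame per date.
groups <- split(d, d$date)

# Hypothetical helper: TRUE when a group holds both a put and a call.
has_both <- function(g) all(c("P", "C") %in% g$cp_flag)

# Drop the groups that fail the test, then stack the rest back together.
kept <- Filter(has_both, groups)
result <- do.call(rbind, kept)
rownames(result) <- NULL  # discard the "date.row" names rbind generates
print(result)
```

Referring to g$cp_flag instead of x[, 5] keeps the check working if the column order changes.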