How to copy specific values from one data column to another while matching other columns in R?
I've searched a number of places (Stack Overflow, R-bloggers, etc.) but haven't found a good option for doing this in R. Hopefully someone has some ideas.
I have a set of environmental sampling data. The data includes a variety of fields (visit date, region, location, sample medium, sample component, result, etc.).
Here's a subset of the pertinent fields. This is where I start:
visit_date region location media component result
1990-08-20 LAKE 555723 water Mg *Nondetect
1999-07-01 HILL 432422 water Ca 3.2
2010-09-12 LAKE 555723 water pH 6.8
2010-09-12 LAKE 555723 water Mg 2.1
2010-09-12 HILL 432423 water pH 7.2
2010-09-12 HILL 432423 water N 0.8
2010-09-12 HILL 432423 water NH4 112
What I hope to end up with is a table/data frame like this:
visit_date region location media component result pH
1990-08-20 LAKE 555723 water Mg *Nondetect *Not recorded
1999-07-01 HILL 432422 water Ca 3.2 *Not recorded
2010-09-12 LAKE 555723 water pH 6.8 6.8
2010-09-12 LAKE 555723 water Mg 2.1 6.8
2010-09-12 HILL 432423 water pH 7.2 7.2
2010-09-12 HILL 432423 water N 0.8 7.2
2010-09-12 HILL 432423 water NH4 112 7.2
I attempted to use the method here -- R finding rows of a data frame where certain columns match those of another -- but unfortunately didn't get the result I wanted. Instead, the pH column held either my pre-populated value -999 or NA, and not the pH value for that particular visit date where one was collected. Since the result data set is around 500k records, I'm using unique(tResults$pH) to check which values the pH column contains.
Here's that attempt. res is the original result data.frame and component is the pH result subset (the pH sample results from the main results table).
keys <- c("region", "location", "visit_date", "media")
tResults <- data.table(res, key=keys)
tComponent <- data.table(component, key=keys)
tResults[tComponent, pH>0]
I've attempted using match, merge, and within on the original data frame without success. Since then I've generated a subset for the components (pH in this example) in which I copied the result column to a new "pH" column, thinking I could match the keys and update a new "pH" column in the main result set.
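One way the "match the keys and update" idea could look is a data.table update join. This is only a sketch under assumptions: the sample rows are copied from the tables above, dates are kept as character for brevity rather than POSIXct, and the name ph for the subset is made up for illustration.

```r
library(data.table)

# Sample rows copied from the question; dates kept as character for brevity
res <- data.table(
  visit_date = c("1990-08-20", "1999-07-01", "2010-09-12", "2010-09-12",
                 "2010-09-12", "2010-09-12", "2010-09-12"),
  region     = c("LAKE", "HILL", "LAKE", "LAKE", "HILL", "HILL", "HILL"),
  location   = c("555723", "432422", "555723", "555723",
                 "432423", "432423", "432423"),
  media      = "water",
  component  = c("Mg", "Ca", "pH", "Mg", "pH", "N", "NH4"),
  result     = c("*Nondetect", "3.2", "6.8", "2.1", "7.2", "0.8", "112")
)

keys <- c("region", "location", "visit_date", "media")

# The pH subset: one row per sample, with the result renamed to pH
ph <- res[component == "pH", c(keys, "result"), with = FALSE]
setnames(ph, "result", "pH")

# Fill in a default first, then overwrite it wherever the keys match a row of ph
res[, pH := "*Not recorded"]
res[ph, on = keys, pH := i.pH]
```

Rows without a pH reading for the same region, location, visit date and media keep the default. If a sample had more than one pH row, the join would need a rule for which reading wins, so de-duplicating ph first seems safer.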
Since not all result values are numeric (there are values like *Not recorded), I attempted to substitute numeric placeholders such as -888 so that I could force at least the result and pH columns to be numeric. Aside from the dates, which are POSIXct values, the remaining columns are character columns. The original data frame was created using stringsAsFactors=FALSE.
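As an alternative to placeholder numbers like -888, the text codes could be turned into NA and kept in a separate column. A minimal sketch, assuming the only non-numeric entries are codes such as *Nondetect and *Not recorded; the names result_num and result_flag are made up for illustration.

```r
result <- c("*Nondetect", "3.2", "6.8", "*Not recorded", "112")

# as.numeric() gives NA (with a warning) for strings that are not numbers
result_num <- suppressWarnings(as.numeric(result))

# Keep the original text of the non-numeric entries as a flag
result_flag <- ifelse(is.na(result_num), result, NA_character_)
```

With NA instead of a sentinel value, functions like mean(result_num, na.rm = TRUE) skip the missing entries rather than averaging in -888.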
Once I can do this, I'll be able to generate similar columns for other components, which can then be used to populate and calculate other values for a given sample. At least that's my goal.
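For several components at once, one possible route is to reshape the wanted components to wide form and merge them back onto every row. This is a sketch with hypothetical data: a single location column stands in for the four key columns, and it assumes at most one reading per component per sample.

```r
library(data.table)

# Hypothetical rows; "location" stands in for the full set of key columns
res <- data.table(
  location  = c("A", "A", "A", "B", "B"),
  component = c("pH", "Mg", "Ca", "pH", "Mg"),
  result    = c("6.8", "2.1", "3.2", "7.2", "1.5")
)

wanted <- c("pH", "Mg")

# One row per sample, one column per wanted component
wide <- dcast(res[component %in% wanted], location ~ component,
              value.var = "result")

# Attach those columns to every row of the original table
out <- merge(res, wide, by = "location", all.x = TRUE, sort = FALSE)
```

Samples that lack one of the wanted components get NA in that column.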
So I'm stumped on this one. In my mind it should be easy, but I'm certainly NOT seeing it!
Your help and ideas are certainly welcome and appreciated!
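# df1 is your first data set (a data.frame)
library(data.table)
library(zoo)  # for na.locf() (last observation carried forward)
# keep the result only on pH rows, NA everywhere else
df1$phtem <- with(df1, ifelse(component == "pH", result, NA))
# fill each NA with the most recent non-NA value above it
setDT(df1)[, pH := na.locf(phtem, na.rm = FALSE)]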
visit_date region location media component result phtem pH
1: 1990-08-20 LAKE 555723 water Mg *Nondetect NA NA
2: 1999-07-01 HILL 432422 water Ca 3.2 NA NA
3: 2010-09-12 LAKE 555723 water pH 6.8 6.8 6.8
4: 2010-09-12 LAKE 555723 water Mg 2.1 NA 6.8
5: 2010-09-12 HILL 432423 water pH 7.2 7.2 7.2
6: 2010-09-12 HILL 432423 water N 0.8 NA 7.2
7: 2010-09-12 HILL 432423 water NH4 112 NA 7.2
# You can delete phtem if you don't need it.
Edit:
library(data.table)
setDT(df1)[,pH:=result[component=="pH"],by="region,location,visit_date,media"]
df1
visit_date region location media component result pH
1: 1990-08-20 LAKE 555723 water Mg *Nondetect NA
2: 1999-07-01 HILL 432422 water Ca 3.2 NA
3: 2010-09-12 LAKE 555723 water pH 6.8 6.8
4: 2010-09-12 LAKE 555723 water Mg 2.1 6.8
5: 2010-09-12 HILL 432423 water pH 7.2 7.2
6: 2010-09-12 HILL 432423 water N 0.8 7.2
7: 2010-09-12 HILL 432423 water NH4 112 7.2
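A small variation on the grouped assignment, in case a sample can have more than one pH row: appending [1L] makes the right-hand side exactly one value per group. The rows below are hypothetical, and a single location column stands in for the four grouping columns.

```r
library(data.table)

# Hypothetical rows: sample "A" has two pH readings, sample "B" has none
df1 <- data.table(
  location  = c("A", "A", "A", "B"),
  component = c("pH", "pH", "Mg", "Ca"),
  result    = c("6.8", "6.9", "2.1", "3.2")
)

# [1L] yields the first pH reading in the group, or NA if there is no pH row
df1[, pH := result[component == "pH"][1L], by = location]
```

Taking the first reading is an arbitrary choice; if duplicates are real, a different rule (for example the mean, or the latest reading) may be more appropriate.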