[英]How can I compare and sort two values within a data frame in R?
This is my first post so please go easy on me ;D
For some research that I am involved in, we have generated two area measurements for a spinal cord section. The smaller measurement refers to a cavity formed by injury, and the larger area is the entire spinal cord. These measurements were made in Photoshop and exported with the same document name, but clearly different values. For example,
$`T7-B9_TileScan_005_Merging001_ch00.tif`
Label Document Area
1827 Measurement 39 T7-B9_TileScan_005_Merging001_ch00.tif 92,041.52
1831 Measurement 40 T7-B9_TileScan_005_Merging001_ch00.tif 3,952,865.00
This is actually a simplified version that I have created using the subset function of R to remove data. The reason I have to do this is because the range of scar areas overlaps the range of total cord areas, meaning they can't be filtered with a simple size exclusion.
My example data set can be found here. To generate this, please follow my [EDITED] work here.
Scar.Ablation.Data <- read.csv("/Scar Ablation Data.csv", stringsAsFactors=F)
Adding stringsAsFactors=F corrected an error generated later on.
test1 <- subset(Scar.Ablation.Data, Count != "", select = c(Label,Document,Area))
Removes all data that has no Count value. When Photoshop exported the data, it did so with redundant measurements. However, all of these redundant measurements contained no Count value, and thus they can be removed this way. The proposed alternative method did not work, as R did not read the empty values in the Count column in as NA.
fileList = split(test1,test1$Document)
Generates a list where measurements are separated by Document name.
spineAreas = lapply(fileList, function(x) x[which(x$Area==max(x$Area)), ])
Takes each list (representing all the data for a given file name) and then finds and returns the data in the row with the largest area for each file.
scarAreas = lapply(fileList, function(x) x[which(x$Area==min(x$Area)), ])
We want the data from all rows whose area is less than the largest area, for each file. lapply returns a list, so now we want to turn them back into data frames.
spineData = do.call(rbind,spineAreas)
scarData = do.call(rbind,scarAreas)
row.names(spineData)=NULL #clean up row names
row.names(scarData)=NULL
write.csv(scarData, "/scarData.csv")
write.csv(spineData, "/spineData.csv")
When comparing my exports, the following problems arose:
This was resolved by switching x$Area<max to x$Area==min in the scarAreas function. The output, while still incorrect, did not change with this modification.
I tried a different method of comparison using the aggregate() function, but this returned data that was exactly the same as the data generated with the above method. However R is calculating these comparisons, it evidently believes it is making the correct decision. This may indicate that there is some sort of formatting or import problem with my numerical Area values.
spineAreas2 = aggregate(Area ~ Document, data = test1, max)
scarAreas2 = aggregate(Area ~ Document, data = test1, min)
spineData2 = do.call(rbind,spineAreas2)
scarData2 = do.call(rbind,scarAreas2)
row.names(spineData2)=NULL #clean up row names
row.names(scarData2)=NULL #clean up row names
do.call(rbind, lapply(spineAreas, data.frame, stringsAsFactors=FALSE))
do.call(rbind, lapply(scarAreas, data.frame, stringsAsFactors=FALSE))
#Then clean up row names as in first example, or pass row.names=F
#when writing to a .csv file
write.csv(scarData2, "/scarData2.csv")
write.csv(spineData2, "/spineData2.csv")
I am fine with swapping Null for 0 or NA, and I may try to do this in order to solve this problem. Thank you @Cole for your continued help through this problem, it is greatly appreciated.
Ok, so if I understand you correctly, you want to a) clean the data (which you have already done), then b) divide the data by file name (also already done), then finally c) compare area measurements within each file: the smaller ones are the scars, the largest one is the spinal column. You want to sort each one into an individual list, one for scar data, the other for spinal column data (the problem).
To do this we are going to use the lapply function. It takes each element of a matrix, array, or data frame and applies a function to it. Here we write our own function. It takes each list (representing all the data for a given file name) and then finds and returns the data in the row with the largest area for each file.
spineAreas = lapply(fileList, function(x) x[which(x$Area==max(x$Area)), ])
Next we do the same thing, but this time we want the smaller areas for the scars. Thus we want the data from all rows whose area is less than the largest area, for each file. This approach assumes that the largest area for each file is the spinal cord cross-section, and all other areas represent scars.
scarAreas = lapply(fileList, function(x) x[which(x$Area<max(x$Area)), ])
lapply returns a list, so now we want to turn them back into data frames.
spineData = do.call(rbind,spineAreas)
scarData = do.call(rbind,scarAreas)
#clean up row names
row.names(spineData)=NULL
row.names(scarData)=NULL
The above approach will turn each string into a factor in your data frame. If you don't want them as factors (they can occasionally cause problems as they don't play nicely with some functions), then you can do the following.
do.call(rbind, lapply(spineAreas, data.frame, stringsAsFactors=FALSE))
do.call(rbind, lapply(scarAreas, data.frame, stringsAsFactors=FALSE))
#Then clean up row names as in first example, or pass row.names=F
#when writing to a .csv file
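As a quick illustration of the difference (a toy data frame, not your real data; note that since R 4.0 the default is stringsAsFactors = FALSE, so this mainly matters on older R versions):

```r
# With stringsAsFactors = TRUE (the pre-R-4.0 default), character
# columns become factors; with FALSE they stay plain character vectors.
df1 <- data.frame(Document = c("a.tif", "b.tif"), Area = c(10, 20),
                  stringsAsFactors = TRUE)
df2 <- data.frame(Document = c("a.tif", "b.tif"), Area = c(10, 20),
                  stringsAsFactors = FALSE)
class(df1$Document)  # "factor"
class(df2$Document)  # "character"
```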
Let me know if this is what you were trying to accomplish.
Now that I have a sample data set, I can see a few problems.
The first problem is that you do not have a .csv file. csv stands for comma separated values, and as you can see, your file does not contain commas between values. It looks like it is a tsv, or tab separated values, file. In R, you want to read this in using the read.delim() function as follows:

ablationData = read.delim("Scar Ablation Data.txt",stringsAsFactors=F)

(You may also want to consider naming your data with a .tsv extension if it is indeed tab separated.)
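A minimal sketch of the difference, using an inline string instead of a file (the values are invented):

```r
# read.delim() splits on tabs; read.csv() would try to split on the
# commas embedded in the numbers and mis-parse rows like this one.
tsv <- "Label\tArea\nMeasurement 39\t92,041.52"
d <- read.delim(textConnection(tsv), stringsAsFactors = FALSE)
ncol(d)   # 2
d$Area    # "92,041.52" -- read as character because of the comma
```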
After reading in the data it is apparent that your "Null" values are character strings, distinct from the NULL object in R (notice all caps). Using x=="Null" is the correct way to test for these (as you were doing before). Missing Count data are represented by "" values. I'm guessing this has to do with there being no values present in the .tsv file, which are read in as "" since there is nothing between the tabs. Note that if you were to use a different file format, such as .csv, the "" would be read in as NA instead. This comes down to how the R read.xxx functions handle different file types, and is a good thing to keep in mind for the future.

The Count column represents the number of 'features' per measurement. It appears that each measurement has a Measurement # row that is an aggregate summary of that measurement. Then each feature of the measurement has its own row, represented by Measurement #-Feature #. Based on your description of the problem, you want to remove the individual 'feature' measurements and compare only the aggregate values for each measurement set. I say remove the feature rows because they are certainly NOT the duplicate/redundant values you described above.

There are "" or "Null" values in many of our columns that otherwise contain numeric input. This will cause all of the values in those columns to be cast as character type instead of numeric. This is why the sorting from before was not working: max() works very differently on characters as opposed to numerics. After removing the offending "" and "Null" values we will have to cast our desired columns to numeric data types. Finally, your numbers contain both , and . characters; R will not know how to interpret the ,'s, so we will need to remove them.

In Summary:

- Read in the data (as a .tsv)
- Remove "Null" values (see note below)
- Remove individual feature measurements, keeping only the aggregate data for each measurement set
- Remove , from numbers
- Cast the numeric columns to type numeric
- Find the aggregate measurement with the largest Area for each file. This represents the spinal column; the smaller measurement values represent scars.

A Question: Are you sure you want to separate based on file and then compare only aggregate measurements, or do you really want to separate based on measurement and then compare each feature within that measurement?
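The summary above can be sketched end-to-end on an invented two-file data set (column names follow the export; all values are made up):

```r
# Toy data mimicking the Photoshop export: one aggregate row per
# measurement (non-empty Count) plus a feature row (empty Count).
raw <- data.frame(
  Label    = c("Measurement 39", "Measurement 39-Feature 1",
               "Measurement 40", "Measurement 41", "Measurement 42"),
  Document = c("a.tif", "a.tif", "a.tif", "b.tif", "b.tif"),
  Count    = c("1", "", "1", "1", "1"),
  Area     = c("92,041.52", "12.3", "3,952,865.00", "1,000.00", "2,000.00"),
  stringsAsFactors = FALSE
)

# 1) keep only aggregate rows (those with a Count value)
d <- raw[raw$Count != "", c("Label", "Document", "Area")]
# 2) strip thousands separators and cast Area to numeric
d$Area <- as.numeric(gsub(",", "", d$Area))
# 3) split by file, then take the max (cord) vs the rest (scars)
byFile <- split(d, d$Document)
spine  <- do.call(rbind, lapply(byFile, function(x) x[x$Area == max(x$Area), ]))
scars  <- do.call(rbind, lapply(byFile, function(x) x[x$Area <  max(x$Area), ]))
spine$Area  # 3952865 2000
scars$Area  # 92041.52 1000
```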
The spineData should have been the only list to contain "Null" values. This is because the max() and min() of a data set consisting entirely of "Null" is simply "Null". Thus == max(data) will be true for each "Null" data point (ie. "Null"=="Null"), but < max(data) will be false for each "Null" data point (ie. "Null" < "Null"). I really don't think you want to use ==min(data), because then you are going to throw out all intermediate values (presumably valid scar measurements) for each file where you have non-"Null" data.
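These comparisons are easy to verify at the console:

```r
# String comparison in R: equality matches "Null", but "less than"
# never holds between identical strings.
x <- c("Null", "Null")
max(x)          # "Null"
x[x == max(x)]  # keeps both "Null" rows
x[x <  max(x)]  # keeps nothing: character(0)
```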
If you really want to keep the "Null" reads in your data set, I would recommend pulling them out, processing the rest of the data, and then adding them back in at the end.
Read in the data.
data = read.delim("Scar Ablation Data.tsv",stringsAsFactors=F)
Separate out the "Null" measurements.
data2 = data[-which(data$Area=="Null"),]
Remove all feature measurements, keeping only the aggregate data for each measurement. Keep only the Label, Document, and Area columns.
data2 = data2[-which(data2$Count==""),c("Label","Document","Area")]
For the desired columns containing numeric data, remove , from the numbers and cast to type numeric.
data2$Area = as.numeric(gsub(",","",data2$Area))
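On a single value from the export this step looks like the following (the plain as.numeric() call is shown only to illustrate the failure mode):

```r
# gsub() drops the thousands separators so the cast succeeds;
# without it, as.numeric() cannot parse the commas and returns NA.
as.numeric(gsub(",", "", "3,952,865.00"))     # 3952865
suppressWarnings(as.numeric("3,952,865.00"))  # NA
```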
Separate the data by file/Document name.
fileList = split(data2,data2$Document)
Process each file. The largest Area value represents the spinal column; all other (smaller) values represent scars. Each of these statements returns a list with our desired results.
spineAreas = lapply(fileList, function(x) x[which(x$Area==max(x$Area)), ])
scarAreas = lapply(fileList, function(x) x[which(x$Area<max(x$Area)), ])
Convert back to data frames. Here I have added an extra step to avoid our data being converted to factors.
spineAreas = do.call(rbind, lapply(spineAreas, data.frame, stringsAsFactors=FALSE))
scarAreas = do.call(rbind, lapply(scarAreas, data.frame, stringsAsFactors=FALSE))
Add the files with "Null" reads back in and clean up the row names. Do this only when you are completely done analyzing the data.
nullDocs = match(unique(data$Document[data$Label=="Null"]),data$Document)
nullDocs = data.frame(data[nullDocs,c("Label","Document","Area")],stringsAsFactors=F)
scarAreas = rbind(nullDocs,scarAreas)
spineAreas = rbind(nullDocs,spineAreas)
row.names(scarAreas)=NULL
row.names(spineAreas)=NULL
By adding the "Null" values back in, our Area column will be forced back to the character type, since each element in a column must be of the same data type. This is important because it means that you cannot really do any meaningful operations in R on your data.
For example: spineAreas$Area>scarAreas$Area will return
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
[23] TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
Which might lead us to believe that we did not sort our data correctly.
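The misleading result comes from lexicographic (character) comparison; the same two numbers compare differently as strings and as numerics:

```r
# Character comparison looks at "9" vs "3" first, so the shorter
# string wins; numeric comparison gives the intended answer.
"92041.52" > "3952865"   # TRUE  (lexicographic)
92041.52  >  3952865     # FALSE (numeric)
```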
However: as.numeric(spineAreas$Area)>as.numeric(scarAreas$Area) will return
[1] NA NA NA TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[28] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
This indicates that the first 3 values were strings (in this case "Null") which were replaced by NA, and then shows that our data is correctly sorted.
So either add the "Null" values back in only when you are completely done with your data analysis, or recast your desired columns to numerics (eg. spineAreas$Area = as.numeric(spineAreas$Area)).
Alternatively, read in your data so that all "" and "Null" are represented by NA. This will make life a lot easier, but will not save you from having to remove the , and cast your data to numeric.
Here are the lines you would need to change:
data = read.delim("Scar Ablation Data.tsv",na.strings=c("NA","Null",""),stringsAsFactors=F)
data2 = data[-which(is.na(data$Area)),]
data2 = data2[-which(is.na(data2$Count)),c("Label","Document","Area")]
nullDocs = match(unique(data$Document[is.na(data$Label)]),data$Document)
This will keep your data as numeric even after adding back the null reads, and is probably the preferred way to do things.
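A sketch of the na.strings behaviour on inline toy data (values invented):

```r
# Both the literal string "Null" and the empty field between tabs
# are converted to NA at read time by na.strings.
txt <- "Label\tCount\tArea\nMeasurement 1\t1\t92,041.52\nNull\t\tNull"
d <- read.delim(textConnection(txt),
                na.strings = c("NA", "Null", ""),
                stringsAsFactors = FALSE)
is.na(d$Label[2])  # TRUE: "Null" came in as NA
is.na(d$Count[2])  # TRUE: the empty field came in as NA
```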