
How can I compare and sort two values within a data frame in R?

This is my first post so please go easy on me ;D

For some research that I am involved in, we have generated two area measurements for a spinal cord section. The smaller measurement refers to a cavity formed by injury, and the larger area is the entire spinal cord. These measurements were made in Photoshop and exported with the same document name, but clearly different values. For example,

$`T7-B9_TileScan_005_Merging001_ch00.tif`
              Label                               Document         Area
1827 Measurement 39 T7-B9_TileScan_005_Merging001_ch00.tif    92,041.52
1831 Measurement 40 T7-B9_TileScan_005_Merging001_ch00.tif 3,952,865.00

This is actually a simplified version that I have created using the subset function of R to remove data. The reason I have to do this is because the range of scar areas overlaps the range of total cord areas, meaning they can't be filtered with a simple size exclusion.

My example data set can be found here. To generate this, please follow my [EDITED] work here.

Scar.Ablation.Data <- read.csv("/Scar Ablation Data.csv", stringsAsFactors=F)

Adding stringsAsFactors=F corrected an error generated later on.

test1 <- subset(Scar.Ablation.Data, Count != "", select = c(Label,Document,Area))

Removes all data that has no Count value. When Photoshop exported the data, it did so with redundant measurements. However, all of these redundant measurements contained no Count value, and thus they can be removed this way. The proposed alternative method (filtering on NA) did not work, because R did not read the empty Count cells in as NA.
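
For reference, a minimal sketch of what that NA-based alternative could look like, assuming the same file and column names: passing na.strings to read.csv makes the blank Count cells import as NA, so they can be filtered with is.na().

Scar.Ablation.Data <- read.csv("/Scar Ablation Data.csv",
                               na.strings = c("NA", ""),   # treat blank cells as missing
                               stringsAsFactors = FALSE)
test1 <- subset(Scar.Ablation.Data, !is.na(Count), select = c(Label, Document, Area))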

fileList = split(test1,test1$Document)

Generates a list where measurements are separated by Document name.
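
As an optional sanity check on the split (a small sketch using the same fileList object), you can confirm there is one list element per document:

length(fileList)         # one element per unique Document name
head(names(fileList))    # the Document names used as element names
sapply(fileList, nrow)   # how many measurement rows each file contributed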

spineAreas = lapply(fileList, function(x) x[which(x$Area==max(x$Area)), ])

Takes each list (representing all the data for a given file name) and then finds and returns the data in the row with the largest area for each file.

scarAreas = lapply(fileList, function(x) x[which(x$Area==min(x$Area)), ])

We want the data from all rows whose area is less than the largest area, for each file. lapply returns a list, so now we want to turn them back into data frames.

spineData = do.call(rbind,spineAreas)
scarData = do.call(rbind,scarAreas)
row.names(spineData)=NULL #clean up row names
row.names(scarData)=NULL
write.csv(scarData, "/scarData.csv")
write.csv(spineData, "/spineData.csv")

When comparing my exports, the following problems arose:

  1. spineData contained Null values, but scarData did not.

This was resolved by switching x$Area<max to x$Area==min in the scarAreas function. The output, while still incorrect, did not change with this modification.

  2. The comparison between Areas does not always work. For example, for sample "C1-B3_TileScan_002_Merging001_ch00.tif", the scar reported a higher area than the cord.

I tried a different method of comparison using the aggregate() function, but this returned data that was exactly the same as the data generated with the above method. However R is calculating these comparisons internally, it believes it is making the correct decision. This may indicate that there is some sort of formatting or import problem with my numerical Area values.

spineAreas2 = aggregate(Area ~ Document, data = test1, max)
scarAreas2 = aggregate(Area ~ Document, data = test1, min)

spineData2 = do.call(rbind,spineAreas2)
scarData2 = do.call(rbind,scarAreas2)

row.names(spineData2)=NULL #clean up row names
row.names(scarData2)=NULL #clean up row names

do.call(rbind, lapply(spineAreas, data.frame, stringsAsFactors=FALSE))
do.call(rbind, lapply(scarAreas, data.frame, stringsAsFactors=FALSE))
#Then clean up row names as in first example, or pass row.names=F 
#when writing to a .csv file

write.csv(scarData2, "C/scarData2.csv")
write.csv(spineData2, "CspineData2.csv")
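
A quick way to check whether the Area values were imported as text, which would explain the inconsistent comparisons (a hedged diagnostic using the test1 object from above):

class(test1$Area)              # "character" here would mean Area was imported as text, not numbers
"92,041.52" > "3,952,865.00"   # TRUE for character strings, even though it is numerically false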

I am fine with swapping Null for 0 or NA, and I may try to do this in order to solve this problem. Thank you @Cole for your continued help through this problem, it is greatly appreciated.

Ok, so if I understand you correctly, you want to a) clean the data (which you have already done), then b) divide the data by file name (also already done), and finally c) compare area measurements within each file: the smaller ones are the scars, the largest one is the spinal column. You want to sort each one into an individual list, one for scar data, the other for spinal column data (the problem).

To do this we are going to use the lapply function. It takes each element of a list and applies a function to it. Here we write our own function. It takes each list element (representing all the data for a given file name) and then finds and returns the data in the row with the largest area for each file.

spineAreas = lapply(fileList, function(x) x[which(x$Area==max(x$Area)), ])

Next we do the same thing, but this time we want the smaller areas for the scars. Thus we want the data from all rows whose area is less than the largest area, for each file. This approach assumes that the largest area for each file is the spinal cord cross-section, and all other areas represent scars.

scarAreas = lapply(fileList, function(x) x[which(x$Area<max(x$Area)), ])

lapply returns a list, so now we want to turn them back into data frames.

spineData = do.call(rbind,spineAreas)
scarData = do.call(rbind,scarAreas)
#clean up row names
row.names(spineData)=NULL
row.names(scarData)=NULL

The above approach will turn each string into a factor in your data frame. If you don't want them as factors (they can occasionally cause problems because they don't play nicely with some functions), then you can do the following.

do.call(rbind, lapply(spineAreas, data.frame, stringsAsFactors=FALSE))
do.call(rbind, lapply(scarAreas, data.frame, stringsAsFactors=FALSE))
#Then clean up row names as in first example, or pass row.names=F 
#when writing to a .csv file

Let me know if this is what you were trying to accomplish.

Summary of the problem

Now that I have a sample data set, I can see a few problems.

The first problem is that you do not have a .csv file. csv stands for comma separated values, and as you can see, your file does not contain commas between values. It looks like it is a tsv or tab separated values file. In R, you want to read this in using the read.delim() function as follows:

ablationData = read.delim("Scar Ablation Data.txt",stringsAsFactors=F)

(you may also want to consider naming your data with a .tsv extension if it is indeed tab separated)

After reading in the data it is apparent that:

  1. For 'bad' reads, the file contains "Null", which is different than the NULL object in R (notice all caps). Using x=="Null" is the correct way to test for these (as you were doing before).
  2. Reads with no Count data are represented by "" values. I'm guessing this has to do with there being no value present in the .tsv file: since there is nothing between the tabs, it is read in as "". Note that if you were to use a different file format, such as .csv, the "" would be read in as NA instead. This comes down to how the R read.xxx functions handle different file types and is a good thing to keep in mind for the future.
  3. The Count column represents the number of 'features' per measurement. It appears that each measurement has a measurement # row that is an aggregate summary of that measurement. Then each feature of the measurement has its own row represented by measurement #-Feature #. Based on your description of the problem, you want to remove the individual 'feature' measurements and compare only the aggregate values for each measurement set. I'm not sure if this is what you are actually intending/want to do, so I would think carefully about why you are removing the individual feature rows, because they are certainly NOT duplicate/redundant values as you stated they were above.
  4. As mentioned above, we have "" or "Null" values in many of our columns that otherwise contain numeric input. This will cause all of the values in those columns to be cast as character type instead of numeric. This is why the sorting from before was not working, because max() works very differently on characters as opposed to numerics (see the short illustration after this list). After removing the offending "" and "Null" values we will have to cast our desired columns to numeric data types.
  5. Another problem with the data is that its numbers contain both , and . characters. R does not like ,'s in its numbers and will not know how to interpret them. Thus, we will need to remove them.
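
To make points 4 and 5 concrete, here is a small illustration using the two Area values from the question; it is only a sketch of the character-vs-numeric behaviour, not part of the pipeline:

max(c("92,041.52", "3,952,865.00"))                             # "92,041.52" -- alphabetical comparison: "9" > "3"
max(as.numeric(gsub(",", "", c("92,041.52", "3,952,865.00"))))  # 3952865 -- the true maximum after stripping commas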

In Summary:

  • Read in data (as .tsv)
  • Separate out all "Null" values (see note below)
  • Remove all individual feature measurements, keeping only the aggregate data for each measurement set.
  • Remove all , from numbers.
  • Cast columns containing only numbers to numeric
  • Separate the data by file name
  • Process each file
    • Find the aggregate measurement with the largest Area. This represents the spinal column.
    • All other measurement values represent scars.
    • Separate the results into two different data sets. One for scars, one for spinal columns.
  • Add the "Null" values back in (see note below)

A Question: Are you sure you want to separate based on file and then compare only aggregate measurements, or do you really want to separate based on measurement and then compare each feature within that measurement?
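
If the per-measurement grouping is what you actually want, a rough sketch is below. It assumes the data has already been read in as in the Solution section (the data object) and that feature rows are labelled in the measurement #-Feature # form described above; the sub() pattern is hypothetical and would need to match your real labels.

featureData = data[data$Area != "Null", c("Label","Document","Area")]   # drop "Null" reads, keep aggregate and feature rows
featureData$Area = as.numeric(gsub(",", "", featureData$Area))          # strip commas, make numeric
measurementId = sub("-Feature.*$", "", featureData$Label)               # parent measurement id
measurementList = split(featureData, measurementId)                     # one element per measurement set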

Note on previous answer

The spineData should have been the only list to contain "Null" values. This is because the max() and min() of a data set consisting entirely of "Null" is simply "Null". Thus ==max(data) will be true for each "Null" data point (i.e. "Null"=="Null"), but <max(data) will be false for each "Null" data point (i.e. "Null"<"Null"). I really don't think you want to use ==min(data), because then you are going to throw out all intermediate values (presumably valid scar measurements) for each file where you have non-"Null" data.
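
A tiny illustration of that comparison behaviour, using made-up "Null" values rather than the real data:

x = c("Null", "Null", "Null")
max(x)        # "Null"
x == max(x)   # TRUE TRUE TRUE    -- every row matches, so these rows end up in spineData
x < max(x)    # FALSE FALSE FALSE -- nothing is "less than" "Null", so scarData gets none of them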

If you really want to keep the "Null" reads in your data set, I would recommend pulling them out, processing the rest of the data, and then adding them back in at the end.

Solution

Read in data.

 data = read.delim("Scar Ablation Data.tsv",stringsAsFactors=F)

Separate out "Null" measurements.

data2 = data[-which(data$Area=="Null"),]

Remove all feature measurements, keeping only aggregate data for each measurement. Keep only the Label, Document, and Area columns.

data2 = data2[-which(data2$Count==""),c("Label","Document","Area")]

For desired columns containing numeric data, remove , from numbers and cast to type numeric.

data2$Area = as.numeric(gsub(",","",data2$Area))

Separate data by file/Document name.

fileList = split(data2,data2$Document)

Process each file. The largest Area value represents the spinal column; all other (smaller) values represent scars. Each of these statements returns a list with our desired results.

spineAreas = lapply(fileList, function(x) x[which(x$Area==max(x$Area)), ])
scarAreas = lapply(fileList, function(x) x[which(x$Area<max(x$Area)), ])

Convert back to data frames. Here I have added an extra step to avoid our data being converted to factors.

spineAreas = do.call(rbind, lapply(spineAreas, data.frame, stringsAsFactors=FALSE))
scarAreas = do.call(rbind, lapply(scarAreas, data.frame, stringsAsFactors=FALSE))

Add files with "Null" reads back in and clean up row names. Do this only when completely done analyzing the data.

nullDocs = match(unique(data$Document[data$Label=="Null"]),data$Document)
nullDocs = data.frame(data[nullDocs,c("Label","Document","Area")],stringsAsFactors=F)
scarAreas = rbind(nullDocs,scarAreas)
spineAreas = rbind(nullDocs,spineAreas)
row.names(scarAreas)=NULL
row.names(spineAreas)=NULL

Note Well

By adding the "Null" values back in, our Area column will be forced back to the character type since each element in a column must be of the same data type. This is important because it means that you cannot really do any meaningful operations in R on your data.

For example: spineAreas$Area>scarAreas$Area will return

[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
[23]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE

Which might lead us to believe that we did not sort our data correctly.

However: as.numeric(spineAreas$Area)>as.numeric(scarAreas$Area) will return

[1]   NA   NA   NA TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[28] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

This indicates that the first 3 values were strings (in this case "Null") which were replaced by NA, and then shows that our data is correctly sorted.

So either add the "Null" values back in only when you are completely done with data analysis, or recast your desired columns to numeric (e.g. spineAreas$Area = as.numeric(spineAreas$Area)).

If you want to avoid this messy typing business altogether (preferred)

Read in your data so that all "" and "Null" values are represented by NA. This will make life a lot easier, but will not save you from having to remove the , and cast your data to numeric.

Here are the lines you would need to change:

data = read.delim("Scar Ablation Data.tsv",na.strings=c("NA","Null",""),stringsAsFactors=F)
data2 = data[-which(is.na(data$Area)),]
data2 = data2[-which(is.na(data2$Count)),c("Label","Document","Area")]
nullDocs = match(unique(data$Document[is.na(data$Label)]),data$Document)

This will keep your data as numeric even after adding back the null reads and is probably the preferred way to do things.
