
Plotting a million points in R?

I have a tab-delimited text file with 3 columns, A, B and C:

A                      B                      C
0.07142857142857142    0.35714285714285715    0.21428571428571427
0.0                    0.3333333333333333     0.3888888888888889
0.07142857142857142    0.35714285714285715    0.21428571428571427
0.0                    0.3333333333333333     0.3888888888888889

Each row represents a sample with 3 different percentages A, B and C. In total I have 4 files for 4 different organisms. There can be more than a million rows per file.

My idea is to plot each row in order to see the distribution of the (A, B, C) triples in a given file, then determine the most frequent triple in that file, and then compare the 4 files.

I tried plotting these points in R for each file (several curves on one graph: A, B and C on the y-axis and the sample number on the x-axis), but there are so many points that the graph can't be interpreted. Also, for the file with a million rows, R crashes and won't plot the points.

What would be the best approach to represent these points? Also, is the mode function enough to determine the most frequent triple (A, B, C), or is there an appropriate statistical test I could use?

Any help would be much appreciated.

Thanks.

As I mentioned in my comment, clustering may be a solution to your problem. Here is one way of clustering using kmeans:

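library(ggplot2)
set.seed(1)  # kmeans starts from random centres; fix the seed so the run is repeatable
irisCl <- transform(iris, Cluster = kmeans(iris[1:4], centers = 3)$cluster)
# One panel per cluster; colour shows the true species (qplot is deprecated in recent ggplot2)
ggplot(irisCl, aes(Sepal.Length, Sepal.Width, colour = Species)) +
  geom_point() + facet_grid(~Cluster)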

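[Figure: kmeans clusters of the iris data, Sepal.Width against Sepal.Length, one panel per cluster, points coloured by species]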

Note that we have clustered in a 4-dimensional variable space. As you can see, the setosa flowers are identified correctly in the first cluster and the second cluster contains only virginica, but the third cluster contains a mixture of versicolor and virginica. Cluster numbers are arbitrary and can differ between runs, because kmeans starts from random centres.
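The cluster composition can also be checked without a plot, and the same steps carry over to a three-column file like the one in the question. The following is only a rough sketch under stated assumptions: the data frame `dat` is simulated as a stand-in for one of your files (two made-up groups of samples), the choice of 2 clusters is arbitrary, and `nstart` is added just to make the result less dependent on the random start. It uses base R only.

```r
# Part 1: cross-tabulate kmeans clusters against the known iris species.
set.seed(1)
km_iris <- kmeans(iris[1:4], centers = 3, nstart = 10)
tab <- table(Cluster = km_iris$cluster, Species = iris$Species)
print(tab)  # rows are clusters, columns are species

# Part 2: the same idea on a stand-in for one (A, B, C) file.
# The values below are simulated (hypothetical); with real data you would
# read the tab-delimited file instead, e.g. with read.delim().
n <- 10000
grp <- sample(1:2, n, replace = TRUE, prob = c(0.7, 0.3))
dat <- data.frame(
  A = ifelse(grp == 1, 0.10, 0.60) + rnorm(n, sd = 0.02),
  B = ifelse(grp == 1, 0.35, 0.20) + rnorm(n, sd = 0.02),
  C = ifelse(grp == 1, 0.20, 0.10) + rnorm(n, sd = 0.02)
)

# Cluster in the 3-dimensional (A, B, C) space; k = 2 is an assumption.
km <- kmeans(dat, centers = 2, nstart = 5)

# Cluster sizes and centres summarise the rows without drawing each one.
sizes <- table(km$cluster)
print(sizes)
print(round(km$centers, 2))

# The centre of the largest cluster is a rough guide to the most common
# (A, B, C) region in this file.
biggest <- as.integer(names(which.max(sizes)))
print(km$centers[biggest, ])
```

Running this once per file gives a small table of centres and sizes per organism, which is easier to compare across the 4 files than four plots of a million points each. The number of clusters would need to be chosen for the real data; 2 is used here only because the simulated data has two groups.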
