Labeling outlier data points in a 3D scatter plot

Question

I have a tab-delimited dataset that looks like this:

Labels  t1  t2  t3
gene1   0.000000E+00    0.000000E+00    1.138501E-01
gene2   0.000000E+00    0.000000E+00    9.550272E-02
gene3   0.000000E+00    1.851936E-02    1.019907E-01
gene4   8.212816E-02    0.000000E+00    6.570984E+00
gene5   1.282434E-01    0.000000E+00    6.240799E+00
gene6   2.918929E-01    8.453281E-01    3.387610E+00
gene7   0.000000E+00    1.923038E-01    0.000000E+00
gene8   1.135057E+00    0.000000E+00    2.491100E+00
gene9   7.935625E-01    1.070320E-01    2.439292E+00
gene10  5.046790E+00    0.000000E+00    2.459273E+00
gene11  3.293614E-01    0.000000E+00    2.380152E+00
gene12  0.000000E+00    0.000000E+00    1.474757E-01
gene13  0.000000E+00    0.000000E+00    1.521591E-01
gene14  0.000000E+00    9.968809E-02    8.387166E-01
gene15  0.000000E+00    1.065761E-01    0.000000E+00

What I want: a 3D scatter plot with the outliers labeled, as in the image originally attached here (image not included).

What I have done, in R:

I read in the data and plotted the columns individually, like this:

library("scatterplot3d")
temp1 <- read.table("tempdata.txt", header=TRUE)
scatterplot3d(temp1$t1, temp1$t2, temp1$t3)

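What I need: labels shown on the outliers, at least for the top 250, or alternatively a way to get the labels of the top 250 outliers into a variable for further analysis.

Can anyone guide me on how to do this in R?

A Python solution would also be welcome.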

Answer 1

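Drawing 250 labels on the plot is not a good choice, because it makes the plot unreadable. If you want to label outliers in a plot, they should be far away from the other data points so that they can be identified easily. However, you can save the 250 largest z values (here, t3) and their corresponding labels in a matrix for further analysis. I would do something like this: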

# Create some random data
library("scatterplot3d")
temp1 <- as.data.frame(matrix(rnorm(900), ncol=3))
temp1$labels <- paste0("gene", seq_len(nrow(temp1)))  # one unique label per row
colnames(temp1) <- c("t1", "t2", "t3", "labels")

# get the outliers: the 5 largest t3 values (use 1:250 for the top 250)
zz.outlier <- sort(temp1$t3, TRUE)[1:5]
ix <- which(temp1$t3 %in% zz.outlier)
outlier.matrix <- temp1[ix, ]

# create the plot and mark the points
sd3 <- scatterplot3d(temp1$t1, temp1$t2, temp1$t3)
sd3$points3d(temp1$t1[ix], temp1$t2[ix], temp1$t3[ix], col="red")
text(sd3$xyz.convert(temp1$t1[ix], temp1$t2[ix], temp1$t3[ix]),
     labels=temp1$labels[ix])

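Here I have also marked the points in red. This lets you mark a much larger number of outliers than text labels would, while still keeping the plot readable. However, it will also fail if several points lie close together.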
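Since the question also asked for Python, here is a minimal sketch of the same selection step: rank the rows by t3 and keep the labels of the top k in a variable. It uses only the standard library. The name `top_outliers` and the inline `SAMPLE` text are made up for illustration (a few rows copied from the question); with the real file you would pass an open file handle and `k=250`.

```python
import csv
import io

# A few rows copied from the question's data, tab-delimited (illustrative only).
SAMPLE = (
    "Labels\tt1\tt2\tt3\n"
    "gene1\t0.000000E+00\t0.000000E+00\t1.138501E-01\n"
    "gene4\t8.212816E-02\t0.000000E+00\t6.570984E+00\n"
    "gene5\t1.282434E-01\t0.000000E+00\t6.240799E+00\n"
    "gene7\t0.000000E+00\t1.923038E-01\t0.000000E+00\n"
    "gene10\t5.046790E+00\t0.000000E+00\t2.459273E+00\n"
)

def top_outliers(handle, k, column="t3"):
    """Return the k rows with the largest value in `column`, largest first."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    rows.sort(key=lambda row: float(row[column]), reverse=True)
    return rows[:k]

outliers = top_outliers(io.StringIO(SAMPLE), k=3)
outlier_labels = [row["Labels"] for row in outliers]
print(outlier_labels)
```

For the real data, something like `with open("tempdata.txt") as fh: outliers = top_outliers(fh, k=250)` should work, assuming the file has the header row shown in the question.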

Answer 2

Here is a version in matplotlib:

import numpy as np
from matplotlib import pyplot, cm
from mpl_toolkits.mplot3d import Axes3D

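# skip_header=1 drops the "Labels t1 t2 t3" row; usecols skips the label column
data = np.genfromtxt('genes.txt', skip_header=1, usecols=range(1,4))
N = len(data)
nout = N // 4   # top 25% in magnitude (integer division, so it can be used in a slice)
outliers = np.argsort(np.sqrt(np.sum(data**2, 1)))[-nout:]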
outlies = np.zeros(N)
outlies[outliers] = 1   # now an array of 0 or 1, depending on whether an outlier

fig = pyplot.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(*data.T, c=cm.jet(outlies)) # color each point by whether it is an outlier
pyplot.show()

In the resulting plot, the red points are far from the origin and the blue points are near it.
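The code above colors the outliers but does not label them. A rough sketch of adding text labels with `ax.text` might look like the following. The small inline arrays (a few rows from the question), the cutoff `n_label = 2`, and the output filename are assumptions for illustration, and distance from the origin is used as the outlier criterion, as in this answer.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
from matplotlib import pyplot

# A few rows copied from the question's data (illustrative only).
labels = np.array(["gene4", "gene5", "gene6", "gene8", "gene10", "gene12"])
data = np.array([
    [8.212816e-02, 0.0,          6.570984e+00],
    [1.282434e-01, 0.0,          6.240799e+00],
    [2.918929e-01, 8.453281e-01, 3.387610e+00],
    [1.135057e+00, 0.0,          2.491100e+00],
    [5.046790e+00, 0.0,          2.459273e+00],
    [0.0,          0.0,          1.474757e-01],
])

n_label = 2  # how many outliers to label; keep this small so the plot stays readable
distance = np.sqrt(np.sum(data**2, axis=1))   # distance of each point from the origin
outliers = np.argsort(distance)[-n_label:]    # indices of the n_label farthest points
is_outlier = np.zeros(len(data), dtype=bool)
is_outlier[outliers] = True

fig = pyplot.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(*data[~is_outlier].T, color="blue")  # ordinary points
ax.scatter(*data[is_outlier].T, color="red")    # outliers
for i in outliers:
    x, y, z = data[i]
    ax.text(x, y, z, labels[i])                 # put the label at the point itself
fig.savefig("outliers_3d.png")
```

As noted in the first answer, text labels only stay legible for a handful of points, so for 250 outliers it is probably better to keep `labels[outliers]` in a variable and label only the most extreme few.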

Answer 1
1 · accepted · 2013-05-07 22:00:19

Answer 2
1 · 2013-05-07 23:22:38