How to better label data points in an R scatter plot with ggplot

Question

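I have very limited experience with R. I'm doing some text analysis on about 11,000 survey comments, guided mainly by Silge & Robinson's book "Text Mining with R". Anyway...

There are several different locations in the data set, and I split the data into multiple data frames representing "Location_X" and "Not_X", "Location_Y" and "Not_Y", and so on. I then calculated the relative frequency of words (from unigrams) and ended up with a data frame called scatter_frequency that looks like this: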

+---------------+--------------+--------------+   
|     word      |  location_x  |    not_x     |  
+---------------+--------------+--------------+  
| acceptance    | 1.538130e-04 | 8.972231e-05 |  
| accepted      | 1.076691e-04 | 1.794446e-04 |  
| accepting     | 1.768850e-04 | 1.794446e-04 |  
| access        | 8.305903e-04 | 8.075008e-04 |  
| accessible    | 1.461224e-04 | 4.486115e-05 |  
| accident      | 7.690651e-06 | 4.486115e-05 |  
| accolades     | 7.690651e-06 | 4.486115e-05 |  
| accommodate   | 2.307195e-05 | 4.486115e-05 |  
| accommodating | 1.538130e-05 | 4.486115e-05 |  
| accomplish    | 4.460578e-04 | 7.626396e-04 |  
| accomplished  | 3.614606e-04 | 3.140281e-04 |  
+---------------+--------------+--------------+

and so on for roughly 4,000 rows.

Then I plot it:

ggplot(scatter_frequency, aes(x=location_x, y=not_x)) +
  geom_abline(color="gray40", lty=2) +
  geom_jitter(alpha=0.1, size=2.5, width=0.3, height=0.3) +
  geom_text(aes(label=word), check_overlap = TRUE, vjust=1.5) +
  scale_x_log10(labels=percent_format()) +
  scale_y_log10(labels=percent_format()) +
  scale_color_gradient(limits=c(0, 0.001),
                      low="darkslategray4", high="gray75") +
  theme(legend.position = "none") +
  labs(x="Location X", y="Not X")

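which produces this plot.

You can see where I have blurred some identifying terms, but this is fairly representative.

So far so good... we can now see which terms show up with greater frequency (farther to the right) and which are more frequent in one data set than the other (away from the line). The interesting terms are the ones farthest from the line, because they are either notably common or notably uncommon at Location X. Terms near the line are not that interesting. This is a survey about management, so it is no surprise that "leadership" and "management" show up. But the fact that "abuse" is more common at Location X than elsewhere is interesting. I would also like to know which word corresponds to the point below and to the left of "shop".

So my question: is there a programmatic way to restrict labeling to those "interesting" points? For example, choosing which points to label based on their distance from the line?

This may not be the best question... thanks in advance for your patience.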

Answer 1

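Here is a solution that calculates the shortest distance from each point to the line and then keeps only the points whose distance is greater than a chosen threshold.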

library(ggplot2)
library(scales)

# Distance from each point to the line.
# The line has a slope of 1 and an intercept of 0.
# (Named dist_to_line to avoid masking stats::dist.)
dist_to_line <- abs(scatter_frequency$location_x - scatter_frequency$not_x) / sqrt(2)
# Choose the distance threshold above which points get a label
toplot <- which(dist_to_line > 3e-5)

#Edit the geom_text option to use the reduced dataset of labels.
ggplot(scatter_frequency, aes(x=location_x, y=not_x)) +
  geom_abline(color="gray40", lty=2) +
  geom_point(alpha=0.1, size=2.5) +
  geom_text(data=scatter_frequency[toplot,], aes(x=location_x, y=not_x, label=word), check_overlap = TRUE, vjust=1.5) +
  scale_x_log10(labels=percent_format()) +
  scale_y_log10(labels=percent_format()) +
  scale_color_gradient(limits=c(0, 0.001),
                       low="darkslategray4", high="gray75") +
  theme(legend.position = "none") +
  labs(x="Location X", y="Not X")

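The set of labels that gets plotted is not quite right, but that is a consequence of the log-log scale: the distance is computed on the raw values while the plot is drawn on logarithmic axes.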
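One way to address the log-log mismatch is to measure the distance on log10-transformed values, so that the threshold corresponds to what is seen on the plot. The following is a minimal sketch under stated assumptions, not part of the original answer: it uses five rows copied from the question's table, and the names `log_dist`, `log_threshold`, and `to_label` as well as the threshold value 0.2 are illustrative. Note that "accident" has a raw distance of about 2.6e-5, under the 3e-5 threshold used above, even though its frequency differs nearly sixfold between the two groups.

```r
# Five sample rows copied from the question's table (not the full data set)
scatter_frequency <- data.frame(
  word       = c("acceptance", "accepting", "accessible", "accident", "access"),
  location_x = c(1.538130e-04, 1.768850e-04, 1.461224e-04, 7.690651e-06, 8.305903e-04),
  not_x      = c(8.972231e-05, 1.794446e-04, 4.486115e-05, 4.486115e-05, 8.075008e-04),
  stringsAsFactors = FALSE
)

# Perpendicular distance to the identity line, measured on log10 axes
log_dist <- abs(log10(scatter_frequency$location_x) -
                log10(scatter_frequency$not_x)) / sqrt(2)

# Assumed threshold in log10 units; tune it for your own data
log_threshold <- 0.2

# Rows to pass to geom_text(data = ...)
to_label <- scatter_frequency[log_dist > log_threshold, ]
```

The resulting `to_label` data frame can take the place of `scatter_frequency[toplot,]` in the `geom_text()` call.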

Answer 2

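Nice question.

You should include the packages you are using so that the example is complete.

Your abline is the identity line, so the points you consider interesting are those where the absolute value of the difference between the x and y coordinates is above some threshold.

You are using geom_jitter, but that interferes with the label placement done by geom_text_repel, which I decided to use to avoid overlaps and to draw line segments connecting the labels to their points. So I used geom_point instead.

When you apply this to the full data set, you may need to experiment with the parameters nudge_x, nudge_y, force, max.iter, and other geom_text_repel arguments. Check the documentation.

Here is the code: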

library(tidyverse)
library(ggrepel)
library(scales)
#> 
#> Attaching package: 'scales'
#> The following object is masked from 'package:purrr':
#> 
#>     discard
#> The following object is masked from 'package:readr':
#> 
#>     col_factor

scatter_frequency <- tibble(
  word = c(
    'acceptance',   
    'accepted',     
    'accepting',    
    'access',       
    'accessible',   
    'accident',     
    'accolades',    
    'accommodate',  
    'accommodating',
    'accomplish',   
    'accomplished'
  ),
  location_x = c(
    1.538130e-04, 
    1.076691e-04, 
    1.768850e-04, 
    8.305903e-04, 
    1.461224e-04, 
    7.690651e-06, 
    7.690651e-06, 
    2.307195e-05, 
    1.538130e-05, 
    4.460578e-04, 
    3.614606e-04
  ),
  not_x = c(
    8.972231e-05, 
    1.794446e-04, 
    1.794446e-04, 
    8.075008e-04, 
    4.486115e-05, 
    4.486115e-05, 
    4.486115e-05, 
    4.486115e-05, 
    4.486115e-05, 
    7.626396e-04, 
    3.140281e-04
  )
)

# Select n points most distant from the line
n <- 5
important <- scatter_frequency %>% 
  mutate(lsqd = (abs(log10(location_x) - log10(not_x)))) %>% 
  top_n(n, wt = lsqd)

ggplot(scatter_frequency, aes(x=location_x, y=not_x)) +
  geom_abline(color="gray40", lty=2) +
  geom_point(alpha=0.1, size=2.5) +
  geom_text_repel(
    data = important,
    aes(label = word),
    min.segment.length = 0,
    # nudge_x = -.5,
    # nudge_y = .5,
    force = 50,
    max.iter = 5000
  ) +
  scale_x_log10(limits = c(.000001, .01), labels=percent_format()) +
  scale_y_log10(limits = c(.000001, .01), labels=percent_format()) +
  scale_color_gradient(limits=c(0, 0.001),
                      low="darkslategray4", high="gray75") +
  theme(legend.position = "none") +
  labs(x="Location X", y="Not X")

Created on 2019-12-13 by the reprex package (v0.3.0)
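As a side note, `top_n()` has been superseded in dplyr 1.0.0 and later by `slice_max()`. A minimal sketch of the same selection step with `slice_max()` follows, assuming a recent dplyr; the five sample rows are copied from the question's table, and the column name `log_dist` and the value of `n` are illustrative rather than taken from the answer.

```r
library(dplyr)

# Five sample rows copied from the question's table (not the full data set)
scatter_frequency <- tibble(
  word       = c("acceptance", "accepting", "accessible", "accident", "access"),
  location_x = c(1.538130e-04, 1.768850e-04, 1.461224e-04, 7.690651e-06, 8.305903e-04),
  not_x      = c(8.972231e-05, 1.794446e-04, 4.486115e-05, 4.486115e-05, 8.075008e-04)
)

# Keep the n rows farthest from the identity line on log10 axes
n <- 2
important <- scatter_frequency %>%
  mutate(log_dist = abs(log10(location_x) - log10(not_x))) %>%
  slice_max(log_dist, n = n, with_ties = FALSE)
```

Setting `with_ties = FALSE` guarantees exactly `n` rows; with the default, tied values can return more than `n` labels.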

Answer 3

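As we discussed yesterday, using the slope and intercept of the line, you can add a column holding the value of the abline at each point's x coordinate (for the identity line drawn by a bare geom_abline(), slope is 1 and intercept is 0):

scatter_frequency$reg <- slope * scatter_frequency$location_x + intercept

Then choose the distance from the line that you are interested in, and build a subset of the data that lies at that distance or farther:

# Threshold in the same units as not_x; pick a value suited to your data
minDist <- 0.2
labeledPoints <- subset(scatter_frequency, abs(not_x - reg) > minDist)

Then use that subset as the label data in geom_text:

geom_text(data = labeledPoints, aes(label = word), check_overlap = TRUE, vjust = 1.5)

You can also create a column with the distance from the line directly and use it to subset inside geom_text:

scatter_frequency$dist <- abs(scatter_frequency$not_x - (slope * scatter_frequency$location_x + intercept))
geom_text(data = subset(scatter_frequency, dist > minDist), aes(label = word), check_overlap = TRUE, vjust = 1.5)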
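The distance used in this answer is the vertical gap between a point and the line. If a perpendicular distance is preferred, the standard point-to-line formula for y = slope * x + intercept can be used instead. The sketch below is illustrative only: `dist_to_line` is a hypothetical helper name, and the sample rows are copied from the question's table.

```r
# Hypothetical helper: perpendicular distance from (x, y)
# to the line y = slope * x + intercept
dist_to_line <- function(x, y, slope = 1, intercept = 0) {
  abs(slope * x - y + intercept) / sqrt(slope^2 + 1)
}

# Three sample rows copied from the question's table
scatter_frequency <- data.frame(
  word       = c("acceptance", "accessible", "access"),
  location_x = c(1.538130e-04, 1.461224e-04, 8.305903e-04),
  not_x      = c(8.972231e-05, 4.486115e-05, 8.075008e-04),
  stringsAsFactors = FALSE
)

# Distance of every point from the identity line (slope 1, intercept 0)
scatter_frequency$dist <- dist_to_line(scatter_frequency$location_x,
                                       scatter_frequency$not_x)
```

With slope 1 and intercept 0 this reduces to the abs(x - y) / sqrt(2) expression from Answer 1, and the vertical distance is larger than it by a factor of sqrt(slope^2 + 1).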
