
Scatter plot in R with a huge number of unique observations

Currently, my plot is not useful. How should I plot this distribution when the range of values is so wide?

I have 50 years of data and need to see which activity is the most harmful.

The data contain about 1,000 unique activities in column1. I am using group_by(column1) and summarise(total = sum(column2, column3)), but a few of the totals run to 6 or 7 digits. Because of these two facts, the x-axis of my plot is unreadable, and the few very high values stretch the y-axis so that most points sit near the x-axis.
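For reference, the aggregation described above might look like the following minimal sketch. The data frame `df` and its values are made up for illustration; only the column names and the grouping and summarising calls come from the description, and the dplyr package is assumed to be installed.

```r
# Toy data only: the real column1 has about 1000 distinct activities
library(dplyr)

df <- data.frame(
  column1 = c("a", "a", "b", "c"),
  column2 = c(1, 2, 3, 4),
  column3 = c(10, 20, 300, 4000)
)

# One row per activity; total adds up every value of both columns in the group
totals <- df %>%
  group_by(column1) %>%
  summarise(total = sum(column2, column3))

totals
```

Even in this toy case the totals span two orders of magnitude, which is the same effect that flattens most points against the x-axis in the real plot.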

[Image: screenshot of the plot]

I believe the problem is on the x-axis, since so many names are crammed together in too little space.

I think a log transformation might help you gain better insight into your data:

Set up some fake data that resembles your situation:

set.seed(1776)        # reproducible random numbers
num_obs <- 10000      # set number of observations
options(scipen = 999) # don't use scientific notation

# don't worry about this code, just creating a reproducible example
y <- abs(rnorm(num_obs) + 2) * abs(rnorm(num_obs) * 50)
make_these_outliers <- runif(num_obs, min=0, max=1) > 0.99
y[make_these_outliers] <- abs(rnorm(sum(make_these_outliers), mean = 2) *
                                abs(rnorm(sum(make_these_outliers)) * 50000))

Recreate the plot you have now to show the issue you're facing:

# recreating your current situation
plot(y, main='Ugly Plot')

[Image: Ugly Plot]

Log10 transformation

Now we'll apply the log10 transformation to your data and visualize the result. A value of 10 becomes 1, a value of 100 becomes 2, a value of 1000 becomes 3, and so on.
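A quick console check of that mapping, as a minimal sketch. The zero case and the `+ 1` shift are my own additions rather than part of the plots here: if any of your totals is exactly 0, log10() returns -Inf and the point cannot be drawn, and shifting by 1 is one common workaround that slightly changes every value.

```r
x <- c(10, 100, 1000)
log10(x)                  # 1 2 3

# Caveat: zero has no finite logarithm
log_of_zero <- log10(0)   # -Inf

# One possible workaround (an assumption, not used in the plots here)
shifted <- log10(c(0, 9, 99) + 1)   # 0 1 2
shifted
```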

# log10
plot(log10(y), col= rgb(0, 0, 0, alpha=0.3), pch=16, main='Log Scale and Transparency - Slightly Better')

[Image: base R plot on a log10 scale]

The pch = 16 argument draws solid, filled points, and alpha = 0.3 sets the opacity of each point. An alpha of 0.3 means an opacity of 30% (you can also think of this as 70% transparent).

ggplot2

I'll also show this in ggplot2. With a scale transformation, ggplot2 labels the y-axis with the original values, which saves you from doing the mental gymnastics of log10 transforms in your head.

# now with ggplot2 
# install.packages("ggplot2")    # <-- run this if you haven't installed ggplot2 yet
library(ggplot2)

# ggplot2 prefers your data to be in a data.frame (makes it easier to work with)
data_df <- data.frame(
    index = 1:num_obs,
    y = y)


ggplot(data = data_df, aes(x = index, y = y)) +
    geom_point(alpha=0.2) +
    scale_y_continuous(trans="log10") +
    ggtitle("Y-axis reflects values of the datapoints", "even better?") +
    theme_bw(base_size = 12)


At this point, you can start to tell how I've constructed the fake data, which is why there is such a high concentration of points in the 10-1000 range.
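To back that up with numbers rather than by eye, here is a small sketch that rebuilds the same fake y and measures how much of it falls in that band. The name `share_mid` is mine, and the exact figures depend on the seed, so none are quoted here.

```r
set.seed(1776)
num_obs <- 10000

# Same fake data as above
y <- abs(rnorm(num_obs) + 2) * abs(rnorm(num_obs) * 50)
make_these_outliers <- runif(num_obs, min = 0, max = 1) > 0.99
y[make_these_outliers] <- abs(rnorm(sum(make_these_outliers), mean = 2) *
                                abs(rnorm(sum(make_these_outliers)) * 50000))

# Share of points that fall between 10 and 1000
share_mid <- mean(y >= 10 & y <= 1000)
share_mid

# Quantiles show where the bulk sits and how far the tail reaches
quantile(y, probs = c(0.01, 0.25, 0.5, 0.75, 0.99, 1))
```

Running the same two checks on your real totals is a cheap way to decide whether a log scale, a zoom, or both are worth it.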

Hopefully this helps! I definitely recommend taking PauloH's advice and asking around on stats.stackexchange.com as well to make sure you aren't misrepresenting your data.

Using ggplot2 and setting alpha may solve your problem, but if that is not enough you may want to add facet_zoom() from the ggforce package.

set.seed(1776)      
num_obs <- 10000     
options(scipen = 999) 

y <- abs(rnorm(num_obs) + 2) * abs(rnorm(num_obs) * 50)
make_these_outliers <- runif(num_obs, min=0, max=1) > 0.99
y[make_these_outliers] <- abs(rnorm(sum(make_these_outliers), mean = 2) *
                                abs(rnorm(sum(make_these_outliers)) * 50000))

# install.packages('ggplot2')
library(ggplot2)
# install.packages('ggforce')
library(ggforce)

data_df <- data.frame(
  index = 1:num_obs,
  y = y)


ggplot(data = data_df, aes(x = index, y = y)) +
  geom_point(alpha=0.05) +
  facet_zoom(y = (y <= 500), zoom.size = .8) +
  theme_bw()

The result would look more or less like the following: [Image: scatter plot with a zoomed panel for y <= 500]
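If installing ggforce is not an option, a rough stand-in using only ggplot2 is coord_cartesian(), sketched below. This is a different technique from facet_zoom(): it shows only the zoomed view, without the full-range panel next to it. The name `p_zoom` and the 0 to 500 window are just illustrative choices that mirror the cutoff used above.

```r
library(ggplot2)

set.seed(1776)
num_obs <- 10000

# Same fake data as above
y <- abs(rnorm(num_obs) + 2) * abs(rnorm(num_obs) * 50)
make_these_outliers <- runif(num_obs, min = 0, max = 1) > 0.99
y[make_these_outliers] <- abs(rnorm(sum(make_these_outliers), mean = 2) *
                                abs(rnorm(sum(make_these_outliers)) * 50000))

data_df <- data.frame(index = 1:num_obs, y = y)

# coord_cartesian() changes the visible window without dropping any rows
p_zoom <- ggplot(data = data_df, aes(x = index, y = y)) +
  geom_point(alpha = 0.05) +
  coord_cartesian(ylim = c(0, 500)) +
  theme_bw()

p_zoom
```

Setting limits through coord_cartesian() rather than ylim() matters if you later add smoothers or summaries, because ylim() removes the out-of-range points before anything is computed.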

Hope it helps. Check out ggforce's GitHub repository:

https://github.com/thomasp85/ggforce
