My current plot is not useful. How should I plot this distribution when the range of values is so large?
I have 50 years of data and need to see which activity is most harmful.
The data contain about 1000 unique activities in one column, say column1. I am using group_by(column1) and summarise(total = sum(column2, column3)), but a few of the totals run to 6 or 7 digits. Because of these two facts, the x axis of my plot looks bad, and because of the few very high values most points sit close to the x axis.
I believe the main problem is the x axis, since so many names are crowded together in too little space.
I think a log transformation might help you gain some better insight out of your data:
set.seed(1776) # reproducible random numbers
num_obs <- 10000 # set number of observations
options(scipen = 999) # don't use scientific notation
# don't worry about this code, just creating a reproducible example
y <- abs(rnorm(num_obs) + 2) * abs(rnorm(num_obs) * 50)
make_these_outliers <- runif(num_obs, min=0, max=1) > 0.99
y[make_these_outliers] <- abs(rnorm(sum(make_these_outliers), mean = 2) *
                              abs(rnorm(sum(make_these_outliers)) * 50000))
# recreating your current situation
plot(y, main='Ugly Plot')
Now we'll use the log10 transformation on your data and visualize the result. A value of 10 becomes 1, a value of 100 becomes 2, a value of 1000 becomes 3, and so on.
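As a quick check of that mapping, here is a minimal base R sketch. The vector `values` is made up for illustration. Note that zeros map to `-Inf`, so any zero totals would fall off a log-scaled plot.

```r
# Illustrative values only: powers of ten map to their exponents
values <- c(10, 100, 1000, 1e6)
log_values <- log10(values)
print(log_values)

# The transform is undone by raising 10 to the result
back_transformed <- 10^log_values
print(back_transformed)

# Zero has no finite log10, which matters if some totals are 0
log_of_zero <- log10(0)
print(log_of_zero)
```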
# log10
plot(log10(y), col= rgb(0, 0, 0, alpha=0.3), pch=16, main='Log Scale and Transparency - Slightly Better')
The pch = 16 argument fills in the points, and alpha = 0.3 sets the opacity of each point. An alpha of 0.3 means an opacity of 30% (you can also think of this as 70% transparent).
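The plot() call above passes alpha = 0.3 to rgb(). If you want to confirm what that does to the colour, a small sketch with base R's col2rgb() can read the alpha channel back (the variable names are only for illustration):

```r
# Black at 30% opacity, matching the rgb() call in the plot above
point_col <- rgb(0, 0, 0, alpha = 0.3)
print(point_col)

# Split the colour into channels on the 0-255 scale, including alpha
channels <- col2rgb(point_col, alpha = TRUE)

# Convert the alpha channel back to a 0-1 fraction (close to 0.3 after rounding)
alpha_fraction <- channels["alpha", 1] / 255
print(alpha_fraction)
```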
I'll also show this in ggplot2. When you use its scale transformations, ggplot2 labels the y-axis with the original values, which saves you from doing the mental gymnastics of log10 transforms in your head.
# now with ggplot2
# install.packages("ggplot2") # <-- run this if you haven't installed ggplot2 yet
library(ggplot2)
# ggplot2 prefers your data to be in a data.frame (makes it easier to work with)
data_df <- data.frame(
  index = 1:num_obs,
  y = y)
ggplot(data = data_df, aes(x = index, y = y)) +
  geom_point(alpha = 0.2) +
  scale_y_continuous(trans = "log10") +
  ggtitle("Y-axis reflects values of the datapoints", "even better?") +
  theme_bw(base_size = 12)
At this point, you can start to tell how I've constructed the fake data, which is why there is such a high concentration of points in the 10-1000 range.
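The same log scale can be applied to per-activity totals like the ones in the question. What follows is only a sketch under assumptions: the data frame `raw_df` is simulated, the column names column1, column2 and column3 are taken from the question, and keeping the top 20 activities is an arbitrary cut-off chosen so the axis labels stay readable.

```r
library(ggplot2)

set.seed(1776)
# Simulated stand-in for the real data: 1000 activities, two numeric columns
n_rows <- 20000
raw_df <- data.frame(
  column1 = sample(paste0("activity_", 1:1000), n_rows, replace = TRUE),
  column2 = rexp(n_rows, rate = 1 / 100),
  column3 = rexp(n_rows, rate = 1 / 100)
)
# Inflate five activities so that a few totals dwarf the rest
big <- raw_df$column1 %in% paste0("activity_", 1:5)
raw_df$column2[big] <- raw_df$column2[big] * 1000

# Sum column2 and column3 per activity
raw_df$total <- raw_df$column2 + raw_df$column3
totals_df <- aggregate(total ~ column1, data = raw_df, FUN = sum)

# Keep only the 20 largest totals (arbitrary cut-off)
top_df <- head(totals_df[order(-totals_df$total), ], 20)

# Dot plot with activities on the vertical axis and totals on a log10 scale
p_top <- ggplot(top_df, aes(x = reorder(column1, total), y = total)) +
  geom_point() +
  coord_flip() +
  scale_y_log10() +
  labs(x = "activity", y = "total (log10 scale)") +
  theme_bw(base_size = 12)
print(p_top)
```

Flipping the coordinates puts the activity names on the vertical axis, where 20 labels have room to be read.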
Hopefully this helps! I definitely recommend taking PauloH's advice and asking around on stats.stackexchange.com as well to make sure you aren't misrepresenting your data.
Using ggplot2 instead and setting alpha may solve your problem, but if that is not enough you may want to add facet_zoom() from the ggforce package.
set.seed(1776)
num_obs <- 10000
options(scipen = 999)
y <- abs(rnorm(num_obs) + 2) * abs(rnorm(num_obs) * 50)
make_these_outliers <- runif(num_obs, min=0, max=1) > 0.99
y[make_these_outliers] <- abs(rnorm(sum(make_these_outliers), mean = 2) *
                              abs(rnorm(sum(make_these_outliers)) * 50000))
# install.packages('ggplot2')
library(ggplot2)
# install.packages('ggforce')
library(ggforce)
data_df <- data.frame(
  index = 1:num_obs,
  y = y)
ggplot(data = data_df, aes(x = index, y = y)) +
  geom_point(alpha = 0.05) +
  facet_zoom(y = (y <= 500), zoom.size = .8) +
  theme_bw()
The result shows the full data in one panel alongside a zoomed panel restricted to the points with y <= 500.
Hope it helps. See the ggforce GitHub repository for more details.
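To judge whether a zoom threshold such as y <= 500 is sensible, it may help to check how much of the data falls inside it. This is a rough sketch that rebuilds similar fake data in compact form. The names `is_outlier` and `share_in_zoom` are introduced here for illustration, and 500 is simply the threshold used in the facet_zoom() call.

```r
set.seed(1776)
num_obs <- 10000

# Bulk of small values plus roughly 1% very large outliers
y <- abs(rnorm(num_obs) + 2) * abs(rnorm(num_obs) * 50)
is_outlier <- runif(num_obs) > 0.99
y[is_outlier] <- abs(rnorm(sum(is_outlier), mean = 2) *
                     rnorm(sum(is_outlier)) * 50000)

# Fraction of observations that the zoomed panel would display
share_in_zoom <- mean(y <= 500)
print(share_in_zoom)
```

If the share is very high, the zoomed panel carries almost all of the points and the full panel mainly serves to show where the outliers sit.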