I have a data frame of national job-posting counts for several areas of tech/biotech, together with the number of postings that also fall into each of the other areas. I wish to create a heatmap showing the intersections (as posting counts) between these fields, along with the proportion of these "duplicates." The data frame looks similar to:
df <- data.frame(matrix(nrow=4, byrow=TRUE, data=c(
  14000, 3300,  2500,   1000,
   3300, 3300,   700,    300,
   2500,  700, 95000,   7500,
   1000,  300,  7500, 108000)))
colnames(df) <- rownames(df) <- c("ML & Image", "Software Dev", "Cloud Dev", "Bioinformatics & Health")
So, for example, the first row starts with the total number of ML & Image job postings, followed by the number of ML & Image postings that also qualify as Software Dev postings, then the number that also qualify as Cloud Dev postings, and so on.
I would like the heatmap to look like the table you get when printing df in the R console, keeping the numeric posting counts in each cell, but with the cells colored by the proportion of overlap between the fields: red(ish) where there is little overlap, yellow(ish) where the overlap is around 30-60%, and green(ish) where there is a lot of overlap, with a colorbar on the side for reference.
Any help on this is much appreciated. Thanks!
Not sure I completely understood what you were asking, but the following might give you some ideas.
> library(ggplot2)
> library(reshape2)
# Setup the data
> df <- data.frame(matrix(nrow=4, byrow=TRUE, data=c(14000, 3300, 2500, 1000, 3300, 3300, 700, 300, 2500, 700, 95000,7500, 1000, 300, 7500, 108000)))
> colnames(df) <- rownames(df) <- c("ML & Image", "Software Dev", "Cloud Dev", "Bioinformatics & Health")
> df
                        ML & Image Software Dev Cloud Dev Bioinformatics & Health
ML & Image                   14000         3300      2500                    1000
Software Dev                  3300         3300       700                     300
Cloud Dev                     2500          700     95000                    7500
Bioinformatics & Health       1000          300      7500                  108000
# Convert df to matrix and divide each column by the diagonal value
> m <- data.matrix(df)
> m <- m / matrix(t(colSums(diag(4) * m)), nrow=4, ncol=4, byrow=TRUE)
> m
                        ML & Image Software Dev   Cloud Dev Bioinformatics & Health
ML & Image              1.00000000   1.00000000 0.026315789             0.009259259
Software Dev            0.23571429   1.00000000 0.007368421             0.002777778
Cloud Dev               0.17857143   0.21212121 1.000000000             0.069444444
Bioinformatics & Health 0.07142857   0.09090909 0.078947368             1.000000000
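The column-wise division above can also be written with base R's sweep(). The following is a minimal sketch rather than part of the original answer; the names fields, counts, by_col, by_col_explicit and by_row are made up for illustration. It also computes a row-wise variant, in case the intended proportion is relative to the row field instead of the column field (that reading of the question is an assumption).

```r
# Rebuild the counts matrix from the question (values copied from above).
fields <- c("ML & Image", "Software Dev", "Cloud Dev", "Bioinformatics & Health")
counts <- matrix(c(14000, 3300,  2500,   1000,
                    3300, 3300,   700,    300,
                    2500,  700, 95000,   7500,
                    1000,  300,  7500, 108000),
                 nrow = 4, byrow = TRUE, dimnames = list(fields, fields))

# Divide every column j by its diagonal entry counts[j, j].
by_col <- sweep(counts, 2, diag(counts), "/")

# The same division written out with an explicit matrix of divisors.
by_col_explicit <- counts / matrix(diag(counts), nrow = 4, ncol = 4, byrow = TRUE)
stopifnot(isTRUE(all.equal(by_col, by_col_explicit)))

# Hypothetical alternative: divide every row i by counts[i, i] instead.
by_row <- sweep(counts, 1, diag(counts), "/")
```

Because the counts matrix is symmetric, the row-wise result is just the transpose of the column-wise one, so the choice only affects which axis of the heatmap carries which reading.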
# Prepare the data for ggplot2 by melting the matrix into long format and
# adding the posting counts back in to be used as labels
> hm <- melt(m)
> hm$postings <- c(df[,1],df[,2],df[,3],df[,4])
> hm
                      Var1                    Var2       value postings
1               ML & Image              ML & Image 1.000000000    14000
2             Software Dev              ML & Image 0.235714286     3300
3                Cloud Dev              ML & Image 0.178571429     2500
4  Bioinformatics & Health              ML & Image 0.071428571     1000
5               ML & Image            Software Dev 1.000000000     3300
6             Software Dev            Software Dev 1.000000000     3300
7                Cloud Dev            Software Dev 0.212121212      700
8  Bioinformatics & Health            Software Dev 0.090909091      300
9               ML & Image               Cloud Dev 0.026315789     2500
10            Software Dev               Cloud Dev 0.007368421      700
11               Cloud Dev               Cloud Dev 1.000000000    95000
12 Bioinformatics & Health               Cloud Dev 0.078947368     7500
13              ML & Image Bioinformatics & Health 0.009259259     1000
14            Software Dev Bioinformatics & Health 0.002777778      300
15               Cloud Dev Bioinformatics & Health 0.069444444     7500
16 Bioinformatics & Health Bioinformatics & Health 1.000000000   108000
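The long table above attaches the posting counts by position, which works because both the melted matrix and the concatenated columns are in column-major order. As a hedged alternative, the sketch below builds the same kind of table in base R and joins the counts on the (Var1, Var2) keys, so it does not depend on row order. The names counts, props, prop_long, count_long and hm2 are illustrative and not from the original answer.

```r
# Rebuild the counts matrix from the question (values copied from above).
fields <- c("ML & Image", "Software Dev", "Cloud Dev", "Bioinformatics & Health")
counts <- matrix(c(14000, 3300,  2500,   1000,
                    3300, 3300,   700,    300,
                    2500,  700, 95000,   7500,
                    1000,  300,  7500, 108000),
                 nrow = 4, byrow = TRUE, dimnames = list(fields, fields))

# Column-wise proportions, as in the answer.
props <- sweep(counts, 2, diag(counts), "/")

# Long format: one row per (row field, column field) pair.
prop_long  <- as.data.frame(as.table(props),  responseName = "value")
count_long <- as.data.frame(as.table(counts), responseName = "postings")

# Join on the keys rather than relying on matching row order.
hm2 <- merge(prop_long, count_long, by = c("Var1", "Var2"))
```

The merged table may come back in a different row order than the one printed above, which does not matter for plotting since ggplot2 positions tiles by the Var1 and Var2 values.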
# Plot it
> ggplot(hm, aes(x=Var1, y=Var2)) +
    geom_tile(aes(fill=value)) +
    scale_fill_gradientn(colours=c("red","yellow","green")) +
    geom_text(aes(label=postings))
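Which results in a heatmap of the 16 field pairs, with each tile filled from red (low proportion) through yellow to green (high proportion) and labeled with its posting count.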
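Since the question asks for a plot that reads like the printed table, here is one possible variation on the plot code. It is a sketch under a few assumptions: matrix columns go on the x axis, the row order is reversed so the first row is drawn at the top, and the fill scale is pinned to the 0 to 1 range. The names counts, props and p are illustrative; the continuous fill scale gets a colorbar legend by default.

```r
library(ggplot2)

# Rebuild the counts matrix from the question (values copied from above).
fields <- c("ML & Image", "Software Dev", "Cloud Dev", "Bioinformatics & Health")
counts <- matrix(c(14000, 3300,  2500,   1000,
                    3300, 3300,   700,    300,
                    2500,  700, 95000,   7500,
                    1000,  300,  7500, 108000),
                 nrow = 4, byrow = TRUE, dimnames = list(fields, fields))
props <- sweep(counts, 2, diag(counts), "/")

# Long format; as.table() and as.vector() both use column-major order.
hm <- as.data.frame(as.table(props), responseName = "value")
hm$postings <- as.vector(counts)

# Reverse the row factor so the first table row ends up at the top of the plot.
hm$Var1 <- factor(hm$Var1, levels = rev(fields))

p <- ggplot(hm, aes(x = Var2, y = Var1, fill = value)) +
  geom_tile(colour = "white") +
  geom_text(aes(label = postings)) +
  scale_fill_gradientn(colours = c("red", "yellow", "green"),
                       limits = c(0, 1), name = "Proportion") +
  scale_x_discrete(position = "top") +
  labs(x = NULL, y = NULL)
p
```

If the yellow band should sit at a particular proportion rather than the midpoint, the values argument of scale_fill_gradientn() can be used to place the color stops.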