
How to create heatmap in R with proportion and numeric value

I have a dataframe of the number of national postings for several areas of tech/biotech along with the number of posts which coincide with other areas. I wish to create a heatmap showing the intersections (in number of postings) of these fields along with the proportion of these "duplicates." That is, the dataframe itself looks similar to:

df <- data.frame(matrix(nrow=4, byrow=TRUE, data=c(14000, 3300, 
2500, 1000, 3300, 3300, 700, 300, 2500, 700, 95000,7500, 1000, 300, 7500, 108000)))

colnames(df) <- rownames(df) <- c("ML & Image", "Software Dev", "Cloud Dev", "Bioinformatics & Health")

So for example, the first line would start with the total number of ML & Image job postings, followed by the number of ML & Image job postings that also satisfied the conditions to be software developers, followed by the number of ML & Image job postings that satisfied the conditions to be Cloud Developers, etc.

I would like a heatmap laid out like the df table as it prints in the R console, keeping the numeric posting counts in each cell but coloring each cell by the proportion of overlap between the fields. A cell would be red(ish) if there is little overlap, yellow(ish) if the overlap is around 30-60%, and green(ish) if there is a lot of overlap, with a colorbar on the side for reference.

Any help on this is much appreciated. Thanks!

I'm not sure I completely understood what you were asking, but the following might give you some ideas.

> library(ggplot2)
> library(reshape2)

# Set up the data

> df <- data.frame(matrix(nrow=4, byrow=TRUE, data=c(14000, 3300, 2500, 1000, 3300, 3300, 700, 300, 2500, 700, 95000,7500, 1000, 300, 7500, 108000)))
> colnames(df) <- rownames(df) <- c("ML & Image", "Software Dev", "Cloud Dev", "Bioinformatics & Health")

> df
                        ML & Image Software Dev Cloud Dev Bioinformatics & Health
ML & Image                   14000         3300      2500                    1000
Software Dev                  3300         3300       700                     300
Cloud Dev                     2500          700     95000                    7500
Bioinformatics & Health       1000          300      7500                  108000

# Convert df to matrix and divide each column by the diagonal value

> m <- data.matrix(df)
> m <- m / matrix(t(colSums(diag(4) * m)), nrow=4, ncol=4, byrow=TRUE)

> m
                        ML & Image Software Dev   Cloud Dev Bioinformatics & Health
ML & Image              1.00000000   1.00000000 0.026315789             0.009259259
Software Dev            0.23571429   1.00000000 0.007368421             0.002777778
Cloud Dev               0.17857143   0.21212121 1.000000000             0.069444444
Bioinformatics & Health 0.07142857   0.09090909 0.078947368             1.000000000
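
The normalization step above can also be written with base R's sweep(), which may be easier to read. This is only a sketch, not the answerer's code: the names fields, counts, prop_by_col and prop_by_row are made up for illustration. It also shows the row-wise variant, in case the proportion should be relative to the row's field rather than the column's field.

```r
# Counts from the question; the diagonal holds each field's total postings.
fields <- c("ML & Image", "Software Dev", "Cloud Dev", "Bioinformatics & Health")
counts <- matrix(c(14000, 3300,  2500,   1000,
                    3300, 3300,   700,    300,
                    2500,  700, 95000,   7500,
                    1000,  300,  7500, 108000),
                 nrow = 4, byrow = TRUE, dimnames = list(fields, fields))

# Column-wise: cell [i, j] is divided by the total of field j,
# which is the same normalization as in the answer.
prop_by_col <- sweep(counts, 2, diag(counts), "/")

# Row-wise: cell [i, j] is divided by the total of field i.
prop_by_row <- sweep(counts, 1, diag(counts), "/")

print(round(prop_by_col, 3))
```

Because the count matrix is symmetric, the row-wise result is simply the transpose of the column-wise one, so the choice only affects which axis of the heatmap carries the denominator.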

# Prepare the data for ggplot2: melt the matrix into long format and
# add the posting counts back in to be used as labels

> hm <- melt(m)
> hm$postings <- c(df[,1],df[,2],df[,3],df[,4])

> hm
                      Var1                    Var2       value postings
1               ML & Image              ML & Image 1.000000000    14000
2             Software Dev              ML & Image 0.235714286     3300
3                Cloud Dev              ML & Image 0.178571429     2500
4  Bioinformatics & Health              ML & Image 0.071428571     1000
5               ML & Image            Software Dev 1.000000000     3300
6             Software Dev            Software Dev 1.000000000     3300
7                Cloud Dev            Software Dev 0.212121212      700
8  Bioinformatics & Health            Software Dev 0.090909091      300
9               ML & Image               Cloud Dev 0.026315789     2500
10            Software Dev               Cloud Dev 0.007368421      700
11               Cloud Dev               Cloud Dev 1.000000000    95000
12 Bioinformatics & Health               Cloud Dev 0.078947368     7500
13              ML & Image Bioinformatics & Health 0.009259259     1000
14            Software Dev Bioinformatics & Health 0.002777778      300
15               Cloud Dev Bioinformatics & Health 0.069444444     7500
16 Bioinformatics & Health Bioinformatics & Health 1.000000000   108000
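
If reshape2 is not available, the same long table can probably be built with base R alone. The sketch below assumes the same data as the question; the names fields, counts and prop are illustrative and not part of the original answer. It relies on as.data.frame() of a table listing cells in column-major order, which is the same order as.vector() uses for the counts.

```r
fields <- c("ML & Image", "Software Dev", "Cloud Dev", "Bioinformatics & Health")
counts <- matrix(c(14000, 3300,  2500,   1000,
                    3300, 3300,   700,    300,
                    2500,  700, 95000,   7500,
                    1000,  300,  7500, 108000),
                 nrow = 4, byrow = TRUE, dimnames = list(fields, fields))

# Proportion of each cell relative to the total of its column's field.
prop <- sweep(counts, 2, diag(counts), "/")

# One row per cell: Var1 is the row field, Var2 the column field,
# and value is the proportion.
hm <- as.data.frame(as.table(prop), responseName = "value")

# Attach the raw counts in the same column-major order, for use as labels.
hm$postings <- as.vector(counts)

print(head(hm, 4))
```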

# Plot it

> ggplot(hm, aes(x=Var1, y=Var2)) +
        geom_tile(aes(fill=value)) +
        scale_fill_gradientn(colours=c("red","yellow","green")) +
        geom_text(aes(label=postings))

Which results in:

[Image: the resulting heatmap]
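
To get closer to the table-like layout the question describes, one option is to put the row field on the y axis with the first field at the top, pin the color scale to the 0-1 range so the colorbar is comparable across datasets, and format the labels with thousands separators. This is a hedged variation on the plot above, not the answerer's code; the names fields, counts, prop and p are illustrative, and the styling choices (white tile borders, theme_minimal()) are just one possibility.

```r
library(ggplot2)

fields <- c("ML & Image", "Software Dev", "Cloud Dev", "Bioinformatics & Health")
counts <- matrix(c(14000, 3300,  2500,   1000,
                    3300, 3300,   700,    300,
                    2500,  700, 95000,   7500,
                    1000,  300,  7500, 108000),
                 nrow = 4, byrow = TRUE, dimnames = list(fields, fields))

# Long table: Var1 = row field, Var2 = column field, value = proportion.
prop <- sweep(counts, 2, diag(counts), "/")
hm <- as.data.frame(as.table(prop), responseName = "value")
hm$postings <- as.vector(counts)

# Reverse the row order so the first field is drawn at the top, like a printed table.
hm$Var1 <- factor(hm$Var1, levels = rev(fields))

p <- ggplot(hm, aes(x = Var2, y = Var1)) +
  geom_tile(aes(fill = value), colour = "white") +
  geom_text(aes(label = format(postings, big.mark = ",", trim = TRUE))) +
  scale_fill_gradientn(colours = c("red", "yellow", "green"),
                       limits = c(0, 1), name = "Proportion") +
  scale_x_discrete(position = "top") +
  labs(x = NULL, y = NULL) +
  theme_minimal()

print(p)
```

Since most off-diagonal proportions here are well below 0.3, nearly all of those cells will come out red on a fixed 0-1 scale; dropping the limits argument lets the gradient stretch over the observed range instead.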
