Character Frequency from a Vector in R

Question

I have an ebook text file named Frankenstein.txt, and I would like to know how many times each letter is used in the novel.

My Setup:

I imported the text file like this in order to get a vector of characters, character_array:

filesize <- file.info("Frankenstein.txt")$size  # number of bytes to read
string <- readChar("Frankenstein.txt", filesize)
character_array <- unlist(strsplit(string, ""))

character_array gives me something like this.

 "F" "r" "a" "n" "k" "e" "n" "s" "t" "e" "i" "n" "\r", ...

My Goal:

I would like to count how many times each character appears in the text file. In other words, I would like a count for each element of unique(character_array):

 [1] "F"  "r"  "a"  "n"  "k"  "e"  "s"  "t"  "i"  "\r" "\n" "b"  "y"  "M" 
 [15] " "  "W"  "o"  "l"  "c"  "f"  "("  "G"  "d"  "w"  ")"  "S"  "h"  "C" 
 [29] "O"  "N"  "T"  "E"  "L"  "1"  "2"  "3"  "4"  "p"  "5"  "6"  "7"  "8" 
 [43] "9"  "0"  "_"  "."  "v"  ","  "g"  "P"  "u"  "D"  "—"  "Y"  "j"  "m" 
 [57] "I"  "z"  "?"  ";"  "x"  "q"  "B"  "U"  "’"  "H"  "-"  "A"  "!"  ":" 
 [71] "R"  "J"  "“"  "”"  "æ"  "V"  "K"  "["  "]"  "‘"  "ê"  "ô"  "é"  "è"

My Attempt:

When I call plot(as.factor(character_array)), I get a nice graph that shows what I want visually. However, I need the exact counts for each of these characters. I would like something like a 2D array:

    [,1]   [,2] [,3] [,4] ... 
[1,] "a"    "A"  "b"  "B" ...
[2,] "1202" "50" "12" "9" ...

Answer 1

One nice way to build this kind of text-processing pipeline is with the %>% pipe from magrittr. Here is one approach, assuming that your text is in "frank.txt" (see the bottom for an explanation of each step):

library(magrittr)

# read the text in 
frank_txt <- readLines("frank.txt")

# then send the text down this pipeline:
frank_txt %>% 
  paste(collapse="") %>% 
  strsplit(split="") %>% unlist %>% 
  `[`(!. %in% c("", " ", ".", ",")) %>% 
  table %>% 
  barplot

Note that you can stop at table() and assign the result to a variable, which you can then manipulate however you want, e.g. by plotting it:

char_counts <- frank_txt %>% paste(collapse="") %>% 
  strsplit(split="") %>% unlist %>% `[`(!. %in% c("", " ", ".", ",")) %>%
  table

barplot(char_counts)
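
Since the question asks for exact values rather than a plot, it may help to see how the table can be queried directly. The following is a minimal base R sketch; `toy_text` is a made-up stand-in for the novel, not the real file:

```r
# toy_text is a hypothetical stand-in for the full text of the novel
toy_text <- "Frankenstein; or, the Modern Prometheus"
toy_chars <- unlist(strsplit(toy_text, split = ""))

# table() counts every distinct character (case-sensitively)
toy_counts <- table(toy_chars)

# look up the count for a single character by name
toy_counts[["e"]]

# show the most frequent characters first
head(sort(toy_counts, decreasing = TRUE), 3)
```

The same indexing and sorting should work on char_counts from the pipeline above, since it is also the result of table().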

You can also convert the table into a data frame for easier manipulation/plotting later:

counts_df <- data.frame(
  char = names(char_counts), 
  count = as.numeric(char_counts), 
  stringsAsFactors=FALSE)

head(counts_df)
## char count
##   a    13
##   b     2
##   c     7
##   d     5
##   e    24
##   f     6
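
If you prefer the two-row layout sketched in the question (characters in the first row, counts in the second), one possible way is to bind the names and the counts together with rbind(). This is only a sketch on a made-up word; `toy_chars`, `toy_counts` and `counts_matrix` are illustrative names:

```r
# "abracadabra" is a made-up input used only for illustration
toy_chars <- unlist(strsplit("abracadabra", split = ""))
toy_counts <- table(toy_chars)

# rbind() on a character vector and an integer vector yields a character
# matrix: row 1 holds the characters, row 2 holds their counts as strings
counts_matrix <- rbind(names(toy_counts), as.vector(toy_counts))
counts_matrix
```

Because the result is a character matrix, the counts are stored as strings; the data frame above is usually easier to work with if you need the counts as numbers.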

Here is the full pipe chain with each step explained:

# going to send this text down a pipeline:
frank_txt %>% 
  # combine lines into a single string (makes things easier downstream)
  paste(collapse="") %>% 
  # tokenize by character (strsplit returns a list, so unlist it)
  strsplit(split="") %>% unlist %>% 
  # remove instances of characters you don't care about
  `[`(!. %in% c("", " ", ".", ",")) %>% 
  # make a frequency table of the characters
  table %>% 
  # then plot them
  barplot

Note that this is exactly equivalent to the following horrendous ("monstrous"?!) code. The forward pipe %>% just applies the function on its right to the value on its left, and . is a pronoun referring to the value on the left (see the magrittr intro vignette):

barplot(table(
  unlist(strsplit(paste(frank_txt, collapse=""), split=""))[
    !unlist(strsplit(paste(frank_txt, collapse=""), split="")) %in% 
      c(""," ",".",",")]))
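
A middle ground between the pipe chain and the nested call is to name each intermediate result. The sketch below uses base R only and skips the final barplot() step; `toy_lines` is a made-up stand-in for the output of readLines(), and the other variable names are illustrative:

```r
# toy_lines is a hypothetical stand-in for readLines("frank.txt")
toy_lines <- c("It was on a dreary night", "of November.")

# combine lines into a single string
one_string <- paste(toy_lines, collapse = "")

# tokenize by character (strsplit returns a list, so unlist it)
all_chars <- unlist(strsplit(one_string, split = ""))

# remove instances of characters you don't care about
kept_chars <- all_chars[!all_chars %in% c("", " ", ".", ",")]

# make a frequency table of the remaining characters
toy_counts <- table(kept_chars)
toy_counts
```

This also avoids computing the split twice, which the nested version has to do because it has no name for the intermediate vector.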

Answer 2

Using gutenbergr, tidytext and dplyr you can do the following:

library(gutenbergr)
library(tidytext)
library(dplyr)

frank <- gutenberg_download(c(84), meta_fields = "title")

This version removes unneeded characters like . [ ] etc. (note that unnest_tokens also lowercases the text by default):

frank %>% 
  unnest_tokens(chars, text, "characters") %>% 
  group_by(chars) %>% 
  summarise(n = n()) %>% 
  t() # transpose to match the layout the OP asked for

      [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]    [,9]    [,10]   [,11]   [,12]   [,13]   [,14]   [,15]   [,16]  
chars "0"     "1"     "2"     "3"     "4"     "5"     "6"     "7"     "8"     "9"     "a"     "b"     "c"     "d"     "e"     "f"    
n     "    2" "   35" "   15" "    6" "    4" "    4" "    3" "   16" "    5" "    4" "25733" " 4749" " 8644" "16327" "44210" " 8341"
      [,17]   [,18]   [,19]   [,20]   [,21]   [,22]   [,23]   [,24]   [,25]   [,26]   [,27]   [,28]   [,29]   [,30]   [,31]   [,32]  
chars "g"     "h"     "i"     "j"     "k"     "l"     "m"     "n"     "o"     "p"     "q"     "r"     "s"     "t"     "u"     "v"    
n     " 5564" "19194" "23483" "  413" " 1617" "12239" "10237" "23306" "23886" " 5672" "  313" "19647" "20380" "28835" " 9897" " 3717"
      [,33]   [,34]   [,35]   [,36]  
chars "w"     "x"     "y"     "z"    
n     " 7364" "  649" " 7578" "  239"

If you want to keep these characters, the code looks like this:

frank %>% 
  unnest_tokens(chars, text, stringr::str_split, pattern = "") %>% 
  group_by(chars) %>% 
  summarise(n = n()) %>% 
  t() # transpose to match the layout the OP asked for

      [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]    [,9]    [,10]   [,11]   [,12]   [,13]   [,14]   [,15]   [,16]  
chars "'"     "-"     " "     "!"     "\""    "("     ")"     ","     "."     ":"     ";"     "?"     "["     "]"     "_"     "0"    
n     "  221" "  370" "71202" "  238" "  774" "   16" "   16" " 4945" " 2904" "   48" "  970" "  220" "    3" "    3" "    2" "    2"
      [,17]   [,18]   [,19]   [,20]   [,21]   [,22]   [,23]   [,24]   [,25]   [,26]   [,27]   [,28]   [,29]   [,30]   [,31]   [,32]  
chars "1"     "2"     "3"     "4"     "5"     "6"     "7"     "8"     "9"     "a"     "b"     "c"     "d"     "e"     "f"     "g"    
n     "   35" "   15" "    6" "    4" "    4" "    3" "   16" "    5" "    4" "25733" " 4749" " 8644" "16327" "44210" " 8341" " 5564"
      [,33]   [,34]   [,35]   [,36]   [,37]   [,38]   [,39]   [,40]   [,41]   [,42]   [,43]   [,44]   [,45]   [,46]   [,47]   [,48]  
chars "h"     "i"     "j"     "k"     "l"     "m"     "n"     "o"     "p"     "q"     "r"     "s"     "t"     "u"     "v"     "w"    
n     "19194" "23483" "  413" " 1617" "12239" "10237" "23306" "23886" " 5672" "  313" "19647" "20380" "28835" " 9897" " 3717" " 7364"
      [,49]   [,50]   [,51]  
chars "x"     "y"     "z"    
n     "  649" " 7578" "  239"

solution1
4 ACCEPTED 2018-03-18 16:38:37

solution2
1 2018-03-18 17:55:30