I have an ebook text file named Frankenstein.txt
and I would like to know how many times each letter is used in the novel.
My Setup:
I imported the text file like this, in order to get a vector of characters, character_array:
string <- readChar("Frankenstein.txt", file.size("Frankenstein.txt"))
character_array <- unlist(strsplit(string, ""))
character_array
gives me something like this.
"F" "r" "a" "n" "k" "e" "n" "s" "t" "e" "i" "n" "\r", ...
My Goal:
I would like to count how many times each character appears in the text file. In other words, I would like a count for each element of unique(character_array):
[1] "F" "r" "a" "n" "k" "e" "s" "t" "i" "\r" "\n" "b" "y" "M"
[15] " " "W" "o" "l" "c" "f" "(" "G" "d" "w" ")" "S" "h" "C"
[29] "O" "N" "T" "E" "L" "1" "2" "3" "4" "p" "5" "6" "7" "8"
[43] "9" "0" "_" "." "v" "," "g" "P" "u" "D" "—" "Y" "j" "m"
[57] "I" "z" "?" ";" "x" "q" "B" "U" "’" "H" "-" "A" "!" ":"
[71] "R" "J" "“" "”" "æ" "V" "K" "[" "]" "‘" "ê" "ô" "é" "è"
My Attempt:
When I call plot(as.factor(character_array))
I get a nice graph which gives me what I want visually. However, I need the exact values for each of these characters. I would like something like a 2D array:
[,1] [,2] [,3] [,4] ...
[1,] "a" "A" "b" "B" ...
[2,] "1202" "50" "12" "9" ...
One nice way to build this kind of text-processing pipeline is with magrittr's %>% pipes. Here is one approach, assuming that your text is in "frank.txt" (see the bottom for an explanation of each step):
library(magrittr)
# read the text in
frank_txt <- readLines("frank.txt")
# then send the text down this pipeline:
frank_txt %>%
  paste(collapse="") %>%
  strsplit(split="") %>% unlist %>%
  `[`(!. %in% c("", " ", ".", ",")) %>%
  table %>%
  barplot
Note that you can just stop at the table() step and assign the result to a variable, which you can then manipulate however you want, e.g. by plotting it:
char_counts <- frank_txt %>%
  paste(collapse="") %>%
  strsplit(split="") %>% unlist %>%
  `[`(!. %in% c("", " ", ".", ",")) %>%
  table
barplot(char_counts)
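The two-row layout asked for in the question can be built directly from such a table. The following is a minimal sketch, assuming a short sample string in place of the novel; sample_chars, sample_counts and count_matrix are illustrative names, not part of the original code.

```r
# Hypothetical sample input standing in for the novel's characters
sample_chars <- unlist(strsplit("Frankenstein", ""))

# Count each distinct character
sample_counts <- table(sample_chars)

# Two-row character matrix: characters on top, counts underneath
# (rbind coerces the counts to strings, as in the question's example)
count_matrix <- rbind(names(sample_counts), as.vector(sample_counts))
count_matrix
```

Because the matrix holds strings, the counts would need as.integer() again before any arithmetic, which is one reason the table or a data frame is usually easier to work with.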
You can also convert the table into a data frame for easier manipulation/plotting later:
counts_df <- data.frame(
char = names(char_counts),
count = as.numeric(char_counts),
stringsAsFactors=FALSE)
head(counts_df)
## char count
## a 13
## b 2
## c 7
## d 5
## e 24
## f 6
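Once the counts are in a data frame, sorting and lookups are ordinary data frame operations. A small sketch, assuming a hand-built data frame with the same columns and the six sample rows shown above (demo_df, demo_sorted and e_count are illustrative names):

```r
# Hypothetical small data frame in the same shape as counts_df
demo_df <- data.frame(
  char = c("a", "b", "c", "d", "e", "f"),
  count = c(13, 2, 7, 5, 24, 6),
  stringsAsFactors = FALSE)

# Order rows from most to least frequent
demo_sorted <- demo_df[order(demo_df$count, decreasing = TRUE), ]

# Look up the count for a single character
e_count <- demo_sorted$count[demo_sorted$char == "e"]
```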
Each step explained: here is the full pipe chain, with a comment on each step:
# going to send this text down a pipeline:
frank_txt %>%
  # combine lines into a single string (makes things easier downstream)
  paste(collapse="") %>%
  # tokenize by character (strsplit returns a list, so unlist it)
  strsplit(split="") %>% unlist %>%
  # remove instances of characters you don't care about
  `[`(!. %in% c("", " ", ".", ",")) %>%
  # make a frequency table of the characters
  table %>%
  # then plot them
  barplot
Note that this is exactly equivalent to the following horrendous ("monstrous"?!) code. The forward pipe %>% just applies the function on its right to the value on its left, and . is a pronoun referring to the value on the left (see the magrittr intro vignette):
barplot(table(
unlist(strsplit(paste(frank_txt, collapse=""), split=""))[
!unlist(strsplit(paste(frank_txt, collapse=""), split="")) %in%
c(""," ",".",",")]))
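To see the equivalence on something small enough to check by hand, here is a sketch that assumes a six-element toy vector (letters_vec is a made-up name) and compares the nested and piped forms. It needs the magrittr package installed.

```r
library(magrittr)

# Hypothetical toy input: three "a", one "b", one "c", one space
letters_vec <- c("a", "b", "a", " ", "c", "a")

# Nested form: read from the inside out
nested <- table(letters_vec[!letters_vec %in% " "])

# Piped form: read top to bottom; `.` stands for the value coming from the left
piped <- letters_vec %>%
  `[`(!. %in% " ") %>%
  table

# Both forms yield the same characters and the same counts
same_counts <- all(as.vector(nested) == as.vector(piped)) &&
  identical(names(nested), names(piped))
```

The two tables can differ in the label R attaches to the table's dimension, which is why the comparison looks at the names and counts rather than using identical() on the whole objects.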
Using gutenbergr, tidytext and dplyr you can do the following:
library(gutenbergr)
library(tidytext)
library(dplyr)
frank <- gutenberg_download(c(84), meta_fields = "title")
The "characters" tokenizer removes unneeded characters such as ., [ and ] (and converts everything to lower case):
frank %>%
  unnest_tokens(chars, text, "characters") %>%
  group_by(chars) %>%
  summarise(n = n()) %>%
  t() #transpose to get in order of OP
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
chars "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "a" "b" "c" "d" "e" "f"
n " 2" " 35" " 15" " 6" " 4" " 4" " 3" " 16" " 5" " 4" "25733" " 4749" " 8644" "16327" "44210" " 8341"
[,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32]
chars "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v"
n " 5564" "19194" "23483" " 413" " 1617" "12239" "10237" "23306" "23886" " 5672" " 313" "19647" "20380" "28835" " 9897" " 3717"
[,33] [,34] [,35] [,36]
chars "w" "x" "y" "z"
n " 7364" " 649" " 7578" " 239"
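For readers without tidytext installed, the visible effect of this step (lower-casing, then keeping only letters and digits) can be approximated in base R. This is a rough sketch under that assumption, not the tidytext implementation, and sample_text is a made-up two-line input rather than the downloaded book.

```r
# Hypothetical sample lines standing in for the downloaded text
sample_text <- c("Frankenstein;", "or, the Modern Prometheus (1818)")

# Split into single characters and fold to lower case
chars <- tolower(unlist(strsplit(sample_text, "")))

# Keep only letters and digits, dropping spaces and punctuation
chars <- chars[grepl("^[[:alnum:]]$", chars)]

# Frequency table of the remaining characters
char_table <- table(chars)
char_table
```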
If you want to keep punctuation and whitespace as well, pass your own splitting function as the tokenizer:
frank %>%
  unnest_tokens(chars, text, stringr::str_split, pattern = "") %>%
  group_by(chars) %>%
  summarise(n = n()) %>%
  t() #transpose to get in order of OP
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
chars "'" "-" " " "!" "\"" "(" ")" "," "." ":" ";" "?" "[" "]" "_" "0"
n " 221" " 370" "71202" " 238" " 774" " 16" " 16" " 4945" " 2904" " 48" " 970" " 220" " 3" " 3" " 2" " 2"
[,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32]
chars "1" "2" "3" "4" "5" "6" "7" "8" "9" "a" "b" "c" "d" "e" "f" "g"
n " 35" " 15" " 6" " 4" " 4" " 3" " 16" " 5" " 4" "25733" " 4749" " 8644" "16327" "44210" " 8341" " 5564"
[,33] [,34] [,35] [,36] [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48]
chars "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w"
n "19194" "23483" " 413" " 1617" "12239" "10237" "23306" "23886" " 5672" " 313" "19647" "20380" "28835" " 9897" " 3717" " 7364"
[,49] [,50] [,51]
chars "x" "y" "z"
n " 649" " 7578" " 239"