Removing Special Characters in a Text File in R

I'm reading a text file in R with the readLines function and using regexes to extract words from it. The file wraps words in special characters to indicate formatting (a # sign before and after a word shows it is bolded, and an @ before and after a word shows it should be italicized), and these markers are messing up my regexes.

So far this is my R code, which removes all empty lines and then combines the text file into a single string:

    book <- readLines("/Users/Desktop/SAMPLE .txt", encoding = "UTF-8")
    # remove all empty (or whitespace-only) lines
    empty_lines <- grepl("^\\s*$", book)
    book <- book[!empty_lines]
    # combine the book into one string; join with a space so the last word
    # of one line does not fuse with the first word of the next
    xBook <- paste(book, collapse = " ")
    # collapse runs of whitespace to a single space and trim both ends
    updated <- trimws(gsub("\\s+", " ", xBook))
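
To try the cleaning steps without the file, a small in-memory vector can stand in for what readLines returns. This is only a sketch: the sample lines are made up for illustration, and the lines are joined with a space (an assumption about the intended result) so that words at line boundaries stay separate.

```r
# Hypothetical stand-in for the character vector readLines() would return
book <- c("It is a truth   universally", "", "   ",
          "acknowledged, that a #single# man")

# Drop empty and whitespace-only lines
book <- book[!grepl("^\\s*$", book)]

# Join the remaining lines with a space, then normalise whitespace
xBook <- paste(book, collapse = " ")
updated <- trimws(gsub("\\s+", " ", xBook))

print(updated)
```

The # markers are still present at this point; only the line structure and spacing have been cleaned up.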

When I run updated, I see the entire file stored in the variable, but with the special characters still present:

updated [1] "It is a truth universally acknowledged, that a #single# man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a @man@ may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, @that@ he is considered the rightful property of some one or other of #their# daughters.

How can I remove all the leading and trailing # or @ characters from the words in my updated variable?

My desired output is just the plain text, with no indication of which words should be bolded or italicized:

updated [1] "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.

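Answer: match a marker, capture the letters between the markers, and replace the whole match with just the captured word (here x is the string to clean, e.g. updated):

    gsub("[@#]([a-zA-Z]+)[@#]", "\\1", x)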
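
A short sketch of how this pattern behaves on a shortened version of the sample text, plus two variants that are not part of the original answer and are offered only as assumptions about what might be wanted: one that requires the same marker on both sides (using a backreference, with perl = TRUE), and one that simply deletes every # and @ wherever it appears.

```r
# Shortened sample based on the text in the question
x <- "a #single# man, such a @man@, @that@ he, #their# daughters."

# Pattern from the answer: marker, letters (captured), marker
plain1 <- gsub("[@#]([a-zA-Z]+)[@#]", "\\1", x)

# Variant (assumption): the closing marker must equal the opening one.
# Group 1 is the marker, group 2 is the word that is kept.
plain2 <- gsub("([@#])([[:alpha:]]+)\\1", "\\2", x, perl = TRUE)

# Variant (assumption): remove every # and @, regardless of position
plain3 <- gsub("[@#]", "", x)

print(plain1)
```

On this sample all three give the same result. They differ on other input: the first two leave a stray marker such as the # in "item #3" alone, while the last one removes it, and only the last one handles marked phrases containing spaces, digits, or punctuation.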
