
Removing text contained in brackets/parentheses from corpus (R)

I have a corpus of many documents containing long texts. I want to tokenize this corpus for further analysis. However, the texts contain irrelevant data within parentheses (typically references, such as "(example example)"), so I want to delete them. I have found methods on Stack Overflow for text objects, but I don't know how to apply them to a corpus (would the words between the parentheses be treated as independent tokens and so not be removed by a regex?). I've figured out that I should do this before removing punctuation, as the latter also removes the parentheses.

Could you help me with this? Thank you in advance!

So far I have only come up with this regex: "\\( . \\)"

You can remove all text in brackets using gsub(). As you plan to remove the punctuation in a later step, you can replace each bracketed span with ".", just to mark where something was taken out (useful if you need to debug the pipeline), or you can replace it with an empty string "".

Your regex would not work as written. The brackets are correctly escaped with double backslashes, but " . " between them matches only a space, one arbitrary character, and another space. You want to match multiple characters, but as few as possible, so use the lazy quantifier .*? for the contents of the brackets:

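# Two sample texts with parenthesised asides
corp <- c("This is an example (or demonstration) of replacing things in brackets",
          "Just use gsub (a function in base) to remove (or better replace) these elements")

# Replace each bracketed span (lazy match) with "." as a placeholder
corp <- gsub("\\(.*?\\)", ".", corp)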

The example above would result in the vector:

> corp
[1] "This is an example . of replacing things in brackets"
[2] "Just use gsub . to remove . these elements"     
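
To see why the lazy quantifier matters, the following sketch compares three patterns on the second sample sentence: the pattern from the question, a greedy `.*`, and the lazy `.*?`. The variable names are only illustrative, and the replacement here is an empty string rather than ".".

```r
x <- "Just use gsub (a function in base) to remove (or better replace) these elements"

# Pattern from the question: matches only "(", a space, one character,
# a space, ")". Nothing in x has that shape, so x comes back unchanged.
unchanged <- gsub("\\( . \\)", "", x)

# Greedy: .* runs from the first "(" to the last ")", so the words
# "to remove" between the two bracketed spans are deleted as well.
greedy <- gsub("\\(.*\\)", "", x)

# Lazy: .*? stops at the nearest ")", so each bracketed span is
# removed on its own and the text in between survives.
lazy <- gsub("\\(.*?\\)", "", x)

print(unchanged)
print(greedy)
print(lazy)
```

Note that replacing with an empty string leaves double spaces where the brackets used to be; these can be collapsed in a later whitespace-cleaning step if they matter for tokenization.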

Depending on the package you use for your corpus, you can either do this on the character vector before converting it to a corpus, or use the package's mapping function (e.g. tm_map() in tm) to apply it to all texts.
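
Both routes might look roughly like the sketch below. The helper name `strip_parens` is hypothetical, and its pattern is a variation on the one above: it uses a negated character class instead of the lazy quantifier and also swallows the whitespace before the bracket. It assumes parentheses are not nested (with nesting, only the innermost pair would be removed per pass). The tm part runs only if tm is installed, and wraps the helper in `content_transformer()` so that `tm_map()` applies it to each document's text.

```r
# Hypothetical helper: drop "(...)" together with any whitespace before it.
# [^()]* matches anything except parentheses, so a match cannot run past ")".
strip_parens <- function(x) gsub("\\s*\\([^()]*\\)", "", x)

texts <- c("This is an example (or demonstration) of replacing things in brackets",
           "Just use gsub (a function in base) to remove (or better replace) these elements")

# Route 1: clean the character vector before building any corpus from it.
cleaned <- strip_parens(texts)
print(cleaned)

# Route 2: clean an existing tm corpus (skipped if tm is not installed).
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(texts))
  corpus <- tm::tm_map(corpus, tm::content_transformer(strip_parens))
  print(as.character(corpus[[1]]))
}
```

Either way, the bracket removal should come before any punctuation-removal step, since once the parentheses are gone the pattern has nothing left to match.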
