I have a corpus of many documents containing long texts. I want to tokenize this corpus for further analysis; however, the texts contain irrelevant data within parentheses (typically references, such as "(example example)"), so I want to delete them. I have found methods on Stack Overflow for text objects, but I don't know how to apply them to a corpus (would the words between the parentheses be treated as independent tokens and therefore not be removed by the regex?). I've figured out that I should do this before removing punctuation, as the latter also removes the parentheses.
Could you help me with this? Thank you in advance!
I only got as far as this regex: "\\( . \\)"
You can remove all text in brackets using gsub(). As you plan to remove punctuation in a later step, you can replace each match with ".", just to indicate where something was taken out (useful if you need to debug the pipeline), or you can replace it with an empty string "".
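Your regex would not work as written: the spaces inside the pattern are matched literally, and a bare . matches exactly one character. You need to escape the brackets with double backslashes (as you already do) and match multiple, but as few as possible, characters between them. For that, use the lazy quantifier .*? for the contents of the brackets: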
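# Two sample documents containing parenthesised asides
corp <- c("This is an example (or demonstration) of replacing things in brackets",
          "Just use gsub (a function in base) to remove (or better replace) these elements")
# Lazily match from each "(" to the nearest ")" and replace the match with "."
corp <- gsub("\\(.*?\\)", ".", corp)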
The example above would result in the vector:
> corp
[1] "This is an example . of replacing things in brackets"
[2] "Just use gsub . to remove . these elements"
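To see why the lazy quantifier matters, here is a small sketch comparing a greedy pattern, the lazy one, and a negated character class that behaves the same way for non-nested brackets. The variable names are only illustrative, and the last pattern (which also swallows the whitespace before the bracket) is an optional variation, not something the pipeline requires.

```r
x <- "Just use gsub (a function in base) to remove (or better replace) these elements"

# Greedy: .* runs from the first "(" to the LAST ")", deleting text in between
greedy <- gsub("\\(.*\\)", "", x)

# Lazy: .*? stops at the nearest ")", so each bracketed group is removed separately
lazy <- gsub("\\(.*?\\)", "", x)

# Negated class: "anything except )" gives the same result without a lazy quantifier
negated <- gsub("\\([^)]*\\)", "", x)

# Optional: also drop the whitespace before each bracket to avoid double spaces
tidy <- gsub("\\s*\\([^)]*\\)", "", x)

print(greedy)  # "Just use gsub  these elements"
print(lazy)    # "Just use gsub  to remove  these elements"
print(tidy)    # "Just use gsub to remove these elements"
```

None of these patterns handle nested brackets such as "(a (b) c)"; if your references can nest, the pattern would need to be adapted.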
Depending on the package you use for your corpus, you can do this with the character vector before converting it to a corpus, or you can use specific mapping functions (e.g. tm_map() in tm) to apply it to all texts.
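As a rough sketch of both options, assuming the corpus is built with the tm package (the question does not say which package is used): strip_parens is a hypothetical helper name, and the tm part only runs if tm is installed. Wrapping the helper in content_transformer() lets tm_map() apply it to the text of every document, and it runs before removePunctuation() so the brackets are still there to match.

```r
# Hypothetical helper: remove bracketed text plus the whitespace before it
strip_parens <- function(x) gsub("\\s*\\([^)]*\\)", "", x)

texts <- c("This is an example (or demonstration) of replacing things in brackets",
           "Just use gsub (a function in base) to remove (or better replace) these elements")

# Option 1: clean the character vector before building the corpus
cleaned <- strip_parens(texts)
print(cleaned)

# Option 2: apply the same helper to an existing tm corpus (assumes tm is installed)
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(texts))
  corpus <- tm::tm_map(corpus, tm::content_transformer(strip_parens))
  # Punctuation removal comes afterwards, once the brackets are gone
  corpus <- tm::tm_map(corpus, tm::removePunctuation)
  print(as.character(corpus[[1]]))
}
```

Other corpus packages have their own ways of applying a function to every text, so option 1 is usually the simplest if you still have the raw character vector.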