R text mining - remove special characters and quotes
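I am working on a text mining task in R.
1) Count sentences
2) Identify quotes and save them in a vector
False full stops such as "..." and the periods in titles such as "Mr." have to be handled.
The text body definitely contains quotes, and there will be "..." inside them. I want to extract those quotes from the main body and save them in a vector. (Some manipulation needs to be done on them too.)
Important: my text data is in a Word document. I load it into R with readtext("path to .docx file"). When I view the text, the quotes appear as just " and not as \", unlike in the reproducible text.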
path <- "C:/Users/.../"
a <- readtext(paste(path, "Text.docx", sep = ""))
title <- a$doc_id
text <- a$text
text <- "Mr. and Mrs. Keyboard have two children. Keyboard Jr. and Miss. Keyboard. ...
However, Miss. Keyboard likes being called Miss. K [Miss. Keyboard is a bit of a princess ...]
\"Mom how are you o.k. with being called Mrs. Keyboard? I'll never get it...\". "
# splitting by "."
unlist(strsplit(text, "\\."))
The problem is that it also splits on the false full stops.
The solution I tried:
# getting rid of . in titles
vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
library(gsubfn)
# replacing . in titles
gsubfn("\\S+", setNames(as.list(vec.rep), vec), text)
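The problem here is that it does not replace [Miss. with [Miss.
Identifying quotes:
library(stringi)
stri_extract_all_regex(text, '"\\S+"')
But this does not work either. (It does work with \" in the code below.)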
stri_extract_all_regex("some text \"quote\" some other text", '"\\S+"')
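Two points may be worth checking here. The snippet below is only a sketch on made-up strings (one_word, multi_word and curly are hypothetical examples, not the data from the Word file). First, \" in R source is just an escape, so the stored string holds a plain " character, which would explain why the viewer shows " only. Second, \S+ cannot cross whitespace, so '"\\S+"' can only match quotes made of a single word. The last step assumes the Word file might contain curly quotes (U+201C and U+201D), which the question does not confirm.

```r
# \" is only an escape in R source; the stored string is one " character.
escaped <- "\""
plain <- '"'
same_string <- identical(escaped, plain)
n_chars <- nchar(escaped)

# Hypothetical strings: a one-word quote and a multi-word quote.
one_word <- "some text \"quote\" some other text"
multi_word <- "He said \"hello there\" and left"

# "\\S+" means one or more non-whitespace characters, so the space inside
# the second quote prevents any match there.
pattern <- '"\\S+"'
hit_one <- regmatches(one_word, gregexpr(pattern, one_word))[[1]]
hit_multi <- regmatches(multi_word, gregexpr(pattern, multi_word))[[1]]

# Assumption: text read from Word may use curly quotes; map them to ".
curly <- "He said \u201Chello there\u201D and left"
straight <- gsub("[\u201C\u201D]", "\"", curly)

print(hit_one)
print(length(hit_multi))
print(straight)
```

If the curly-quote assumption holds for the real file, normalizing the quotes first would let a single pattern written with " cover both the Word text and the reproducible text.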
The exact expected vector is:
sentences <- c("Mr and Mrs Keyboard have two children. ", "Keyboard Jr and Miss Keyboard.", "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]", "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\"")
I would like the sentences separated (so that I can count how many sentences are in each paragraph), and the quotes separated as well.
quotes <- "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\""
You can match all the current vec values using
gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)
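That is, \w+ matches 1 or more word characters and \. matches a dot (each backslash is doubled inside the R string literal). Because [ is not a word character, the match in "[Miss." starts at the M, so the lookup finds "Miss.".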
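The \\S+ attempt fails on "[Miss." because the whole whitespace-delimited token, bracket included, is looked up in the list, and no entry is named "[Miss.". As a rough sketch of the same idea without gsubfn, the titles can be folded into one alternation and handled with base gsub. This is an alternative technique, not the method used above, and sample_text is a made-up string.

```r
# Titles without their trailing dot (same values as vec.rep above).
vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")

# Build \b(Mr|Mrs|...)\. : a whole-word title followed by a literal dot.
title_pattern <- paste0("\\b(", paste(vec.rep, collapse = "|"), ")\\.")

# Hypothetical sample; "o.k." is not a title and must stay untouched.
sample_text <- "Mr. and Mrs. Keyboard [Miss. Keyboard is o.k.]"

# "\\1" keeps the captured title and drops the dot after it.
cleaned <- gsub(title_pattern, "\\1", sample_text, perl = TRUE)
print(cleaned)
```

One design difference: the alternation only touches the listed titles, whereas \\w+\\. matches every word followed by a dot and relies on the list lookup to leave the others alone.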
Next, if you only want to extract the quotes, use
regmatches(text, gregexpr('"[^"]*"', text))
Here, " matches a literal " and [^"]* matches 0 or more characters other than ".
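A minimal sketch of that extraction on a made-up string (sample_text is hypothetical), including one possible way to strip the surrounding quote characters before further processing:

```r
# Hypothetical text containing two quoted passages.
sample_text <- "She said \"I'll never get it...\" and then \"o.k.\" quietly."

# gregexpr finds every match; regmatches returns a list with one element
# per input string, so [[1]] picks the matches for sample_text.
quotes <- regmatches(sample_text, gregexpr('"[^"]*"', sample_text))[[1]]

# Drop the leading and trailing quote characters.
quote_bodies <- gsub('^"|"$', "", quotes)

n_quotes <- length(quotes)
print(quote_bodies)
```

Because [^"]* stops at the next ", the dots and other punctuation inside a quote do not interfere with the match.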
If you plan to match the sentences together with the quotes, you may consider
regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))
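Details
\s* - 0+ whitespace characters
"[^"]*" - a ", then 0+ characters other than ", then a "
| - or
[^"?!.]+ - 1 or more characters other than ?, ", ! and .
[[:space:]?!.]+ - 1 or more whitespace, ?, ! or . characters
[^"[:alnum:]]* - 0+ characters other than alphanumeric and " characters
R sample code: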
> vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
> vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
> library(gsubfn)
> text <- gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)
> regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))
[[1]]
[1] "Mr and Mrs Keyboard have two children. "
[2] "Keyboard Jr and Miss Keyboard. ... \n"
[3] "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]\n "
[4] "\"Mom how are you o.k. with being called Mrs Keyboard? I'll never get it...\""
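Since the goal was to count sentences per paragraph, the same pattern can be applied to a vector of paragraphs and the pieces counted with lengths(). The following is only a sketch: the paragraphs vector is made up, and it assumes every sentence ends in punctuation or whitespace, because the second alternative requires at least one such character after the sentence body.

```r
# Hypothetical paragraphs; the second one ends with a quote.
paragraphs <- c(
  "One sentence here. And another one!",
  "Is this it? \"Yes, it is.\""
)

# Same pattern as above: a whole quote, or a sentence up to its punctuation.
pattern <- '\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*'

# One list element per paragraph, each holding that paragraph's pieces.
pieces <- regmatches(paragraphs, gregexpr(pattern, paragraphs))

# Number of pieces (sentences and quotes) in each paragraph.
n_pieces <- lengths(pieces)

# Flag which pieces of the second paragraph are quotes.
is_quote <- grepl('^\\s*"', pieces[[2]])

print(n_pieces)
print(is_quote)
```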