
R text mining - remove special characters and quotes

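I am working on a text mining task in R.

Tasks:

1) Count the sentences

2) Identify the quotes and save them in a vector

Problems:

False full stops such as "..." and the periods in titles such as "Mr." have to be handled.

The text body definitely contains quotes, and there will be "..." inside them. I want to extract these quotes from the main body and save them in a vector. (Some manipulation needs to be done on them as well.)

Important note: my text data is in a Word document. I load it in R with readtext("path to .docx file"). When I view the text, the quotes appear as a plain " rather than \", unlike in the reproducible text.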

library(readtext)

path <- "C:/Users/.../"
a <- readtext(paste0(path, "Text.docx"))
title <- a$doc_id
text <- a$text

Reproducible text

text <- "Mr. and Mrs. Keyboard have two children. Keyboard Jr. and Miss. Keyboard. ... 
However, Miss. Keyboard likes being called Miss. K [Miss. Keyboard is a bit of a princess ...]
 \"Mom how are you o.k. with being called Mrs. Keyboard? I'll never get it...\". "


#  splitting by "." 
unlist(strsplit(text, "\\."))

The problem is that it also splits on the false full stops. The solution I tried:

# getting rid of . in titles 
vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")

library(gsubfn)
# replacing . in titles
gsubfn("\\S+", setNames(as.list(vec.rep), vec), text)

The problem with this is that it does not replace [Miss. with [Miss
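
A minimal base R sketch of why this happens, assuming the replacement list is only consulted when the whole match equals one of its names: \\S+ grabs every maximal run of non-whitespace characters, so the bracket stays attached to the title and the token no longer equals "Miss.". The names x and tokens are illustrative only.

```r
# Illustrative input taken from the reproducible text
x <- "[Miss. Keyboard is a bit of a princess ...]"
vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")

# \\S+ matches maximal runs of non-whitespace characters
tokens <- regmatches(x, gregexpr("\\S+", x))[[1]]

tokens[1]           # "[Miss."
tokens[1] %in% vec  # FALSE, so no replacement is found for this token
```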

Identifying quotes:

library(stringi)
stri_extract_all_regex(text, '"\\S+"')

But this does not work either. (It does work with \" in the code below.)

stri_extract_all_regex("some text \"quote\" some other text", '"\\S+"')
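
The difference between the two calls can be reproduced in base R. This is a sketch that swaps stri_extract_all_regex for regmatches/gregexpr so that stringi is not needed; the variable names are hypothetical. Because \\S+ cannot cross a space, the pattern only finds quotes that consist of a single token.

```r
pattern <- '"\\S+"'

one_word  <- "some text \"quote\" some other text"
two_words <- "some text \"a quote\" some other text"

# Extract every match of the pattern from each string
found_one <- regmatches(one_word,  gregexpr(pattern, one_word))[[1]]
found_two <- regmatches(two_words, gregexpr(pattern, two_words))[[1]]

length(found_one)  # 1: the single-token quote is found
length(found_two)  # 0: the space inside the quote blocks the match
```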

The exact expected vector is:

sentences <- c("Mr and Mrs Keyboard have two children. ", "Keyboard Jr and Miss Keyboard.", "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]", "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\"")

I want the sentences separated (so that I can count how many sentences there are in each paragraph), and the quotes separated as well.

quotes <- "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\""

You can match all the current vec values using

gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)

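That is, \\w+ matches 1 or more word characters and \\. matches a dot. Matches that are not names in the list (for example "children.") are left unchanged.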
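
For comparison, here is a sketch of the same idea in base R, for the case where gsubfn is not available: build one alternation from the title stems and drop the dot that follows. The names titles, pattern, sample_text and cleaned are hypothetical and not part of the original answer.

```r
titles <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")

# \\b(Mr|Mrs|...)\\. : a title at a word boundary followed by a literal dot
pattern <- paste0("\\b(", paste(titles, collapse = "|"), ")\\.")

sample_text <- "[Miss. Keyboard asked Mrs. Keyboard if it was o.k.]"

# Keep the captured title (\\1) and drop the dot
cleaned <- gsub(pattern, "\\1", sample_text, perl = TRUE)
cleaned
```

Other abbreviations such as "o.k." are untouched because only the listed titles are in the alternation.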

Next, if you only want to extract the quotes, use

regmatches(text, gregexpr('"[^"]*"', text))

"匹配 a "[^"]*匹配 0 个或多个字符而不是"

If you plan to match the sentences along with the quotes, you might consider

regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))

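Details

  • \\s* - 0+ whitespace characters
  • "[^"]*" - a ", then 0+ characters other than ", then a "
  • | - or
  • [^"?!.]+ - 1 or more characters other than ?, ", ! and .
  • [[:space:]?!.]+ - 1 or more whitespace, ?, ! or . characters
  • [^"[:alnum:]]* - 0+ characters other than alphanumerics and "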
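
The pieces above can be tried out on a small made-up string. This is a sketch only; x, pieces, is_quote and quotes are illustrative names. It counts the matched pieces and then picks out the ones that are wrapped in double quotes.

```r
x <- "It rained all day. Did it stop? \"Not yet...\""
pattern <- '\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*'

# Extract the pieces and strip surrounding whitespace
pieces <- trimws(regmatches(x, gregexpr(pattern, x))[[1]])

pieces
length(pieces)  # 3 pieces: two plain sentences and one quote

# A piece counts as a quote if it starts and ends with a double quote
is_quote <- grepl('^".*"$', pieces)
quotes <- pieces[is_quote]
```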

R sample code:

> vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
> vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
> library(gsubfn)
> text <- gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)
> regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))
[[1]]
[1] "Mr and Mrs Keyboard have two children. "                                                       
[2] "Keyboard Jr and Miss Keyboard. ... \n"                                                         
[3] "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]\n "
[4] "\"Mom how are you o.k. with being called Mrs Keyboard? I'll never get it...\"" 
