繁体   English   中英

R中的文本提取与stringi包

[英]Text Extraction in R with stringi package

我有下面的文字,需要在特定单词之前和之后提取特定单词

例:

sometext <- "about us, close, products & services, focus, close, research & development, topics, carbon fiber reinforced thermoplastic, separators for lithium ion batteries, close, for investors, close, jobs & careers, close, \nselect language\n\n, home > corporate social responsibility > \nsocial report\n >  quality assurance\n, \nensuring provision of safe products, \nthe teijin group resin & plastic processing business unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold a major share in growing asian markets. these products are widely used in applications such as automotive components, office automation equipment and optical discs (blu-ray, dvd). customers include automotive manufacturers, electronic equipment manufacturers and related mold companies. customer data is organized into a database as groundwork to actively promote efforts to enhance customer satisfaction., \nin accordance with iso 9001 (8-4, 8-2), the regular implementation of"
library(stringi)
stri_extract_all_fixed(sometext , c('engineering plastics', 'iso 9001','office automation'), case_insensitive=TRUE, overlap=TRUE)

实际输出如下

[[1]]
[1] "engineering plastics"

[[2]]
[1] "iso 9001"

[[3]]
[1] "office automation"

所需输出:

[1] globally expanding its engineering plastics centered on polycarbonate resin
[2] accordance with iso 9001 (8-4, 8-2), the regular implementation of

基本上需要在我提到的具体单词之前和之后提取文本

这是一个开头的想法:

sometext <- "about us, close, products & services, focus, close, research & development, topics, carbon fiber reinforced thermoplastic, separators for lithium ion batteries, close, for investors, close, jobs & careers, close, \nselect language\n\n, home > corporate social responsibility > \nsocial report\n >  quality assurance\n, \nensuring provision of safe products, \nthe teijin group resin & plastic processing business unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold a major share in growing asian markets. these products are widely used in applications such as automotive components, office automation equipment and optical discs (blu-ray, dvd). customers include automotive manufacturers, electronic equipment manufacturers and related mold companies. customer data is organized into a database as groundwork to actively promote efforts to enhance customer satisfaction., \nin accordance with iso 9001 (8-4, 8-2), the regular implementation of"
library(stringi)
words <- c('engineering plastics', 'iso 9001','office automation')
pattern <- stri_paste("([^ ]+ ){0,10}", words, "([^ ]+ ){0,10}")
stri_extract_all_regex(sometext , pattern, case_insensitive=TRUE, overlap=TRUE)

说明:我在您想要的单词之前和之后添加简单的正则表达式:

"([^ ]+ ){0,10}"

意思是:

  1. 除了空间之外的任何东西,尽可能多地重复
  2. 空间
  3. 所有这一切多达十次

这非常简单和天真(例如它将所有'&'或'>'视为单词)但是有效。

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM