Extract "words" from a string

I have a table with 153 rows and 9 columns. I am interested in the character string in the first column: I want to extract the fourth word of each string and build a new list from those words. The list will have 153 rows and 1 column.

An example of the first two rows of column 1 of this table:

[1] Resistance_Test DevID (Ohms) 428
[2] Diode_Test SUBLo (V) 353

"Words" are separated by spaces, so the fourth word of the first row is "428" and the fourth word of the second row is "353". How can I create a new list containing the fourth word of all 153 rows?

Use gsub() with a regular expression:

x <- c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353")
ptn <- "(.*? ){3}"
gsub(ptn, "", x)

[1] "428" "353"

This works because the regular expression (.*? ){3} matches exactly three ({3}) runs of characters, each followed by a space ((.*? )), and gsub() then replaces the match with an empty string.

See ?gsub and ?regexp for more information.
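One caveat is worth sketching. The gsub() pattern above returns everything after the third space, so it yields the fourth word on its own only when a string has exactly four words. A possible variant, assuming you always want the fourth space-separated word, anchors the pattern at the start and uses sub() with a capture group. The vector y and its third element are made up for illustration.

```r
# Hypothetical input: the third element has extra words after the fourth one
y <- c("Resistance_Test DevID (Ohms) 428",
       "Diode_Test SUBLo (V) 353",
       "Resistance_Test DevID (Ohms) 428 something else")

# ^            anchor at the start of the string
# (\\S+ ){3}   skip three words, each followed by a single space
# (\\S+)       capture the fourth word
# .*           consume whatever follows
fourth <- sub("^(\\S+ ){3}(\\S+).*", "\\2", y)
fourth
# [1] "428" "353" "428"
```

Strings with fewer than four words do not match this pattern and are returned unchanged, so it may be worth checking for them separately.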


If your data has structure that you didn't mention in your question, the regular expression may become even simpler.

For example, if you are always interested in the last word of each line:

ptn <- "(.*? )"
gsub(ptn, "", x)

Or, if you know that the value you want is the only run of digits in each string, you can delete every non-digit character:

ptn <- "\\D"
gsub(ptn, "", x)
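As a rough check of these two shortcuts, the sketch below runs both on the example vector and then on a made-up string, z, whose first word also contains a digit. The string z is an assumption added for illustration; it shows why the digits-only pattern is safe only when no other word contains a digit.

```r
x <- c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353")
# Hypothetical row whose test name also contains a digit
z <- "Test2 DevID (Ohms) 428"

last_word   <- gsub("(.*? )", "", x)  # drop every word that is followed by a space
digits_only <- gsub("\\D", "", x)     # drop every non-digit character
mixed       <- gsub("\\D", "", z)     # digits from different words run together

last_word
# [1] "428" "353"
digits_only
# [1] "428" "353"
mixed
# [1] "2428"
```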

You could use word() from the stringr package:

> x <- c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353")
> library(stringr)
> word(string = x, start = 4, end = 4)
[1] "428" "353"

Because the start and end positions are both 4, word() returns exactly the fourth word of each string.

I hope this helps.

We can use sub(). The pattern matches one or more non-whitespace characters (\\S+) followed by one or more whitespace characters (\\s+), repeated three times ({3}), then a word that is captured in a second group ((\\w+)), then any remaining characters (.*). We replace the whole match with the second backreference (\\2).

sub("(\\S+\\s+){3}(\\w+).*", "\\2", str1)
#[1] "428" "353"

This selects the nth word, so it still works when more words follow the fourth (see str2 in the data below):

 sub("(\\S+\\s+){3}(\\w+).*", "\\2", str2)
 #[1] "428" "353" "428"

Another option is stri_extract_last_regex() from the stringi package, which extracts the last word of each string (the fourth word in str1):

 library(stringi)
 stri_extract_last_regex(str1, "\\w+")
 #[1] "428" "353"

Data

str1 <- c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353")
str2 <- c(str1, "Resistance_Test DevID (Ohms) 428 something else")

If you are not familiar with regular expressions, the function strsplit() can help:

data <- c('Resistance_Test DevID (Ohms) 428', 'Diode_Test SUBLo (V) 353')
unlist(lapply(strsplit(data, ' '), function(x) x[4]))
[1] "428" "353"
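Since the question is about a column of a table, the same idea can be sketched on a data frame. The data frame df and its column names are hypothetical stand-ins for the real 153 x 9 table, and the numeric conversion assumes the fourth word is always a number.

```r
# Hypothetical stand-in for the real table: only two rows and two columns here
df <- data.frame(
  test  = c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353"),
  other = c(1, 2),
  stringsAsFactors = FALSE
)

# Split each string on single spaces, then take the fourth piece of each split.
# vapply() checks that every result is a single character value.
pieces <- strsplit(df$test, " ", fixed = TRUE)
df$value <- vapply(pieces, function(p) p[4], character(1))

# Convert to numeric if the fourth word is always a number (an assumption here)
df$value_num <- as.numeric(df$value)
```

A row with fewer than four words would get NA in the new column rather than raising an error.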
