Extract number from a character string, if it is followed by certain characters in R

I have a dataframe with a variable that contains food quantities in different measurement units. The dataframe contains ~11000 observations.

Let me give you this example: "10gr peterselie, 7 Grams look, 5g kruiden en 400GRAMM bouillon, 2 tbsp olive-oil, 1oz ketchup, 20 grapes, 1 gelbe Paprika"

I found a way to extract the numbers and sum them up, using this function:

sum_numerics <- function(x) {

  # Grab all numbers that appear 
  matches <- str_match_all(x, "[0-9]+")

  # Grab the matches column in the list, transform to numeric, then sum
  sapply(matches, function(y) sum(as.numeric(y)))

}

What I'm looking for is a way to extract all food quantities that are in grams and write them into a new variable, so that I can sum them up in the next step. I spent some time looking for ways to do this and trying to solve the problem with the regex-demo, but I can't find a working solution and I really can't figure out how to write working regex functions. Shame on me!

User "Max Teflon" provided a possible solution that looks, after some more investigation, like this:

get_gramms <- function(x) {
  # Grab every number followed by an optional space and a gram-like unit
  str_extract_all(x, "([0-9]+\\s?([gG]|[gGrRaAmM]{5,6}|[gGrRaAmM]{2}))") %>%
    unlist() %>%                       # a vector is what we want
    str_remove_all('[[:alpha:]]') %>%  # drop the unit letters
    str_trim() %>%                     # remove trailing whitespace
    as.numeric()                       # change to numbers
}

x %>%
mutate(var = map(var,~get_gramms(.))) %>%
mutate(var = map_dbl(var,~ifelse(length(.)>0,sum(.),NA)))

I think his answer is close to solving my problem, but it still returns wrong values, for example for "1 gelbe Paprika".

Looking forward to new ideas and solutions!

You could use a look-ahead assertion and remove the whitespace afterwards:

library(tidyverse)
x <- "10gr peterselie, 7 Grams look, 5g kruiden en 400GRAMM bouillon, 2 tbsp olive-oil, 1oz ketchup"

sum_numerics <- function(x) {

  # Grab all numbers that appear 
  str_match_all(x, "[0-9]+\\s?(?=[gG])") %>% # any number followed by an optional space and a small/capital g
    unlist() %>% # a vector is what we want
    str_trim() %>% # remove all trailing whitespaces
    as.numeric() %>% # change to number
    sum() # sum it up

}
sum_numerics(x)
#> [1] 422

Or, if you just want to get all the numbers and use them afterwards:

library(tidyverse)
x <- "10gr peterselie, 7 Grams look, 5g kruiden en 400GRAMM bouillon, 2 tbsp olive-oil, 1oz ketchup"

get_gramms <- function(x) {

  # Grab all numbers that appear 
  str_match_all(x, "[0-9]+\\s?(?=[gG])") %>% # any number followed by an optional space and a small/capital g
    unlist() %>% # a vector is what we want
    str_trim() %>% # remove all trailing whitespaces
    as.numeric() # change to numbers
}
get_gramms(x)
#> [1]  10   7   5 400

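Note that the optional whitespace is matched outside the assertion here and trimmed afterwards. It could also go inside the look-ahead, e.g. "[0-9]+(?=\\s?[gG])", which makes str_trim() unnecessary: only look-behind assertions need a bounded length, look-aheads do not.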
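A look-ahead on [gG] alone also accepts "20 grapes" and "1 gelbe Paprika" from the question's full example, because both words start with a g. One way around this is to spell out the accepted unit spellings and require a word boundary after them. The following is only a sketch: the helper name get_grams_strict and the list of spellings (g, gr, gram, grams, gramm) are assumptions and would need extending for real data.

```r
library(stringr)

# Full example string from the question, including the two non-gram items.
x <- "10gr peterselie, 7 Grams look, 5g kruiden en 400GRAMM bouillon, 2 tbsp olive-oil, 1oz ketchup, 20 grapes, 1 gelbe Paprika"

# Hypothetical helper: the unit spellings are an assumption, not a complete list.
get_grams_strict <- function(x) {
  # Digits, followed (look-ahead) by an optional space, one of the unit
  # spellings, and a word boundary, so "grapes" and "gelbe" are rejected.
  pattern <- regex("\\d+(?=\\s?(?:g|gr|gram|grams|gramm)\\b)", ignore_case = TRUE)
  as.numeric(unlist(str_extract_all(x, pattern)))
}

get_grams_strict(x)
#> [1]  10   7   5 400
```

Because the whitespace sits inside the look-ahead, the matches are pure digit strings and no trimming step is needed.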

Maybe you can try the code below, using gsub() + regmatches() + gregexpr() from base R. Note that the unit list has to include "gramm", otherwise "400GRAMM" is not matched because of the trailing \\b:

r <- sum(as.numeric(gsub("(\\d+).*",
                         "\\1",
                         unlist(regmatches(s, gregexpr("\\d+\\s?(g|gr|gram|grams|gramm)\\b", s, ignore.case = TRUE))))))

which gives

> r
[1] 422

DATA

s <- "10gr peterselie, 7 Grams look, 5g kruiden en 400GRAMM bouillon, 2 tbsp olive-oil, 1oz ketchup"

EDIT: If you want to apply this along a column, maybe you can do it like below. Rows without any gram quantity sum to 0 and are set to NA.

f <- Vectorize(function(s) {
  sum(as.numeric(gsub("(\\d+).*",
                      "\\1",
                      unlist(regmatches(s, gregexpr("\\d+\\s?(g|gr|gram|grams|gramm)\\b", s, ignore.case = TRUE))))))
}
)

df <- within(df, y <- f(x))
df <- within(df, y <- ifelse(y == 0, NA, y))
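Since gregexpr() and regmatches() already work on whole character vectors, the column case can also be sketched without Vectorize(). The data frame below is made up for illustration, and the pattern assumes perl = TRUE so that a look-ahead can be used, meaning the matches are the bare numbers:

```r
# Hypothetical two-row data frame; the column name x follows the answer above.
df <- data.frame(
  x = c("10gr peterselie, 400GRAMM bouillon", "2 tbsp olive-oil"),
  stringsAsFactors = FALSE
)

# One vector of matched digit strings per row (character(0) if nothing matches).
hits <- regmatches(
  df$x,
  gregexpr("\\d+(?=\\s?(g|gr|gram|grams|gramm)\\b)", df$x,
           ignore.case = TRUE, perl = TRUE)
)

# Sum per row; rows without any gram quantity become NA instead of 0.
df$y <- vapply(hits,
               function(h) if (length(h)) sum(as.numeric(h)) else NA_real_,
               numeric(1))
df$y
#> [1] 410  NA
```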

This is somewhat ugly but we can use:

sum(as.numeric(unlist(sapply(strsplit(my_string, ","),
        function(x) stringr::str_extract_all(gsub("\\s", "", x),
                "\\d+(?=[gG][rams]?)"))))) # credit to ThomasisCoding (learnt something new)
[1] 422

Data:

my_string<-"10gr peterselie, 7 Grams look, 5g kruiden en 400GRAMM bouillon, 2 tbsp olive-oil, 1oz ketchup"

Using str_extract_all:

library(stringr)

str_extract_all(my_string,"[0-9]+(?=[ ]{0,2}[gG])")[[1]] %>% 
  as.numeric()%>%
  sum()

[1] 422

If you now have a vector of strings:

mystrings <- c("10gr peterselie, 7 Grams look, 5g kruiden en 400GRAMM bouillon, 2 tbsp olive-oil, 1oz ketchup",
               "but also 5g of something and 10 Gr of other stuffs")

str_extract_all(mystrings,"[0-9]+(?=[ ]{0,2}[gG])") %>%
  lapply(.,function(x) as.numeric(x) %>%
           sum()
         )

[[1]]
[1] 422

[[2]]
[1] 15
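The lapply() call returns a list, whereas the question asks for a new variable in a data frame. A minimal sketch of that last step, assuming a column called var as in the question and reusing the pattern from this answer unchanged (so it keeps the same limitation of matching any word that starts with g):

```r
library(stringr)

# Hypothetical data frame; only the column name var is taken from the question.
df <- data.frame(
  var = c("10gr peterselie, 7 Grams look",
          "2 tbsp olive-oil, 1oz ketchup",
          "but also 5g of something and 10 Gr of other stuffs"),
  stringsAsFactors = FALSE
)

matches <- str_extract_all(df$var, "[0-9]+(?=[ ]{0,2}[gG])")

# vapply() yields a plain numeric vector; rows without a match become NA.
df$grams <- vapply(
  matches,
  function(m) if (length(m) > 0) sum(as.numeric(m)) else NA_real_,
  numeric(1)
)
df$grams
#> [1] 17 NA 15
```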
