Cleaning string and number columns in an R data.table

Question

I am trying to clean a data.frame whose columns contain both text and numbers. In the example column "name" I would like to remove the numbers, and in the column "number" I would like to keep only the first number (without any text).

I am using data.table and created this frame:

df <- data.frame(x=c(1,2,3,4,5,6,7,8),
                 name=c('Tom', 'Maria. Anna3', 'Ina.2', 'Anna13', 'Tim2a', 'Zoé', 'Mark_1', 'Bea: 2'), 
                 number=c('12, 13', '11/12', '3b', '12, 13', '134z', 'number 14', 'B3', '3-5'))

As described above, I would expect a cleaned table like this:

df_cleaned <- data.frame(x=c(1,2,3,4,5,6,7,8),
                         name=c('Tom', 'Maria Anna', 'Ina', 'Anna', 'Tim', 'Zoé', 'Mark', 'Bea'),
                         number=c('12', '11', '3', '12', '134', '14', '3', '3'))

Thank you very much for your reply :)

Answer 1

You can use readr::parse_number, which does exactly that.

readr::parse_number(df$number)
#[1]  12  11   3  12 134  14   3   3

Or in base R:

as.numeric(sub('.*?(\\d+).*', '\\1', df$number))
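If some values might contain no digits at all, sub() returns the input unchanged and as.numeric() then gives NA with a warning. Here is a minimal sketch of a guarded variant. The helper name first_number is hypothetical (it is not from the answer), and the pattern is an alternative formulation that skips leading non-digits instead of using the lazy quantifier:

```r
# Hypothetical helper (not from the original answer): extract the first run
# of digits from each string, returning NA where a string has no digits.
first_number <- function(x) {
  has_digit <- grepl('[0-9]', x)
  out <- rep(NA_real_, length(x))
  # '[^0-9]*' skips leading non-digits, '([0-9]+)' captures the first digit
  # run, and '.*' consumes whatever follows.
  out[has_digit] <- as.numeric(sub('^[^0-9]*([0-9]+).*$', '\\1', x[has_digit]))
  out
}

number <- c('12, 13', '11/12', '3b', '12, 13', '134z', 'number 14', 'B3', '3-5')
print(first_number(number))
print(first_number(c('abc', '7up')))
```

On the question's data this should agree with the parse_number output shown above; the only difference is that strings without digits become NA silently.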

To clean up the names, you can use this regex:

df$name <- sub('([ :_.]|\\d).*', '', df$name)
#[1] "Tom"   "Maria" "Ina"   "Anna"  "Tim"   "Zoé"   "Mark"  "Bea"
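Note that this regex cuts each name at the first space, punctuation mark, or digit, so 'Maria. Anna3' becomes "Maria", whereas the question's expected output has 'Maria Anna'. A possible sketch that reproduces the expected names for this sample, assuming the rule is "drop everything from the first digit on, then remove punctuation" (that rule is my reading of the example, not something the question states):

```r
name <- c('Tom', 'Maria. Anna3', 'Ina.2', 'Anna13', 'Tim2a', 'Zoé', 'Mark_1', 'Bea: 2')

# Step 1: drop everything from the first digit onwards ('Tim2a' -> 'Tim').
no_digits <- sub('[0-9].*$', '', name)
# Step 2: remove punctuation, which includes '.', ':' and '_'.
no_punct <- gsub('[[:punct:]]', '', no_digits)
# Step 3: trim leftover leading/trailing whitespace ('Bea ' -> 'Bea').
name_cleaned <- trimws(no_punct)
print(name_cleaned)
```

Names with digits in the middle followed by more letters would lose the trailing letters under this rule, as 'Tim2a' does here, so check it against your real data.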

Answer 2

Does this work?

library(dplyr)
library(stringr)
df %>% mutate(name = str_extract(name, '[A-Za-z]+'), number = readr::parse_number(number))
  x  name number
1 1   Tom     12
2 2 Maria     11
3 3   Ina      3
4 4  Anna     12
5 5   Tim    134
6 6    Zo     14
7 7  Mark      3
8 8   Bea      3
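In the output above 'Zoé' is truncated to "Zo", because '[A-Za-z]' only covers unaccented ASCII letters. A minimal base R sketch using the Unicode letter class '\\p{L}' instead (the same class should also work inside str_extract). Like the original, it keeps only the first run of letters, so 'Maria. Anna3' still yields "Maria":

```r
# 'Zo\u00e9' is 'Zoé' written with a Unicode escape so the string is UTF-8
# regardless of how this file is saved.
name <- c('Tom', 'Maria. Anna3', 'Ina.2', 'Zo\u00e9', 'Mark_1')

# '\\p{L}+' matches a run of letters in any script; base R needs perl = TRUE for it.
m <- regexpr('\\p{L}+', name, perl = TRUE)
# regmatches() returns only the matched pieces; every element matches here.
first_word <- regmatches(name, m)
print(first_word)
```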


Answer 1
4 Accepted 2021-10-21 06:53:29

Answer 2
2 2021-10-21 07:02:03
