R: extract numbers of exactly 8 digits from a string and transform the result

Question

I have two problems extracting and transforming data using R. Here's the dataset:

messageID | msg
1111111111 | hey id 18271801, fix it asap
2222222222 | please fix it soon id12901991 and 91222911. dissapointed
3333333333 | wow $300 expensive man, come on
4444444444 | number 2837169119 test

The problems are:

I want to grab only the numbers that are exactly 8 digits long. In the dataset above, message IDs 3333... (300, 3 digits) and 4444... (2837169119, 10 digits) should not be included. Here's my best shot so far:

 as.matrix(unlist(apply(df[2],1,function(x){regmatches(x,gregexpr('([0-9]){8}', x))})))

However, with this line of code, message 4444... is still included, because the pattern also matches the first 8 digits of a longer number.

I also want to transform the data into another form, like this:

 message_id | customer_ID
 1111111111 | 18271801
 2222222222 | 12901991
 2222222222 | 91222911

I don't know how to transform the data efficiently. The output of dput(df):

 structure(list(id = c(1111111111, 2222222222, 3333333333, 4444444444 ), msg = c("hey id 18271801, fix it asap", "please fix it soon id12901991 and 91222911. dissapointed", "wow $300 expensive man, come on", "number 2837169119 test")), .Names = c("id", "msg"), row.names = c(NA, 4L), class = "data.frame")

Thanks

Answer 1

Use rebus to create your regular expression, and stringr to extract the matches.

You may need to play with the exact form of the regular expression. This code works on your examples, but you'll probably need to adapt it for your dataset.

library(rebus)
library(stringr)

# Create regex
rx <- negative_lookbehind(DGT) %R%
  dgt(8) %R%  
  negative_lookahead(DGT)
rx
## <regex> (?<!\d)[\d]{8}(?!\d)

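# Extract the IDs (one character vector of matches per message).
# stringr's default ICU regex engine supports lookarounds, so the pattern is
# passed directly; the perl() wrapper from older stringr versions is not needed.
extracted_ids <- str_extract_all(df$msg, rx)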

# Stuff the IDs into a data frame.
data.frame(
  messageID = rep(
    df$id, 
    vapply(extracted_ids, length, integer(1))
  ),
  extractedID = unlist(extracted_ids, use.names = FALSE)
)
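
For comparison, the same idea can be sketched in base R without rebus or stringr. This is only an illustration, not part of the original answer: it rebuilds df from the dput output in the question, writes the lookaround pattern by hand, and the names pattern, matches and result are made up for the example.

```r
# Sample data, rebuilt from the dput(df) output in the question.
df <- data.frame(
  id = c(1111111111, 2222222222, 3333333333, 4444444444),
  msg = c(
    "hey id 18271801, fix it asap",
    "please fix it soon id12901991 and 91222911. dissapointed",
    "wow $300 expensive man, come on",
    "number 2837169119 test"
  ),
  stringsAsFactors = FALSE
)

# Exactly 8 digits: no digit immediately before or after the run.
pattern <- "(?<!\\d)\\d{8}(?!\\d)"

# perl = TRUE is needed because base R's default regex engine
# does not support lookarounds.
matches <- regmatches(df$msg, gregexpr(pattern, df$msg, perl = TRUE))

# One output row per match; messages without a match contribute no rows.
result <- data.frame(
  message_id = rep(df$id, lengths(matches)),
  customer_ID = unlist(matches),
  stringsAsFactors = FALSE
)
result
```

On the sample data this gives three rows: 18271801 for message 1111111111, and 12901991 and 91222911 for message 2222222222. The lookarounds are used instead of word boundaries (\b) because "id12901991" has no word boundary between the letters and the digits.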

Answer 1: accepted, 2015-03-22 06:45:04
