Split a column in R with inconsistent data format

I have an R dataframe that has 17 columns. One column contains the unique identifiers that I will use to merge with other dataframes. However, some rows in this column contain extra data, which makes merging impossible. Here is a subset of the different kinds of values I'm looking at.

M2017013708-MN-M02199-180405
M201701492756-MN-M05144-180419
M2016019446_S3_L001
M2016019762

Everything from -MN onward is considered extra data that needs to be removed. My goal is to add a new column to the dataframe without the extra data. It would look like this:

M2017013708
M201701492756
M2016019446_S3_L001
M2016019762

I've tried to split the data at -MN, which produces a list, and then turn that list into a dataframe with ldply. However, this results in an error because the split produces list elements of differing lengths, since not all rows have a -MN.

split_my_data <- strsplit(my_data$sample_name, '-MN')
df <- ldply(split_my_data)

I also tried using a CASE expression and a regular expression in SQL with sqldf. However, I get the error: no such function REGEXP.

Any help would be greatly appreciated.

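Or you can try this method using a lookbehind regex (?<=) with stringr. Note that the lookbehind here is empty, so it matches at every position and the pattern behaves the same as plain -MN.*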

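library(dplyr)    # %>% and mutate()
library(stringr)  # str_replace_all() and regex()

df <- data.frame(OBS = 1:4,
                 CODE = c("M2017013708-MN-M02199-180405",
                          "M201701492756-MN-M05144-180419",
                          "M2016019446_S3_L001",
                          "M2016019762"))
# Delete "-MN" and everything after it; rows without "-MN" are left unchanged
df2 <- df %>%
  mutate(CODE2 = str_replace_all(CODE, regex("(?<=)-MN.*"), ""))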
# OBS                           CODE               CODE2
# 1   1   M2017013708-MN-M02199-180405         M2017013708
# 2   2 M201701492756-MN-M05144-180419       M201701492756
# 3   3            M2016019446_S3_L001 M2016019446_S3_L001
# 4   4                    M2016019762         M2016019762
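For comparison, the same result can be sketched in base R without any packages. This is a minimal sketch, not part of the original answer: the vector name `dirty` and the variable names below are assumptions made for illustration. It also shows that the strsplit() route from the question can work, provided only the first piece of each split is kept.

```r
# Sample identifiers taken from the question
dirty <- c("M2017013708-MN-M02199-180405",
           "M201701492756-MN-M05144-180419",
           "M2016019446_S3_L001",
           "M2016019762")

# sub() removes the first match of "-MN" plus everything after it;
# strings with no "-MN" are returned unchanged
clean <- sub("-MN.*", "", dirty)

# strsplit() returns pieces of differing lengths, so keep only the first
# piece of each element instead of binding the whole list into a dataframe
first_piece <- vapply(strsplit(dirty, "-MN", fixed = TRUE),
                      `[`, character(1), 1)

print(clean)
print(identical(clean, first_piece))
```

Keeping only the first piece sidesteps the unequal-length problem that made ldply fail.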

A simple tidy solution could also be:

library(dplyr)
library(stringr)

data <- tibble(dirty = c('M2017013708-MN-M02199-180405',
                         'M201701492756-MN-M05144-180419',
                         'M2016019446_S3_L001',
                         'M2016019762'))

data %>%
  mutate(clean = str_remove(dirty, pattern = '-MN.*'))

# A tibble: 4 x 2
  dirty                          clean              
  <chr>                          <chr>              
1 M2017013708-MN-M02199-180405   M2017013708        
2 M201701492756-MN-M05144-180419 M201701492756      
3 M2016019446_S3_L001            M2016019446_S3_L001
4 M2016019762                    M2016019762 
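Once a cleaned column exists, the merge that the question is aiming for can go ahead. The following is a rough sketch in base R under stated assumptions: the second table `other`, its `batch` column, and the values in it are hypothetical and exist only to illustrate the join; they do not come from the question.

```r
# Identifiers from the question, with a cleaned key column added
samples <- data.frame(dirty = c("M2017013708-MN-M02199-180405",
                                "M201701492756-MN-M05144-180419",
                                "M2016019446_S3_L001",
                                "M2016019762"))
samples$id <- sub("-MN.*", "", samples$dirty)

# Hypothetical second table keyed by the clean identifier (illustrative only)
other <- data.frame(id    = c("M2017013708", "M2016019762"),
                    batch = c("A", "B"))

# Left join: keep every sample, fill batch with NA where there is no match
merged <- merge(samples, other, by = "id", all.x = TRUE)
print(merged)
```

With all.x = TRUE every original row survives the merge, which makes it easy to spot identifiers that still fail to match.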

SQLite

Regarding SQLite, regular expressions are only available if regular expression support is turned on when SQLite is built, but RSQLite did not do that, so REGEXP is not available.

What you can do is append -MN- to the end of each string to ensure that there is always at least one occurrence, then find its position with instr and take the substring up to that point with substr:

library(sqldf)
sqldf("select V1, substr(V1, 1, instr(V1 || '-MN-', '-MN-') - 1) as V2 from DF")

giving:

                              V1                  V2
1   M2017013708-MN-M02199-180405         M2017013708
2 M201701492756-MN-M05144-180419       M201701492756
3            M2016019446_S3_L001 M2016019446_S3_L001
4                    M2016019762         M2016019762
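To see why appending the marker works, the same steps can be mirrored in base R. This is only an illustrative sketch of the logic, not what sqldf runs internally; the intermediate names `padded` and `pos` are assumptions.

```r
V1 <- c("M2017013708-MN-M02199-180405",
        "M201701492756-MN-M05144-180419",
        "M2016019446_S3_L001",
        "M2016019762")

# Append the marker so that every string contains it at least once
padded <- paste0(V1, "-MN-")

# 1-based position of the first "-MN-", playing the role of instr()
pos <- as.integer(regexpr("-MN-", padded, fixed = TRUE))

# Keep the characters before the marker, playing the role of substr()
V2 <- substr(V1, 1, pos - 1)
print(V2)
```

For strings that had no marker, the first occurrence is the appended one, so pos - 1 equals the full string length and the value comes back unchanged.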

H2

If we use the H2 backend to sqldf instead of SQLite, then we can use regular expressions. The RH2 package includes both the R driver and H2 itself, and if it is loaded, sqldf will assume you want to use it instead of SQLite. The order in which RH2 and sqldf are loaded does not matter.

library(RH2)
library(sqldf)

sqldf("select V1, regexp_replace(V1, '-MN-.*', '') as V2 from DF")

Note

The input in reproducible form is:

DF <- data.frame(V1 = c("M2017013708-MN-M02199-180405",
                        "M201701492756-MN-M05144-180419",
                        "M2016019446_S3_L001",
                        "M2016019762"))
