
Two conditions for splitting a column

I have a data frame like the one in the input example below.

I would like to split the second column into several columns at the "?" character. However, this is not easy, because there are other question marks inside the string that should not be split on. The only other thing the pieces have in common is that every substring contains 'http'.

How can I split them? The number of columns in the expected output is only an example; I do not know in advance how many columns will be generated.

Example of input data:

 df_in <- data.frame(x = c('x1','x2','x3','x4'),
                     y = c('http://example1.com?https://example2.com', 'NA', 'http://example3.com?id=1234?https://example4/com?http://example6.com', 'http://example5.com'))

The data frame as printed in the console:

 df_in
  x                                                                    y
 x1                             http://example1.com?https://example2.com
 x2                                                                   NA
 x3 http://example3.com?id=1234?https://example4/com?http://example6.com
 x4                                                  http://example5.com

Example of expected output:

 df_out <- data.frame(x = c('x1','x2','x3','x4'),
                      col1 = c('http://example1.com', 'NA', 'http://example3.com?id=1234', 'http://example5.com'),
                      col2 = c('https://example2.com', 'NA', 'https://example4/com', 'NA'),
                      col3 = c('NA', 'NA', 'http://example6.com', 'NA'))

The output as printed in the console:

  x                        col1                 col2                col3
 x1         http://example1.com https://example2.com                  NA
 x2                          NA                   NA                  NA
 x3 http://example3.com?id=1234 https://example4/com http://example6.com
 x4         http://example5.com                   NA                  NA

We can use separate() from tidyr to split column 'y' into multiple columns, splitting only at a ? that is immediately followed by http. The lookahead (?=http) checks for the http without consuming it, so each piece keeps its http prefix.

library(tidyr)
df_in %>%
     separate(y, into = paste0("col", 1:3), sep="[?](?=http)")
#   x                        col1                 col2                col3
#1 x1         http://example1.com https://example2.com                <NA>
#2 x2                          NA                 <NA>                <NA>
#3 x3 http://example3.com?id=1234 https://example4/com http://example6.com
#4 x4         http://example5.com                 <NA>                <NA>

If you have an arbitrary number of URLs to split, and hence do not know how many columns will be produced, you can use the cSplit function from the splitstackshape package. Before doing that, however, we need to insert an extra delimiter right before each ?http, so that the two-character sequence _? marks only the places where we want to split, i.e.

library(splitstackshape)

df_in$y <-  gsub('(\\w)(\\?h)', '\\1_\\2', df_in$y)
cSplit(df_in, 'y', '_?')

#Or all in one line,
cSplit(transform(df_in, y = gsub('(\\w)(\\?h)', '\\1_\\2', y)), 'y', '_?')

which gives,

     x                         y_1                  y_2                 y_3
 1: x1         http://example1.com https://example2.com                  NA
 2: x2                          NA                   NA                  NA
 3: x3 http://example3.com?id=1234 https://example4/com http://example6.com
 4: x4         http://example5.com                   NA                  NA
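
If you would rather not add a package, the same idea can probably be sketched in base R. This is only a minimal sketch and not part of the answers above: it reuses the lookahead pattern from the separate() call, assumes the column can be treated as character, and the names pieces, n_cols, padded and df_split are made up for illustration. The number of output columns is taken from the longest split, so it does not have to be known in advance.

```r
# Input data from the question (stringsAsFactors = FALSE keeps y as character)
df_in <- data.frame(
  x = c("x1", "x2", "x3", "x4"),
  y = c("http://example1.com?https://example2.com",
        "NA",
        "http://example3.com?id=1234?https://example4/com?http://example6.com",
        "http://example5.com"),
  stringsAsFactors = FALSE
)

# Split only at a "?" that is immediately followed by "http".
# The lookahead (?=http) requires perl = TRUE.
pieces <- strsplit(as.character(df_in$y), "[?](?=http)", perl = TRUE)

# The widest row decides how many columns are needed
n_cols <- max(lengths(pieces))

# Pad every split to n_cols with NA, then stack the rows into a matrix
padded <- do.call(rbind, lapply(pieces, `length<-`, n_cols))
colnames(padded) <- paste0("col", seq_len(n_cols))

# Re-attach the identifier column (illustrative name: df_split)
df_split <- cbind(df_in["x"], as.data.frame(padded, stringsAsFactors = FALSE))
print(df_split)
```

Note that the literal string 'NA' in row x2 stays a string here, whereas the padded cells are real missing values, so the two are not interchangeable without an extra conversion step.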

