Two conditions for splitting a column
I have a dataframe like the one shown below.
I would like to split the second column into several columns at each "?". However, this is not easy because there are other question marks in the string. The only other thing the substrings have in common is that each one contains 'http'.
How can I split them? The number of columns in the expected output is just an example; I do not know exactly how many will be generated.
Example of input data:
df_in <- data.frame(x = c('x1','x2','x3','x4'),
y = c('http://example1.com?https://example2.com', 'NA', 'http://example3.com?id=1234?https://example4/com?http://example6.com', 'http://example5.com'))
The dataframe as printed in the console:
df_in
x y
x1 http://example1.com?https://example2.com
x2 NA
x3 http://example3.com?id=1234?https://example4/com?http://example6.com
x4 http://example5.com
Example of expected output:
df_out <- data.frame(x = c('x1','x2','x3','x4'),
col1 = c('http://example1.com', 'NA', 'http://example3.com?id=1234', 'http://example5.com'),
col2 = c('https://example2.com', 'NA', 'https://example4/com', 'NA'),
col3 = c('NA', 'NA', 'http://example6.com', 'NA'))
The output as printed in the console:
x col1 col2 col3
x1 http://example1.com https://example2.com NA
x2 NA NA NA
x3 http://example3.com?id=1234 https://example4/com http://example6.com
x4 http://example5.com NA NA
We can use separate from tidyr to split the column 'y' into multiple columns, separating only at a ? that comes immediately before http (the regex uses a lookahead for this):
library(tidyr)
df_in %>%
separate(y, into = paste0("col", 1:3), sep="[?](?=http)")
# x col1 col2 col3
#1 x1 http://example1.com https://example2.com <NA>
#2 x2 NA <NA> <NA>
#3 x3 http://example3.com?id=1234 https://example4/com http://example6.com
#4 x4 http://example5.com <NA> <NA>
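When the number of resulting columns is not known in advance, the same lookahead pattern can also be used with base R. The following is only a minimal sketch, not part of the original answer: it assumes the input from the question, and the names parts, n_cols and padded are made up for illustration. Note that the 'NA' in the question's data is the literal string "NA", so it is kept as text in col1.

```r
# Input data from the question (the 'NA' entry is the literal string "NA")
df_in <- data.frame(
  x = c("x1", "x2", "x3", "x4"),
  y = c("http://example1.com?https://example2.com",
        "NA",
        "http://example3.com?id=1234?https://example4/com?http://example6.com",
        "http://example5.com"),
  stringsAsFactors = FALSE
)

# Split only at a "?" that is immediately followed by "http".
# The lookahead (?=http) requires perl = TRUE in base R.
parts <- strsplit(df_in$y, "[?](?=http)", perl = TRUE)

# Pad every row with NA up to the longest split, then stack into a matrix
n_cols <- max(lengths(parts))
padded <- do.call(rbind, lapply(parts, function(p) {
  length(p) <- n_cols   # extending a vector pads it with NA
  p
}))
colnames(padded) <- paste0("col", seq_len(n_cols))

# Combine the id column with the dynamically created columns
df_out <- cbind(df_in["x"], as.data.frame(padded, stringsAsFactors = FALSE))
df_out
```

Because n_cols is computed from the data, the number of output columns follows the longest row instead of being fixed to three.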
If you have an arbitrary number of URLs to split, and hence do not know the number of columns that will be produced, you can use the cSplit function from the splitstackshape package. However, before doing that, we need to add a delimiter right before ?http, i.e.
library(splitstackshape)
df_in$y <- gsub('(\\w)(\\?h)', '\\1_\\2', df_in$y)
cSplit(df_in, 'y', '_?')
# Or, starting again from the original df_in, all in one line:
cSplit(transform(df_in, y = gsub('(\\w)(\\?h)', '\\1_\\2', y)), 'y', '_?')
which gives:
    x                         y_1                  y_2                 y_3
1: x1         http://example1.com https://example2.com                  NA
2: x2                          NA                   NA                  NA
3: x3 http://example3.com?id=1234 https://example4/com http://example6.com
4: x4         http://example5.com                   NA                  NA
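To see what the gsub step does before the split on '_?', it can help to look at the intermediate strings. This is only an illustrative sketch in base R (the vector y and the name marked are made up here, and strsplit stands in for cSplit just to show where the pieces are cut):

```r
y <- c("http://example1.com?https://example2.com",
       "http://example3.com?id=1234?https://example4/com?http://example6.com")

# Insert "_" between a word character and a following "?h";
# "?id=1234" is left alone because its "?" is not followed by "h"
marked <- gsub("(\\w)(\\?h)", "\\1_\\2", y)
marked

# Splitting on the literal "_?" now recovers the individual URLs
strsplit(marked, "_?", fixed = TRUE)
```

One caveat with this pattern: "_" itself matches \w, so applying the same gsub a second time to already-marked strings inserts another underscore. The two cSplit variants are therefore alternatives, each meant to be run on the original data.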