
How to split text into multiple columns?

I have a column that has text in the format:

ID-XXXXX Process for Description [1/5]

I would like this to be broken into three columns where:

A = ID-XXXXX

B = Process for Description

C = 1/5

Any ideas on how to split this properly?

Here is an attempt with base R. Note that it is a bit tricky: the regex used for B assumes that XXXXX is always 5 characters long.

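d <- "ID-XXXXX Process for Description [1/5]"
a <- sub(" .+", "", d)                      # keep the text before the first space
cc <- sub("\\]", "", sub(".+ \\[", "", d))  # keep the text between "[" and "]"
b <- sub(" \\[.*\\]", "", d)                # drop the trailing " [1/5]"
b <- sub("ID-.{5} ", "", b)                 # drop the leading "ID-XXXXX "
f <- c(a, b, cc)
f
# [1] "ID-XXXXX" "Process for Description" "1/5"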
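
If the ID is not always 5 characters long, the three parts can also be pulled out in one pass with a single pattern. The following is a minimal base R sketch, assuming the ID contains no spaces and the string ends with the bracketed part; the names pattern and parts are only for illustration.

```r
d <- "ID-XXXXX Process for Description [1/5]"

# Three capture groups: ID (no spaces), description, text inside the brackets
pattern <- "^([^ ]+) (.+) \\[(.*)\\]$"

# regexec() locates the full match and each group; regmatches() extracts them
m <- regmatches(d, regexec(pattern, d))[[1]]

# m[1] is the full match, so the captured groups are elements 2 to 4
parts <- m[-1]
parts
```

Because the first group stops at the first space rather than after a fixed count, the same pattern should also cope with longer IDs.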

Using stringr, there are several options:

library(dplyr)
library(stringr)

dat <- data.frame(my_string = "ID-XXXXX Process for Description [1/5]")

dat %>% 
  mutate(A = str_extract(string = my_string, pattern = "ID-.{5}"),
         B = str_replace(string = my_string, pattern = "ID-.{5}\\s(.+)\\s\\[.*\\]", replacement = "\\1"),
         C = str_match(string = my_string, pattern = "\\[(.*)\\]")[, 2])

A: extracts the pattern ID- followed by exactly 5 characters.
B: captures the group between ID-XXXXX and [XX], and replaces the entire match with the captured group.
C: matches the captured pattern (.*) between the square brackets (the 2nd column of the str_match result holds the captured group).

Result:

                               my_string        A                       B   C
1 ID-XXXXX Process for Description [1/5] ID-XXXXX Process for Description 1/5
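
The column index matters once there is more than one row: str_match() returns a character matrix with one row per input string, the full match in column 1 and each captured group in the following columns. A small sketch, assuming stringr is installed; the vector x is made up for illustration.

```r
library(stringr)

x <- c("ID-00001 Process_A [1/5]", "ID-00002 Process_B [2/5]")

# One row per string: column 1 is the full match, column 2 the captured group
m <- str_match(x, "\\[(.*)\\]")

m[, 2]  # the captured group for every row
m[2]    # plain vector indexing: second element of the matrix in column-major order
```

Here m[, 2] gives "1/5" and "2/5", while m[2] gives "[2/5]", the full match of the second row, so the matrix-style index is the safer choice inside mutate().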

EDIT:
I just remembered that the extract() function from tidyr does exactly that.
Using capturing groups between parentheses in the regex argument, you get these into new columns directly.

library(tidyr)

dat <- data.frame(my_string = paste0("ID-0000", 1:5, " Process_", LETTERS[1:5], " [", 1:5, "/5]"))

extract(data = dat,
        col = my_string, 
        into = c("A", "B", "C"), 
        regex = "(ID-.{5})\\s(.+)\\s\\[(.*)\\]", 
        remove = FALSE)

                 my_string        A         B   C
1 ID-00001 Process_A [1/5] ID-00001 Process_A 1/5
2 ID-00002 Process_B [2/5] ID-00002 Process_B 2/5
3 ID-00003 Process_C [3/5] ID-00003 Process_C 3/5
4 ID-00004 Process_D [4/5] ID-00004 Process_D 4/5
5 ID-00005 Process_E [5/5] ID-00005 Process_E 5/5

If you don't want to keep the original string, use remove = TRUE.

You may also use tidyr::extract to do this systematically. An example, elaborated for the purpose of demonstration:

  • extract everything up to the first space into the first capture group
  • extract everything up to [ into the second capture group
  • extract everything up to ] into the third capture group

This way there is no limit on the number of characters per capture group.

vec <- c("ID-XXXXX Process for Description [1/5]", "ID-XXXXXYZ Process for Description something [1/5]", "ID-XXXXXFFF Process for Description something else [1/905]", "ID-XXXXXYYYYP Process for Description [900001/5]")
df <- data.frame(col = vec)
df
#>                                                          col
#> 1                     ID-XXXXX Process for Description [1/5]
#> 2         ID-XXXXXYZ Process for Description something [1/5]
#> 3 ID-XXXXXFFF Process for Description something else [1/905]
#> 4           ID-XXXXXYYYYP Process for Description [900001/5]
library(tidyverse)
df %>%
  extract(col, into = c('A', 'B', 'C'), regex = '^([^\\s]*)\\s([^\\[]*)\\[([^\\]]*)\\]$')
#>               A                                       B        C
#> 1      ID-XXXXX                Process for Description       1/5
#> 2    ID-XXXXXYZ      Process for Description something       1/5
#> 3   ID-XXXXXFFF Process for Description something else     1/905
#> 4 ID-XXXXXYYYYP                Process for Description  900001/5

Created on 2021-05-30 by the reprex package (v2.0.0)
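
The same three-group idea also works without tidyr. As a rough sketch, base R's utils::strcapture() fills a prototype data frame from the capture groups. The pattern below is a variation of the one above (an assumption, not the original answer's regex): it puts the space before [ outside the second group, so column B carries no trailing space. The names proto and res are only for illustration.

```r
vec <- c("ID-XXXXX Process for Description [1/5]",
         "ID-XXXXXYZ Process for Description something [1/5]",
         "ID-XXXXXFFF Process for Description something else [1/905]",
         "ID-XXXXXYYYYP Process for Description [900001/5]")

# The prototype gives the names and types of the output columns
proto <- data.frame(A = character(), B = character(), C = character(),
                    stringsAsFactors = FALSE)

# Group 1: up to the first space; group 2: up to the space before "[";
# group 3: everything between "[" and "]"
res <- utils::strcapture("^([^ ]*) ([^[]*) \\[([^]]*)\\]$", vec, proto)
res
```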

We could use str_extract:

library(dplyr)
library(stringr)

df <- tibble(col1 = "ID-XXXXX Process for Description [1/5]")

df %>% 
  mutate(A = str_extract(col1, "ID-XXXXX"),
         B = str_extract(col1, "Process for Description"),
         C = str_extract(col1, "(?<=\\[).+(?=\\])"))

Output:

# A tibble: 1 x 4
  col1                                   A        B                       C
  <chr>                                  <chr>    <chr>                   <chr>
1 ID-XXXXX Process for Description [1/5] ID-XXXXX Process for Description 1/5
