How to split text into multiple columns?

Question

I have a column that has text in the format:

ID-XXXXX Process for Description [1/5]

I would like this to be broken into three columns where:

A = ID-XXXXX

B = Process for Description

C = 1/5

Any ideas on how to split this properly?

Answer 1

Here is one attempt. Note that the first part is a bit tricky: the regex assumes that XXXXX is always exactly 5 characters long.

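d <- "ID-XXXXX Process for Description [1/5]"

# A: drop everything from the first space onward
a <- sub("[ ].+", "", d)
# C: drop everything up to " [", then drop the closing "]"
c_part <- sub("[]]", "", sub(".+[ ][[]", "", d))
# B: drop the trailing " [...]", then the leading "ID-XXXXX " (assumes a 5-character ID)
b <- sub("ID-.{5}[ ]", "", sub("[ ][[].*[]]", "", d))

f <- c(a, b, c_part); f
# [1] "ID-XXXXX" "Process for Description" "1/5"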
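
The 5-character assumption can be avoided by matching all three parts with one pattern. The following is only a sketch in base R, not part of the original answer: the helper name split_parts is hypothetical, and it assumes the first part never contains a space and the last part is always wrapped in square brackets at the end of the string.

```r
# Hypothetical helper (not from the original answer): split each string into
# three parts with a single regular expression.
split_parts <- function(x) {
  # Group 1: run of non-space characters
  # Group 2: everything up to the whitespace before "["
  # Group 3: everything inside the square brackets
  pattern <- "^(\\S+)\\s+(.*)\\s+\\[([^]]*)\\]$"
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each match is c(full_match, group1, group2, group3); keep only the groups.
  # Strings that do not match yield a row of NA.
  pick <- function(v) if (length(v) == 4) v[-1] else rep(NA_character_, 3)
  out <- t(vapply(m, pick, character(3)))
  colnames(out) <- c("A", "B", "C")
  as.data.frame(out, stringsAsFactors = FALSE)
}

parts <- split_parts(c("ID-XXXXX Process for Description [1/5]",
                       "ID-XXXXXYZ Something else [10/905]"))
print(parts)
```

Because the ID is matched as "any run of non-space characters", IDs of different lengths are handled the same way.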

Answer 2

Using stringr, there are several options:

library(dplyr)
library(stringr)

dat <- data.frame(my_string = "ID-XXXXX Process for Description [1/5]")

dat %>% 
  mutate(A = str_extract(string = my_string, pattern = "ID-.{5}"),
         B = str_replace(string = my_string, pattern = "ID-.{5}\\s(.+)\\s\\[.*\\]", replacement = "\\1"),
         # [, 2] takes the second column (the captured group) for every row
         C = str_match(string = my_string, pattern = "\\[(.*)\\]")[, 2])

A: extracts the pattern ID- followed by exactly 5 characters
B: captures the group between ID-XXXXX and [XX], and replaces the entire match with the captured group
C: matches the captured group (.*) between the square brackets (the 2nd column of the str_match result holds the captured group)

Result:

                               my_string        A                       B   C
1 ID-XXXXX Process for Description [1/5] ID-XXXXX Process for Description 1/5

EDIT:
I just remembered that the extract() function from tidyr does exactly that.
By putting capturing groups in parentheses in the regex argument, you get each group into its own new column directly.

library(tidyr)

dat <- data.frame(my_string = paste0("ID-0000", 1:5, " Process_", LETTERS[1:5], " [", 1:5, "/5]"))

extract(data = dat,
        col = my_string, 
        into = c("A", "B", "C"), 
        regex = "(ID-.{5})\\s(.+)\\s\\[(.*)\\]", 
        remove = FALSE)

                 my_string        A         B   C
1 ID-00001 Process_A [1/5] ID-00001 Process_A 1/5
2 ID-00002 Process_B [2/5] ID-00002 Process_B 2/5
3 ID-00003 Process_C [3/5] ID-00003 Process_C 3/5
4 ID-00004 Process_D [4/5] ID-00004 Process_D 4/5
5 ID-00005 Process_E [5/5] ID-00005 Process_E 5/5

If you don't want to keep the original string, use remove = TRUE.

Answer 3

You can also use tidyr::extract to do this systematically. The example below is spelled out for demonstration:

extract everything up to the first space into the first capture group
extract everything up to [ into the second capture group
extract everything up to ] into the third capture group

This way there is no limit on the number of characters in each capture group.

vec <- c("ID-XXXXX Process for Description [1/5]", "ID-XXXXXYZ Process for Description something [1/5]", "ID-XXXXXFFF Process for Description something else [1/905]", "ID-XXXXXYYYYP Process for Description [900001/5]")
df <- data.frame(col = vec)
df
#>                                                          col
#> 1                     ID-XXXXX Process for Description [1/5]
#> 2         ID-XXXXXYZ Process for Description something [1/5]
#> 3 ID-XXXXXFFF Process for Description something else [1/905]
#> 4           ID-XXXXXYYYYP Process for Description [900001/5]
library(tidyverse)
df %>%
  extract(col, into = c('A', 'B', 'C'), regex = '^([^\\s]*)\\s([^\\[]*)\\[([^\\]]*)\\]$')
#>               A                                       B        C
#> 1      ID-XXXXX                Process for Description       1/5
#> 2    ID-XXXXXYZ      Process for Description something       1/5
#> 3   ID-XXXXXFFF Process for Description something else     1/905
#> 4 ID-XXXXXYYYYP                Process for Description  900001/5

Created on 2021-05-30 by the reprex package (v2.0.0)
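
One detail of this pattern is easy to miss in the right-aligned printout: the second group, ([^\\[]*), runs all the way up to the [, so it also captures the space in front of the bracket. The sketch below applies the same pattern in base R (regexec with perl = TRUE is used here as a stand-in for tidyr::extract, which is an assumption about equivalent matching behaviour) and trims the result with trimws.

```r
x <- c("ID-XXXXX Process for Description [1/5]",
       "ID-XXXXXYYYYP Process for Description [900001/5]")

# Same three-group pattern as in the extract() call above
pattern <- "^([^\\s]*)\\s([^\\[]*)\\[([^\\]]*)\\]$"
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Element 1 of each match is the full match, so group 2 sits at index 3
b_raw <- vapply(m, function(v) v[3], character(1))
b_trimmed <- trimws(b_raw)

# Compare lengths before and after trimming
print(nchar(b_raw))
print(nchar(b_trimmed))
```

If the trailing space matters, the B column can be trimmed after extraction in the same way.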

Answer 4

We could use str_extract

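library(dplyr)
library(stringr)
df <- tibble(col1 = "ID-XXXXX Process for Description [1/5]")
# The patterns are literal, so they match only this exact text; C keeps the brackets
df %>% 
  mutate(A = str_extract(col1, "ID-XXXXX"),
         B = str_extract(col1, "Process for Description"),
         C = str_extract(col1, "\\[1\\/5\\]"))

Output:

# A tibble: 1 x 4
  col1                                   A        B                       C    
  <chr>                                  <chr>    <chr>                   <chr>
1 ID-XXXXX Process for Description [1/5] ID-XXXXX Process for Description [1/5]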

4 answers

Answer 1
0 2020-01-30 13:21:59

Answer 2
0 2020-02-08 16:53:30

Answer 3
0 2021-05-30 12:38:50

Answer 4
0 2021-05-30 13:17:01
