How to prevent tidyr's separate function from pulling in row numbers and then dropping data
I am trying to write a line of code that splits a text string wherever a capital letter is encountered, without removing the letter. The approach I have taken is as follows:
library(dplyr)  # mutate() and %>%
library(tidyr)  # unite() and separate()

set.seed(1)
# create a data frame of fused alphanumeric codes that I wish to separate
df1 <- as.data.frame(matrix(
  paste0(sample(LETTERS, 20, replace = TRUE), sample(seq(1, 7, 0.1), 20, replace = TRUE)),
  nrow = 10)) %>% unite(col = "ab", sep = "")
df1
# Add a space (" ") before any capital letter encountered
df2 <- df1 %>% mutate(ab = gsub('([[:upper:]])', ' \\1', ab))
df2
# use separate to split the column based on the space
df3 <- df2 %>% separate(col = ab, into = c("a", "b"), sep = " ")
df3
When I run separate I get a warning and the output is not correct:
#Warning message:
#Expected 2 pieces. Additional pieces discarded in 10 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].
#> df3
#   a    b
#1      Y3
#2    D4.6
#3      G5
#4    A3.4
#5    B5.5
#6    W4.6
#7    K4.6
#8    N4.3
#9    R5.1
#10   S3.4
The contents intended for column "a" have been placed in column "b", whilst those intended for "b" appear to have been removed entirely.
Another option is to write a more precise regex from the very beginning, one that matches only the position between a digit and a capital letter. E.g.
df1 |>
separate(col = ab,
into = c("a", "b"),
sep = "(?<=\\d)(?=[[:upper:]])")
Output:
a b
1 B1.8 Z4.3
2 M5 U6.7
3 N5 Q5.1
4 V4.9 B6.5
5 N4 V1.2
6 H2.8 J5.1
7 Q3.6 J1.3
8 J3.8 G2.9
9 B1.2 W4.7
10 L1.6 O3.5
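To see what that pattern matches, here is a minimal base R sketch. The input values are copied from the output above, and the variable names (x, pattern, spaced, pieces) are only illustrative. It puts a space at every position matched by the lookaround and then splits on that space, so no tidyr is needed to try it:

```r
# Two sample values taken from the output above
x <- c("B1.8Z4.3", "M5U6.7")

# Zero-width match: a digit immediately behind, a capital letter immediately ahead.
# Nothing is consumed, so no character is lost when splitting here.
pattern <- "(?<=\\d)(?=[[:upper:]])"

# Lookarounds need perl = TRUE in base R
spaced <- gsub(pattern, " ", x, perl = TRUE)
spaced

# Splitting on the inserted space gives exactly two pieces per value
pieces <- strsplit(spaced, " ", fixed = TRUE)
pieces
```

Because the lookbehind requires a preceding digit, the capital letter at the start of each string is not a match, so no empty leading piece is produced.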
This is because you create a white space before your first letter, so splitting on " " yields an empty first piece. To remove the leading space, you can use trimws or str_trim:
df1 %>%
mutate(ab = trimws(gsub('([[:upper:]])', ' \\1', ab))) %>%
separate(col=ab, into=c("a", "b"), sep = " ")
a b
1 Y3 A5.3
2 D4.6 U2.4
3 G5 U4.2
4 A3.4 J2.9
5 B5.5 V4.4
6 W4.6 N1.5
7 K4.6 J1.9
8 N4.3 G5.1
9 R5.1 I4.7
10 S3.4 O5.6
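The empty leading piece can be reproduced without tidyr. The following is a small base R sketch using a single value from the output above; the variable names are illustrative and strsplit stands in for separate only to show the pieces:

```r
# One fused code, as in the first row above
x <- "Y3A5.3"

# The gsub() from the question puts a space before EVERY capital,
# including the one at the very start of the string
spaced <- gsub("([[:upper:]])", " \\1", x)
spaced

# Splitting on " " therefore yields three pieces, the first one empty
pieces_raw <- strsplit(spaced, " ", fixed = TRUE)[[1]]
pieces_raw

# Trimming the leading space first leaves the two intended pieces
pieces_trimmed <- strsplit(trimws(spaced), " ", fixed = TRUE)[[1]]
pieces_trimmed
```

With into = c("a", "b"), the empty piece goes to "a", the first code goes to "b", and the third piece is the one the warning reports as discarded.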
I later worked out that the row numbers are being included as a column, and that I can get around this problem by acknowledging and deleting the "n" column:
df3 <- df2 %>% separate(col=ab, into=c("n", "a", "b"), sep = " ") %>%
select(-n)
df3
However, this is verbose, and I can't find any previous literature or documentation describing this behaviour in separate. Am I missing something, and is there a neater way of preventing this behaviour?