
r - convert single column data frame to data frame with rows based on one fixed text

Update 1

I'm linking the actual dataset because the solutions given for the example data are not working for me.

Link: https://app.box.com/s/65j1enr13pi51i44mfrymccklw1artot

Please note that LOT is the end-of-row marker.

-- -

I have a data frame like the following (single column):

D
2
f
h
k
END_ROW_WORD
k
1
2
END_ROW_WORD
e
g
j
2
k
END_ROW_WORD

I'd like to convert it into the following format:

[image: enter image description here]

As you can see, there is a specific word (END_ROW_WORD) that marks the end of the row.

Here is a similar approach to Alejandro's, but using split instead of a for loop:

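# Work on a character vector so factor columns are not turned into integer codes
x <- as.character(df[[1]])
# Length of each output row: distance between consecutive END_ROW_WORD positions
row_lengths <- diff(c(0, which(x == "END_ROW_WORD")))
rows <- split(x, rep(seq_along(row_lengths), row_lengths))
# Pad shorter rows with NA, then bind them into one data frame
rows <- lapply(rows, `length<-`, max(lengths(rows)))
as.data.frame(do.call(rbind, rows))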
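
The grouping step above can also be sketched with cumsum, which avoids computing row lengths explicitly. This is only an illustration on the example data from the question; the names x, is_end, row_id and wide are made up for the sketch, and it assumes the marker appears as an exact, standalone value.

```r
# Example column from the question, as a plain character vector
x <- c("D", "2", "f", "h", "k", "END_ROW_WORD",
       "k", "1", "2", "END_ROW_WORD",
       "e", "g", "j", "2", "k", "END_ROW_WORD")

is_end <- x == "END_ROW_WORD"
# Row id of each element = number of markers seen before it, plus one
row_id <- cumsum(c(0, head(is_end, -1))) + 1
rows <- split(x, row_id)
# Pad shorter rows with NA so rbind() yields a rectangular matrix
rows <- lapply(rows, `length<-`, max(lengths(rows)))
wide <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
wide
```

Because the row id is derived from the markers seen so far, any values that follow the last marker would simply form one more (unterminated) row instead of triggering a length mismatch in split.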

A solution without for loops, using stringr:

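library(stringr)
# Join all values into one string, separated by spaces
new_text <- str_c(as.character(df$V1), collapse = " ")
# A newline after each marker turns every group into one line of text
new_text <- str_replace_all(new_text, "END_ROW_WORD", "END_ROW_WORD\n")
read.table(text = new_text, fill = TRUE)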

#   V1 V2 V3           V4 V5           V6
# 1  D  2  f            h  k END_ROW_WORD
# 2  k  1  2 END_ROW_WORD                
# 3  e  g  j            2  k END_ROW_WORD

Data

df <- 
  structure(list(V1 = structure(c(3L, 2L, 6L, 8L, 10L, 5L, 10L, 1L, 2L, 5L, 4L, 7L, 9L, 2L, 10L, 5L),
                                .Label = c("1", "2", "D", "e", "END_ROW_WORD", "f", "g", "h", "j", "k"),
                                class = "factor")),
            .Names = "V1", class = "data.frame", row.names = c(NA, -16L))

This might not be the best way to do it, but it works:

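# Assumes `data` is the single column as a character vector,
# e.g. data <- as.character(df[[1]])
pos_help = grep("END_ROW_WORD", data)

d = list()
for(i in seq_along(pos_help)){
  if(i == 1){
    d[[i]] = data[1:pos_help[1]]
  } else {
    d[[i]] = data[(pos_help[i-1]+1):pos_help[i]]
  }
}
# Pad every row with NA to the longest length, then bind the rows together
dataFrame = as.data.frame(do.call(rbind,lapply(d, "length<-", max(lengths(d)))))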

This first puts a newline character, "\n", after every "END_ROW_WORD" marker, then pastes the result into one long character string.
Then it uses read.table to read the data in from a text connection.

end <- "END_ROW_WORD"

inx <- c(0, grep(end, dat[[1]]))
s <- NULL
for(i in seq_along(inx)[-1]){
    s <- c(s, dat[[1]][(inx[(i - 1)] + 1):inx[i]], "\n")
}

con <- textConnection(paste(s, collapse = " "))
result <- read.table(con, fill = TRUE)
close(con)
result
#  V1 V2 V3           V4 V5           V6
#1  D  2  f            h  k END_ROW_WORD
#2  k  1  2 END_ROW_WORD                
#3  e  g  j            2  k END_ROW_WORD

DATA.

dat <-
structure(list(V1 = c("D", "2", "f", "h", "k", "END_ROW_WORD", 
"k", "1", "2", "END_ROW_WORD", "e", "g", "j", "2", "k", "END_ROW_WORD"
)), .Names = "V1", class = "data.frame", row.names = c(NA, -16L
))

EDIT.

After the OP edited the question, I revised the code to see whether that file can be read properly into a data.frame. The main difficulty is that the file has many non-printable characters, and read.table was having trouble getting to the end of the file.

Credit for the solution to this problem goes to the accepted answer in "read.csv warning 'EOF within quoted string' prevents complete reading of file". I upvoted both the question and that answer.

Credit must also be given to @kath: the idea in that answer of using a string replacement to insert newline characters as EOL markers is much better than my ugly for loop above. Unlike kath, I use base R only; I don't find it necessary to load an external package.

Now the revised code.

# Use this first pattern if AUCTION also marks the end of a row
#pattern <- "(^LOT|^AUCTION)"
pattern <- "(^LOT)"

dat <- readLines("data_.csv")
s <- gsub("[[:cntrl:]]", "", dat)
s <- sub(pattern, "\\1\n", s)

con <- textConnection(paste(s, collapse = "\t"))
result <- read.table(con, sep = "\t", fill = TRUE, quote = "", row.names = NULL)
close(con)

head(result)
tail(result)
str(result)

I thought that there would be some empty rows, so I checked with the following code.

#
# See if there are any empty rows
#
empty <- apply(result, 1, function(x) nchar(trimws(paste0(x, collapse = ""))) == 0)
sum(empty)
#[1] 0
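
For the small example data, the same string-replacement idea can be sketched in base R without the for loop. This is an illustration only: it rebuilds the example column inline, assumes the marker always appears as an exact standalone value, and uses read.table's text argument instead of an explicit text connection.

```r
# Example data from the question, rebuilt inline so the sketch is self-contained
dat <- data.frame(
  V1 = c("D", "2", "f", "h", "k", "END_ROW_WORD",
         "k", "1", "2", "END_ROW_WORD",
         "e", "g", "j", "2", "k", "END_ROW_WORD"),
  stringsAsFactors = FALSE
)

end <- "END_ROW_WORD"
# Append a newline to every element that is exactly the marker
s <- sub(paste0("^", end, "$"), paste0(end, "\n"), dat[[1]])
# Join everything with spaces; each marker now closes one line of text
txt <- paste(s, collapse = " ")
# fill = TRUE pads the shorter rows so all rows have the same number of columns
result <- read.table(text = txt, fill = TRUE, stringsAsFactors = FALSE)
result
```

Anchoring the pattern with ^ and $ keeps values that merely contain the marker text from being split.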

Without a loop, but using map and split (because why not :p):

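library(tidyverse)
df <- tibble(x = c(
  "D", "2", "f", "h", "k", "END_ROW_WORD",
  "k", "1", "2", "END_ROW_WORD",
  "e", "g", "j", "2", "k", "END_ROW_WORD"
))
# Cut the row indices at each END_ROW_WORD position: one group per output row
groups <- split(df, cut(seq_len(nrow(df)), breaks = c(0, which(df$x == "END_ROW_WORD"))))
# Length of the longest row (6 for this example)
width <- max(map_int(groups, nrow))
# Pad each group with NA to that length, bind as columns, then transpose
groups %>%
  map_dfc(~rbind(.x, tibble(x = rep(NA, width - nrow(.x))))) %>%
  t() %>% as.data.frame()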
