Keep row numbers in a data frame column

Question

I have a bunch of .txt files (articles) in a folder, and I use a for loop to read the text from all of them into R:

input_loc <- "C:/Users/User/Desktop/Folder"
files <- dir(input_loc, full.names = TRUE)
text <- c()
for (f in files) {
  text <- c(text, paste(readLines(f), collapse = "\n"))
}

From there, I tokenize by paragraph to get every paragraph of every article:

paragraphs <- tokenize_paragraphs(text)
sapply(paragraphs, length)
paragraphs

Then I unlist the result and convert it to a data frame:

par_unlisted<-unlist(paragraphs)
par_unlisted
par_unlisted_df<-as.data.frame(par_unlisted)

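But by doing that, the paragraph numbering no longer restarts for each article. For example, if the first article has 6 paragraphs, the first paragraph of the second article still has a [1] in front of it before unlisting, but after unlisting it gets a [7]. What I would like, once I have the data frame, is one column with the paragraph number and another column named "article" with the article number. Thanks in advance.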

EDIT: This is roughly what I get when I enter paragraphs:

> paragraphs
[[1]]
[1] "The Miami Dolphins have decided to use their non-exclusive franchise 
tag on wide receiver Jarvis Landry."                                                                                                                                                                                                                                         

[2] "The Dolphins tweeted the announcement Tuesday, the first day teams 
could use their franchise or transition tags. The salary for wide receivers 
getting the franchise tag this offseason is expected to be around $16.2 
million, which will be quite the raise for Landry, who made $894,000 last 
season."    
[[2]]
[1] "Despite months of little-to-no movement on contract negotiations, 
Jarvis Landry has often stated his desire to stay in Miami."                                                                                                                                                                                                                                                                                                  

[2] "The Dolphins used their lone tool to wipe away negotation-driven stress 
-- at least in the immediate future -- and ensure Landry won't be lured away 
from Miami, placing the franchise tag on the receiver on Tuesday, the team 
announced."

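I would like to keep the paragraph number ([n]) as a column in the data frame, because once I unlist them the paragraphs are no longer separated by article and then by paragraph, but numbered sequentially. Basically, in the example I just posted, I had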

[[1]]
[1] ...
[2] ...

[[2]]
[1] ...
[2] ...

but I get

[1] ...
[2] ...
[3] ...
[4] ...

Answer 1

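Consider iterating over the list of paragraphs, building a list of data frames that carry the needed article and paragraph numbers, and row-binding all the data frames at the end.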

Input data

paragraphs <- list(
     c("The Miami Dolphins have decided to use their non-exclusive franchise tag on wide receiver Jarvis Landry.",   
        "The Dolphins tweeted the announcement Tuesday, the first day teams could use their franchise or transition tags. The salary for wide receivers 
getting the franchise tag this offseason is expected to be around $16.2 million, which will be quite the raise for Landry, who made $894,000 last 
season."),
     c("Despite months of little-to-no movement on contract negotiations, Jarvis Landry has often stated his desire to stay in Miami.",
      "The Dolphins used their lone tool to wipe away negotation-driven stress -- at least in the immediate future -- and ensure Landry won't be lured away 
from Miami, placing the franchise tag on the receiver on Tuesday, the team announced."))

Data frame build

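# Build one data frame per article; i is the article's position in the list
df_list <- lapply(seq_along(paragraphs), function(i)
  data.frame(article_num = i,
             paragraph_num = seq_along(paragraphs[[i]]),
             paragraph = paragraphs[[i]],
             stringsAsFactors = FALSE)  # keep paragraphs as character, not factor
)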

final_df <- do.call(rbind, df_list)

Output

final_df

#   article_num paragraph_num                                             paragraph
# 1           1             1 The Miami Dolphins have decided to use their non-e...
# 2           1             2 The Dolphins tweeted the announcement Tuesday, the...
# 3           2             1 Despite months of little-to-no movement on contrac...
# 4           2             2 The Dolphins used their lone tool to wipe away neg...
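
If you would rather not build one data frame per article, the same columns can probably be produced in a single step from the list itself. The following is only a sketch in base R, not part of the approach above: it assumes paragraphs is a list of character vectors, one per article, and the short sample strings are made up for illustration.

```r
# Hypothetical stand-in for the tokenized articles (list of character vectors)
paragraphs <- list(
  c("First paragraph of article 1.", "Second paragraph of article 1."),
  c("First paragraph of article 2.", "Second paragraph of article 2.",
    "Third paragraph of article 2.")
)

# Number of paragraphs in each article
n_par <- lengths(paragraphs)

final_df <- data.frame(
  # Article index, repeated once for each of that article's paragraphs
  article_num   = rep(seq_along(paragraphs), times = n_par),
  # Paragraph counter that restarts at 1 for every article
  paragraph_num = sequence(n_par),
  # Flattened paragraph text, in the same order as the two index columns
  paragraph     = unlist(paragraphs),
  stringsAsFactors = FALSE
)

final_df
```

Because rep(), sequence() and unlist() all walk the list in the same order, the three columns line up row by row, and no rbind is needed at the end.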

Solution 1
0 Accepted 2018-02-21 15:20:28
