
Merging a txt file into a data frame in R

I have a txt file with more than 100,000 lines of data. I want to turn it into a data frame, but I don't need every line. A sample entry looks like this:

FN Clarivate Analytics Web of Science
VR 1.0
PT J
AU Yang, Qiang
   Liu, Yang
   Chen, Tianjian
   Tong, Yongxin
TI Federated Machine Learning: Concept and Applications
SO ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY
VL 10
IS 2
AR 12
DI 10.1145/3298981
DT Article
PD FEB 2019
PY 2019
AB Today's artificial intelligence still faces two major challenges (...) etc. 

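I only want the lines that start with TI, AU, PD, and AB, extracted into correspondingly named columns. This is as far as I have got, and I am really struggling!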

read.table("groupprojectdatabase.txt", header = FALSE, sep = ",", quote = "",
           dec = ".", numerals = c("allow.loss"),
           row.names = c("TI", "AU", "PB","AB"), col.names = c('title_col','author_col','date_col','summary_col'), as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = FALSE,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = FALSE,
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

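Any help would be greatly appreciated, even if it is just the name of a function I need to look up, or whether I am on the right track. I think the sep = argument is relevant, but I don't know how to tell it to skip everything except the TI, AU, PD, and AB lines.

In particular, I am not sure how to make R treat a whole sentence as one value rather than splitting it into individual words.

The call above fails with: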

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 1 did not have 4 elements

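I made a file test.txt from your data above. After running into some problems with read.table, I switched to readr::read_delim from the tidyverse.

This reads the file line by line. Each line is then split at the first whitespace, i.e. after the first two letters.

Because four lines belong together (the AU entry, whose continuation lines have no leading two letters), the last part of the code below pastes those lines together.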
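The two steps described above (split at the first whitespace, then attach continuation lines to the tag above them) can also be sketched in base R on a few lines copied from the question. This is only an illustration: the variable names (`lines`, `tag`, `text`, `idx`, `fields`) and the "; " separator between authors are assumptions, not part of the original answer.

```r
# Sample lines copied from the question; continuation lines start with spaces.
lines <- c(
  "AU Yang, Qiang",
  "   Liu, Yang",
  "   Chen, Tianjian",
  "   Tong, Yongxin",
  "TI Federated Machine Learning: Concept and Applications",
  "PD FEB 2019"
)

# The first two characters hold the tag; the text starts at character 4.
tag  <- substr(lines, 1, 2)
text <- trimws(substring(lines, 4))

# Continuation lines have a blank tag. Count the non-blank tags seen so far,
# so every continuation line gets the index of the field it belongs to.
is_new <- trimws(tag) != ""
idx    <- cumsum(is_new)

# Join all lines of the same field into one string and name it by its tag.
fields <- vapply(split(text, idx), paste, character(1), collapse = "; ")
names(fields) <- tag[is_new]
fields
```

Joining with "; " instead of an empty string keeps the individual authors easy to split apart again later.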

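library(tidyverse)

# Assuming the text contains no ";", every line is read into a single column.
# The first line of the file becomes the column name.
df <- read_delim("path_to_your/test.txt", delim = ";", col_names = TRUE)

ddf <- df |> 
  # Split each line at the first space: two-letter tag, then the rest
  separate(`FN Clarivate Analytics Web of Science`, 
           into = c("first", "rest"), 
           sep = " ", extra = "merge") |> 
  # Continuation lines have an empty tag; fill it in from the line above
  mutate(first = ifelse(first == "", NA, first)) |> 
  fill(first) |> 
  # Paste all lines with the same tag together and keep one row per tag
  group_by(first) |> 
  mutate(rest = paste0(rest, collapse = "")) |> 
  distinct(first, .keep_all = TRUE)

# Keep only the requested fields
ddf |> 
  filter(first %in% c('TI', 'AU', 'PD', 'AB'))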

#> # A tibble: 4 × 2
#> # Groups:   first [4]
#>   first rest                                                            
#>   <chr> <chr>                                                           
#> 1 AU    Yang, Qiang  Liu, Yang  Chen, Tianjian  Tong, Yongxin           
#> 2 TI    Federated Machine Learning: Concept and Applications            
#> 3 PD    FEB 2019                                                        
#> 4 AB    Today's artificial intelligence still faces two major challenges
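
The result above has one row per tag for a single record. For a file with many records, a possible extension is sketched below in base R. It rests on assumptions not confirmed by the question: that every record starts with a PT line, and that multiple authors should be joined with "; ". The second record in the sample is a made-up placeholder, and `wanted`, `record`, and `result` are illustrative names. The column names are taken from the col.names in the question.

```r
# Two records in the layout from the question. The second record is a
# made-up placeholder; starting a new record at each "PT" line is an assumption.
lines <- c(
  "FN Clarivate Analytics Web of Science",
  "VR 1.0",
  "PT J",
  "AU Yang, Qiang",
  "   Liu, Yang",
  "TI Federated Machine Learning: Concept and Applications",
  "PD FEB 2019",
  "AB Today's artificial intelligence still faces two major challenges",
  "",
  "PT J",
  "AU Doe, Jane",
  "TI Placeholder title",
  "PD JAN 2020",
  "AB Placeholder abstract"
)
# With a real file: lines <- readLines("groupprojectdatabase.txt")

lines <- lines[nzchar(trimws(lines))]   # drop blank lines
tag   <- substr(lines, 1, 2)
text  <- trimws(substring(lines, 4))

# Carry each tag forward over its continuation lines (blank tag).
is_new <- trimws(tag) != ""
tag    <- tag[is_new][cumsum(is_new)]

# Number the records: the counter increases at every "PT" line.
record <- cumsum(is_new & tag == "PT")

# Keep only the wanted fields and build one row per record.
wanted <- c(TI = "title_col", AU = "author_col", PD = "date_col", AB = "summary_col")
keep   <- tag %in% names(wanted) & record > 0
joined <- tapply(text[keep],
                 list(record[keep], factor(tag[keep], levels = names(wanted))),
                 paste, collapse = "; ")
result <- as.data.frame(joined, stringsAsFactors = FALSE)
names(result) <- unname(wanted)
result
```

Each row of `result` is one record, and a field missing from a record shows up as NA in that row.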

