Extracting list data from a text file in R

There is probably already a post on this topic, but I'm not sure what terms to search for. I am trying to import data from a txt file with this format (the first 2 lines of which are not of interest):

FN Clarivate Analytics Web of Science
VR 1.0
PT J
AU Ahituv, Nadav
   Zhu, Yiwen
   Visel, Axel
   Holt, Amy
   Afzal, Veena
   Pennacchio, Len A.
   Rubin, Edward M.
TI Deletion of ultraconserved elements yields viable mice
SO PLOS BIOLOGY
VL 5
IS 9
BP 1906
EP 1911
AR e234
DI 10.1371/journal.pbio.0050234
PD SEP 2007
PY 2007
RI Visel, Axel/A-9398-2009; Ahituv, Nadav/; Pennacchio, Len/
OI Visel, Axel/0000-0002-4130-7784; Ahituv, Nadav/0000-0002-7434-8144;
   Pennacchio, Len/0000-0002-8748-3732
SN 1544-9173
UT WOS:000249552300010
PM 17803355
ER

PT J
AU Ahmadiyeh, Nasim
   Pomerantz, Mark M.
   Grisanzio, Chiara
   Herman, Paula
   Jia, Li
   Almendro, Vanessa
   He, Housheng Hansen
   Brown, Myles
   Liu, X. Shirley
   Davis, Matt
   Caswell, Jennifer L.
   Beckwith, Christine A.
   Hills, Adam
   MacConaill, Laura
   Coetzee, Gerhard A.
   Regan, Meredith M.
   Freedman, Matthew L.
TI 8q24 prostate, breast, and colon cancer risk loci show tissue-specific
   long-range interaction with MYC
SO PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF
   AMERICA
VL 107
IS 21
BP 9742
EP 9746
DI 10.1073/pnas.0910668107
PD MAY 25 2010
PY 2010
RI Davis, Matt/F-9045-2012; He, Housheng/G-9614-2011; he, housheng hansen/; Caswell-Jin, Jennifer/; Brown, Myles/
OI he, housheng hansen/0000-0003-2898-3363; Caswell-Jin,
   Jennifer/0000-0002-5711-8355; Brown, Myles/0000-0002-8213-1658
SN 0027-8424
UT WOS:000278054700049
PM 20453196
ER

Since some of the categories (e.g. AU) have more than one object, I think I need to import the data as a list. The category labels are all 2 characters followed by a space, but some categories span more than one line, and the subsequent lines are not labeled with the category label. In addition, for some categories that take up more than one line, such as AU, I would like the data to be imported as a vector. For others, such as TI or SO, I would like to concatenate the multiple lines into one object of class character in the list.

I would like the entries to look something like this:

print(<portion of list that corresponds to AU for first reference>)
[AU]
[[1]] "Ahituv, Nadav"      "Zhu, Yiwen"         "Visel, Axel"        "Holt, Amy"          "Afzal, Veena"      
[[6]] "Pennacchio, Len A." "Rubin, Edward M."

print(<portion of list that corresponds to TI and SO for second reference>)
[TI]
[[1]] "8q24 prostate, breast, and colon cancer risk loci show tissue-specific long-range interaction with MYC"
[SO]
[[1]] "PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA"

I've tried scan() with the following code:

scan("savedrecs_spitz refs.txt", what = "character", sep = "\n")

However, what gets read in is a single character vector in which each line of the txt file is a separate element:

[1] "FN Clarivate Analytics Web of Science" "VR 1.0"                                  
[3] "PT J"                                     "AU Ahituv, Nadav"                        
[5] "   Zhu, Yiwen"                            "   Visel, Axel"

Is there a different function I should be using to read in these data?

Is this what you're looking for?

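dt <- scan("savedrecs_spitz refs.txt", what = "character", sep = "\n")

mgrep <- function(dt) {
  # Lines that open a field: a two-character tag, then a space or end of line
  v <- grep("^[A-Z][A-Z0-9]( |$)", dt)
  ret <- vector("list", length(v))
  for (i in seq_along(v)) {
    # A field runs up to the line before the next tag (or the end of the data)
    end <- if (i == length(v)) length(dt) else v[i + 1] - 1
    # Drop the tag (or the matching indent) and keep the values
    ret[[i]] <- trimws(substring(dt[v[i]:end], 4))
  }
  # Name each element after its tag
  names(ret) <- substr(dt[v], 1, 2)
  ret
}
mgrep(dt)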

P.S. Watch out for special characters read in by mistake, for example in front of "FN" on the first line; such a line will not be handled properly inside the function.
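The function above returns one flat list of fields, while the question asks for one entry per reference, with AU kept as a vector and TI or SO joined into a single string. Here is a minimal sketch of that, under a few assumptions: every record starts with a "PT" line and ends with "ER", the first line of the input carries a tag, and a trailing "EF" line (if any) can be ignored. `parse_records` and `keep_vector` are hypothetical names, and the sample lines are abridged from the question.

```r
# Sample input: the two records from the question, abridged (illustrative only)
lines <- c(
  "FN Clarivate Analytics Web of Science",
  "VR 1.0",
  "PT J",
  "AU Ahituv, Nadav",
  "   Zhu, Yiwen",
  "TI Deletion of ultraconserved elements yields viable mice",
  "SO PLOS BIOLOGY",
  "ER",
  "PT J",
  "AU Ahmadiyeh, Nasim",
  "   Pomerantz, Mark M.",
  "TI 8q24 prostate, breast, and colon cancer risk loci show tissue-specific",
  "   long-range interaction with MYC",
  "SO PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF",
  "   AMERICA",
  "ER"
)

# Hypothetical helper: tags listed in keep_vector stay as character vectors,
# every other multi-line field is joined into one string
parse_records <- function(lines, keep_vector = c("AU")) {
  lines <- lines[nzchar(trimws(lines))]        # drop blank lines
  is_cont <- grepl("^[[:blank:]]", lines)      # indented continuation lines
  stopifnot(!is_cont[1])                       # assumption: first line is tagged
  tag <- substr(lines, 1, 2)
  # Carry the most recent tag forward onto the continuation lines
  tag <- tag[!is_cont][cumsum(!is_cont)]
  value <- trimws(substring(lines, 4))
  # Each "PT" line opens a new record; record 0 is the file header
  rec <- cumsum(tag == "PT")
  keep <- rec > 0 & !(tag %in% c("ER", "EF"))
  lapply(split(seq_along(lines)[keep], rec[keep]), function(idx) {
    # Group the values by tag, preserving the order the tags appear in
    fields <- split(value[idx], factor(tag[idx], levels = unique(tag[idx])))
    collapse <- !(names(fields) %in% keep_vector)
    fields[collapse] <- lapply(fields[collapse], paste, collapse = " ")
    fields
  })
}

records <- parse_records(lines)
records[[1]]$AU   # character vector with one element per author
records[[2]]$TI   # the two title lines joined into one string
```

With real data, `lines` would come from `scan(..., sep = "\n")` as in the question. Adding tags to `keep_vector` keeps more fields as vectors instead of joined strings.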

I think I have solved your problem, but I store the data in a data.frame.

text <- scan("text.txt", sep = "\n", what = "character")

# Keep tagged lines and indented continuation lines
textLoop <- grep("^[[:upper:]]|^[[:blank:]]", text, value = TRUE)

# Prefix each continuation line with the tag of the line above it
for (i in seq_along(textLoop)) {
  if (i > 1 && grepl("^[[:blank:]]", textLoop[i])) {
    partOne <- substring(textLoop[i - 1], 1, 2)
    textLoop[i] <- paste0(partOne, textLoop[i])
  }
}

# One row per line: the tag, and the value with the leftover indent trimmed
textDf <- data.frame(partOne = substring(textLoop, 1, 2),
                     partTwo = trimws(substring(textLoop, 4)))
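The data frame above has one row per input line, so the rows still need to be grouped by reference to get close to the format in the question. One possible way, sketched below, is to number the records with `cumsum()` on the "PT" rows and then split or join the values per record. The small `textDf` here is built by hand in the same two-column shape, with abridged values, and the `record` column, `body`, `authors` and `titles` are names made up for this example.

```r
# A small data frame in the two-column shape described above (abridged values)
textDf <- data.frame(
  partOne = c("FN", "PT", "AU", "AU", "TI", "ER",
              "PT", "AU", "TI", "TI", "ER"),
  partTwo = c("Clarivate Analytics Web of Science", "J",
              "Ahituv, Nadav", "Zhu, Yiwen",
              "Deletion of ultraconserved elements yields viable mice", "",
              "J", "Ahmadiyeh, Nasim",
              "8q24 prostate, breast, and colon cancer risk loci show tissue-specific",
              "long-range interaction with MYC", ""),
  stringsAsFactors = FALSE
)

# Number the records: each "PT" row opens a new one (0 = file header)
textDf$record <- cumsum(textDf$partOne == "PT")
body <- textDf[textDf$record > 0 & textDf$partOne != "ER", ]

# Authors: one character vector per record
is_au <- body$partOne == "AU"
authors <- split(body$partTwo[is_au], body$record[is_au])

# Titles: the lines of each record joined into a single string
is_ti <- body$partOne == "TI"
titles <- vapply(split(body$partTwo[is_ti], body$record[is_ti]),
                 paste, character(1), collapse = " ")

authors[["1"]]
titles[["2"]]
```

Both results are indexed by the record number as a string, so `authors[["1"]]` and `titles[["1"]]` refer to the same reference.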
