
Reading a text file in R and storing what is read conditioned on the next line

I have a .txt file that has this format:

--------------------------------------------------------------------------------------------------------------
m5a2                                                     A2. Confirm how much time child lives with respondent
--------------------------------------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  BM_101F

                 range:  [-9,7]                       units:  1
         unique values:  8                        missing .:  0/4898

            tabulation:  Freq.   Numeric  Label
                          1383        -9  -9 Not in wave
                             4        -2  -2 Don't know
                             2        -1  -1 Refuse
                          3272         1  1 all or most of the time
                            29         2  2 about half of the time
                            76         3  3 some of the time
                            80         4  4 none of the time
                            52         7  7 only on weekends

--------------------------------------------------------------------------------------------------------------
m5a3                                                    A3. Number of months ago child stopped living with you
--------------------------------------------------------------------------------------------------------------

                  type:  numeric (int)
                 label:  NUMERIC, but 44 nonmissing values are not labeled

                 range:  [-9,120]                     units:  1
         unique values:  47                       missing .:  0/4898

              examples:  -9    -9 Not in wave
                         -6    -6 Skip
                         -6    -6 Skip
                         -6    -6 Skip

--------------------------------------------------------------------------------------------------------------

What is important to me is the code name, such as m5a2; the description, such as "A2. Confirm how much time child lives with respondent"; and lastly, the values of the responses:

tabulation:  Freq.   Numeric  Label
                          1383        -9  -9 Not in wave
                             4        -2  -2 Don't know
                             2        -1  -1 Refuse
                          3272         1  1 all or most of the time
                            29         2  2 about half of the time
                            76         3  3 some of the time
                            80         4  4 none of the time
                            52         7  7 only on weekends

I need to read the three items into a list for further processing.

I have tried the following, and it works on retrieving the codename and description.

fileName <- "../data/ff_mom_cb9.txt"
conn <- file(fileName,open="r")
linn <-readLines(conn)
L = list()
for (i in 1:length(linn)){
  if((linn[i]=="--------------------------------------------------------------------------------------------------------------") & (linn[i+1]!=""))
  {
    L[i] = linn[i+1]
  }

  else
  {
    # read until hit the next dashed line
  }
}
close(conn)

A few things I am confused about: 1. I have no idea how to make it keep reading lines until it hits the next dashed line. 2. Is storing the data in a list the right approach if I want to be able to view, search, and easily retrieve it?

Thanks.

This will be somewhat problematic because the format is so irregular from item to item. Here's a run at parsing the codebook text of the first item:

txt <- "m5a2                                                     A2. Confirm how much time child lives with respondent
--------------------------------------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  BM_101F

                 range:  [-9,7]                       units:  1
         unique values:  8                        missing .:  0/4898

            tabulation:  Freq.   Numeric  Label
                          1383        -9  -9 Not in wave
                             4        -2  -2 Don't know
                             2        -1  -1 Refuse
                          3272         1  1 all or most of the time
                            29         2  2 about half of the time
                            76         3  3 some of the time
                            80         4  4 none of the time
                            52         7  7 only on weekends
"
Lines <- readLines( textConnection(txt))
 # isolate lines with letter in first column
 Lines[grep("^[a-zA-Z]", Lines)]
# Now replace long runs of spaces with commas and scan:

scan(text=sub("[ ]{10,100}", ",", Lines[grep("^[a-zA-Z]", Lines)] ),
     sep=",", what="")
#----
Read 2 items
[1] "m5a2"                                                 
[2] "A2. Confirm how much time child lives with respondent"

The "tabulation" line can be used to create column labels.

colnames <- scan(text=sub(".*tabulation[:]", "",
                     Lines[grep("tabulation[:]", Lines)] ), sep="", what="")
#Read 3 items

The substitution-with-commas strategy needs to be a bit more involved for the lines that follow. First isolate the rows where a digit is the first non-space character:

dataRows <- Lines[grep("^[ ]*\\d", Lines)]

Then replace each run of two or more spaces that follows a digit with a comma, and read the result with read.csv:

 myDat <- read.csv(text=  
                      gsub("(\\d)[ ]{2,}", "\\1,", dataRows ), 
                   header=FALSE ,col.names=colnames)

#------------
 myDat
  Freq. Numeric                     Label
1  1383      -9            -9 Not in wave
2     4      -2             -2 Don't know
3     2      -1                 -1 Refuse
4  3272       1 1 all or most of the time
5    29       2  2 about half of the time
6    76       3        3 some of the time
7    80       4        4 none of the time
8    52       7        7 only on weekends

Looping over multiple items should be possible with a record counter generated from cumsum(grepl("^-------", Lines)), provided the Lines object holds the entire file, such as the one at:

 Lines <- readLines("http://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_mom_cb9.txt")
sum( grepl("^-------", Lines) )
#----------------------
[1] 1966
Warning messages:
1: In grepl("^-------", Lines) :
  input string 6995 is invalid in this locale
2: In grepl("^-------", Lines) :
  input string 7349 is invalid in this locale
3: In grepl("^-------", Lines) :
  input string 7350 is invalid in this locale
4: In grepl("^-------", Lines) :
  input string 7352 is invalid in this locale
5: In grepl("^-------", Lines) :
  input string 7353 is invalid in this locale

My "hand-held scan()-er" suggested to me that there were only two types of codebook record: "tabulations" (presumably items with fewer than 10 or so distinct values) and "examples" (items with more). They have different structures (as can be seen above in your codebook fragment), so perhaps only two kinds of parsing logic would need to be built and deployed. I think the tools to do that are described above.
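The record-splitting idea above can be sketched roughly as follows. This is a minimal sketch, not a tested parser for the full file: the shortened sample text, the helper name parse_record, and the list layout (code, description, values) are assumptions made for illustration, and only "tabulation" records are parsed, with "examples" records left with values = NULL.

```r
# Shortened sample in the same layout as the codebook (hypothetical data).
Lines <- c(
  "----------------------------------------",
  "m5a2                 A2. Confirm how much time child lives with respondent",
  "----------------------------------------",
  "",
  "                  type:  numeric (byte)",
  "",
  "            tabulation:  Freq.   Numeric  Label",
  "                          1383        -9  -9 Not in wave",
  "                             4        -2  -2 Don't know",
  "                          3272         1  1 all or most of the time",
  "",
  "----------------------------------------",
  "m5a3                 A3. Number of months ago child stopped living with you",
  "----------------------------------------",
  "",
  "              examples:  -9    -9 Not in wave",
  "                         -6    -6 Skip",
  "",
  "----------------------------------------"
)

# A header line is a non-empty line sandwiched between two dashed rules.
n <- length(Lines)
is_rule   <- grepl("^-{10,}", Lines)
is_header <- !is_rule & nzchar(Lines) &
  c(FALSE, is_rule[-n]) & c(is_rule[-1], FALSE)

# Every header starts a new record; cumsum() turns that into a record counter.
record_id <- cumsum(is_header)
records   <- split(Lines[record_id > 0], record_id[record_id > 0])

# Hypothetical helper: turn one record into list(code, description, values).
parse_record <- function(rec) {
  code        <- sub(" .*$", "", rec[1])            # text before first space
  description <- trimws(sub("^[^ ]+", "", rec[1]))  # everything after the code
  values <- NULL
  tab_at <- grep("tabulation:", rec, fixed = TRUE)
  if (length(tab_at) == 1) {
    # Rows look like: <Freq> <Numeric> <Label>, separated by runs of spaces.
    pat  <- "^ *([0-9]+) +(-?[0-9]+) *(.*)$"
    body <- rec[-seq_len(tab_at)]
    rows <- body[grepl(pat, body)]
    if (length(rows) > 0) {
      m <- do.call(rbind, regmatches(rows, regexec(pat, rows)))
      values <- data.frame(Freq    = as.integer(m[, 2]),
                           Numeric = as.integer(m[, 3]),
                           Label   = m[, 4],
                           stringsAsFactors = FALSE)
    }
  }
  list(code = code, description = description, values = values)
}

codebook <- lapply(records, parse_record)
names(codebook) <- vapply(codebook, function(r) r$code, character(1))
```

Naming the list by code name means an item can be looked up directly, for example codebook[["m5a2"]]$values, which also speaks to the question of whether a list is a reasonable container for later searching and retrieval.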

The warnings all relate to the byte "\x92" (the right single quotation mark in Windows-1252) being used as an apostrophe. Regex and R share the escape character "\", so you need to escape the escapes. The lines could be corrected with:

Lines <- gsub("\\\x92", "'", Lines )
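If that pattern is itself rejected as invalid in a UTF-8 locale, a byte-wise replacement is one possible alternative. The sketch below is an assumption rather than part of the original answer: the variable name bad_apostrophe and the one-line sample are hypothetical, and it relies only on the fixed and useBytes arguments of gsub().

```r
# Build the offending byte explicitly so the script itself stays pure ASCII.
bad_apostrophe <- rawToChar(as.raw(0x92))

# Stand-in for a line read from the codebook (hypothetical example data).
Lines <- paste0("Don", bad_apostrophe, "t know")

# fixed = TRUE: the pattern is taken literally, not as a regex.
# useBytes = TRUE: matching is done byte by byte, so the line does not
# have to be a valid string in the current locale.
Lines <- gsub(bad_apostrophe, "'", Lines, fixed = TRUE, useBytes = TRUE)
print(Lines)
```

Running this kind of cleanup before any grep() or grepl() call should avoid the "invalid in this locale" warnings, since the remaining text is plain ASCII.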

What about this?

df <- read.table("file.txt", 
             header = FALSE)
df


 