I have a .txt file that has this format:
--------------------------------------------------------------------------------------------------------------
m5a2 A2. Confirm how much time child lives with respondent
--------------------------------------------------------------------------------------------------------------
type: numeric (byte)
label: BM_101F
range: [-9,7] units: 1
unique values: 8 missing .: 0/4898
tabulation: Freq. Numeric Label
1383 -9 -9 Not in wave
4 -2 -2 Don't know
2 -1 -1 Refuse
3272 1 1 all or most of the time
29 2 2 about half of the time
76 3 3 some of the time
80 4 4 none of the time
52 7 7 only on weekends
--------------------------------------------------------------------------------------------------------------
m5a3 A3. Number of months ago child stopped living with you
--------------------------------------------------------------------------------------------------------------
type: numeric (int)
label: NUMERIC, but 44 nonmissing values are not labeled
range: [-9,120] units: 1
unique values: 47 missing .: 0/4898
examples: -9 -9 Not in wave
-6 -6 Skip
-6 -6 Skip
-6 -6 Skip
--------------------------------------------------------------------------------------------------------------
What is important to me is the codename, such as m5a2, the description, such as "A2. Confirm how much time child lives with respondent", and lastly, the values of the responses:
tabulation: Freq. Numeric Label
1383 -9 -9 Not in wave
4 -2 -2 Don't know
2 -1 -1 Refuse
3272 1 1 all or most of the time
29 2 2 about half of the time
76 3 3 some of the time
80 4 4 none of the time
52 7 7 only on weekends
I need to read the three items into a list for further processing.
I have tried the following, and it works on retrieving the codename and description.
fileName <- "../data/ff_mom_cb9.txt"
conn <- file(fileName,open="r")
linn <-readLines(conn)
L = list()
for (i in 1:length(linn)){
if((linn[i]=="--------------------------------------------------------------------------------------------------------------") & (linn[i+1]!=""))
{
L[i] = linn[i+1]
}
else
{
# read until hit the next dashed line
}
}
close(conn)
A few things I am confused about: 1. I have no idea how to make it keep reading lines until it hits the next dashed line. 2. Is storing the data in a list the right approach if I want to be able to visualize, search, and easily retrieve it?
Thanks.
This will be somewhat problematic because the format is so irregular from item to item. Here's a run at the codebook text for the first item:
txt <- "m5a2 A2. Confirm how much time child lives with respondent
--------------------------------------------------------------------------------------------------------------
type: numeric (byte)
label: BM_101F
range: [-9,7] units: 1
unique values: 8 missing .: 0/4898
tabulation: Freq. Numeric Label
1383 -9 -9 Not in wave
4 -2 -2 Don't know
2 -1 -1 Refuse
3272 1 1 all or most of the time
29 2 2 about half of the time
76 3 3 some of the time
80 4 4 none of the time
52 7 7 only on weekends
"
Lines <- readLines( textConnection(txt))
# isolate lines with letter in first column
Lines[grep("^[a-zA-Z]", Lines)]
# Now replace long runs of spaces with commas and scan:
scan(text=sub("[ ]{10,100}", ",", Lines[grep("^[a-zA-Z]", Lines)] ),
sep=",", what="")
#----
Read 2 items
[1] "m5a2"
[2] "A2. Confirm how much time child lives with respondent"
The "tabulation" line can be used to create column labels.
colnames <- scan(text=sub(".*tabulation[:]", "",
Lines[grep("tabulation[:]", Lines)] ), sep="", what="")
#Read 3 items
The substitution-with-commas strategy needs to be a bit more involved for the lines that follow. First isolate the rows where a numeric digit is the first non-space character:
dataRows <- Lines[grep("^[ ]*\\d", Lines)]
Then substitute a comma for each run of two or more spaces that follows a digit, and read the result with read.csv:
myDat <- read.csv(text=
gsub("(\\d)[ ]{2,}", "\\1,", dataRows ),
header=FALSE ,col.names=colnames)
#------------
myDat
V1 V2 V3
1 1383 -9 -9 Not in wave
2 4 -2 -2 Don't know
3 2 -1 -1 Refuse
4 3272 1 1 all or most of the time
5 29 2 2 about half of the time
6 76 3 3 some of the time
7 80 4 4 none of the time
8 52 7 7 only on weekends
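The three steps above could be wrapped into one helper that returns a list per item. This is only a sketch: parse_tabulation and sample_lines are hypothetical names, and the wide runs of spaces in the sample are an assumption about the fixed-width layout of the real file (header at column 1, body lines indented, at least two spaces between columns). Labels that contain commas would need extra handling.

```r
# Hypothetical helper: parse one "tabulation"-style codebook record.
# Assumes the header line starts in column 1 and all body lines are indented.
parse_tabulation <- function(Lines) {
  header <- Lines[grep("^[a-zA-Z]", Lines)][1]
  name   <- sub("[ ].*$", "", header)            # text before the first space
  desc   <- trimws(sub("^[^ ]+", "", header))    # everything after the name
  cols   <- scan(text = sub(".*tabulation[:]", "",
                            Lines[grep("tabulation[:]", Lines)]),
                 what = "", quiet = TRUE)
  rows   <- Lines[grep("^[ ]*\\d", Lines)]        # rows whose first non-space is a digit
  values <- read.csv(text = gsub("(\\d)[ ]{2,}", "\\1,", trimws(rows)),
                     header = FALSE, col.names = cols,
                     stringsAsFactors = FALSE)
  list(name = name, description = desc, values = values)
}

# Made-up sample with the assumed spacing
sample_lines <- c(
  "m5a2            A2. Confirm how much time child lives with respondent",
  "-----------------------------------------------------------------",
  "         type:  numeric (byte)",
  "        range:  [-9,7]                       units:  1",
  "   tabulation:  Freq.   Numeric  Label",
  "                 1383        -9  -9 Not in wave",
  "                    4        -2  -2 Don't know",
  "                 3272         1  1 all or most of the time"
)

item <- parse_tabulation(sample_lines)
item$name
item$description
item$values
```

Returning a list with name, description and values keeps the three pieces the question asks for together, so a list of such items can later be indexed by codename.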
Looping over multiple items might be possible with a counter generated from cumsum( grepl("^-------", Lines) ) if the Lines object were the entire file, such as the one at:
Lines <- readLines("http://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_mom_cb9.txt")
sum( grepl("^-------", Lines) )
#----------------------
[1] 1966
Warning messages:
1: In grepl("^-------", Lines) :
input string 6995 is invalid in this locale
2: In grepl("^-------", Lines) :
input string 7349 is invalid in this locale
3: In grepl("^-------", Lines) :
input string 7350 is invalid in this locale
4: In grepl("^-------", Lines) :
input string 7352 is invalid in this locale
5: In grepl("^-------", Lines) :
input string 7353 is invalid in this locale
My "hand-held scan()-er" suggested to me that there were only two types of codebook record: "tabulations" (presumably items with fewer than 10 or so distinct values) and "examples" (items with more). They have different structures (as can be seen above in your codebook fragment), so perhaps only two kinds of parsing logic would need to be built and deployed. I think the tools to do that are described above.
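As a rough sketch of the cumsum() counter idea, the lines can be split into one block per item and each block tagged by record type. The miniature codebook below is made up, and the code assumes every record has exactly two dashed lines (one above and one below its header line); anything before the first dashed line in a real file would end up in a block numbered 0.

```r
# Made-up miniature codebook; spacing is an assumption about the real layout
Lines <- c(
  "-----------------------------------------------------------------",
  "m5a2            A2. Confirm how much time child lives with respondent",
  "-----------------------------------------------------------------",
  "         type:  numeric (byte)",
  "   tabulation:  Freq.   Numeric  Label",
  "                 1383        -9  -9 Not in wave",
  "                 3272         1  1 all or most of the time",
  "-----------------------------------------------------------------",
  "m5a3            A3. Number of months ago child stopped living with you",
  "-----------------------------------------------------------------",
  "         type:  numeric (int)",
  "     examples:  -9    -9 Not in wave",
  "                -6    -6 Skip"
)

is_dash <- grepl("^-------", Lines)
counter <- cumsum(is_dash)        # goes up by one at every dashed line
record  <- ceiling(counter / 2)   # two dashed lines per record (assumption)

# One character vector per item, dashed lines dropped
blocks <- split(Lines[!is_dash], record[!is_dash])

# Name each block by its codename (first word of its header line)
names(blocks) <- vapply(blocks, function(b) sub("[ ].*$", "", b[1]),
                        character(1))

# Tag each block so the matching parsing logic can be chosen later
kind <- vapply(blocks, function(b) {
  if (any(grepl("tabulation:", b, fixed = TRUE))) "tabulation"
  else if (any(grepl("examples:", b, fixed = TRUE))) "examples"
  else "other"
}, character(1))

kind
blocks[["m5a2"]]
```

A named list like blocks also addresses the storage question: items can be retrieved by codename, and names(blocks) or kind can be searched with grep().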
The warnings all relate to the character "\x92" (the single byte 0x92, which is the right single quotation mark in Windows-1252) being used as an apostrophe. Regex and R share the escape character "\", so you need to escape the escapes. The offending characters could be replaced with:
Lines <- gsub("\\\x92", "'", Lines )
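An alternative that might avoid the regex escaping altogether is to convert the encoding first. The sketch below assumes the file is encoded as Windows-1252 (not confirmed by the source) and uses a made-up one-element vector in place of the real Lines:

```r
# Made-up example: "Don" + byte 0x92 + "t", as it would arrive from the file
x <- rawToChar(as.raw(c(0x44, 0x6f, 0x6e, 0x92, 0x74)))

# Convert from the assumed source encoding to UTF-8;
# byte 0x92 becomes the curly apostrophe U+2019
y <- iconv(x, from = "windows-1252", to = "UTF-8")

# Optionally replace the curly apostrophe with a plain ASCII one
z <- gsub("\u2019", "'", y, fixed = TRUE)
z
```

Applied to the whole file, the same two calls would take Lines in place of x; after conversion, grepl() should no longer warn about invalid strings.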
What about this?
df <- read.table("file.txt",
header = FALSE)
df