Reading tab delimited file in R missing rows

I'm attempting to read this file into R: https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/21447# (the commoncontent2012.tab file)

When I use read.delim(), everything seems fine at first, but I get only about two-thirds of the observations there should be. read.table() imports the correct number of rows, but then there are problems with the column names.

The file I think you are referring to is not a tab-separated file, despite what the website might lead you to assume. It is a Stata-formatted file with the extension '.dta', so use read.dta() from the foreign package:

 library(foreign)  # provides read.dta() for Stata files
 inp <- read.dta("~/Downloads/commoncontent2012.dta")
 str(inp)
# a really "wide" file
'data.frame':   54535 obs. of  479 variables:
 $ V101                           : int  162390854 162397903 162377974 164027062 164852532 166088596 162312322 162347328 162138459 162263731 ...
 $ V103                           : num  0.213 0.572 0.371 0.511 0.788 ...
 $ comptype                       : Factor w/ 13 levels "Windows Desktop",..: 2 1 1 1 2 1 1 1 2 2 ...
 $ inputzip                       : int  NA NA 92637 NA NA NA 33914 NA NA NA ...
 $ birthyr                        : int  1928 1947 1923 1967 1944 1956 1937 1931 1956 1954 ...
 $ gender                         : Factor w/ 4 levels "Male","Female",..: 1 1 2 2 1 1 2 1 1 1 ...
 $ educ                           : Factor w/ 8 levels "No HS","High school graduate",..: 6 5 6 3 6 5 3 2 3 6 ...
 $ race                           : Factor w/ 10 levels "White","Black",..: 1 1 1 1 3 1 1 1 1 1 ...
 $ hispanic                       : Factor w/ 4 levels "Yes","No","Skipped",..: 2 2 2 2 NA 2 2 2 2 2 ...
 $ votereg                        : Factor w/ 5 levels "Yes","No","Don't know",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ regzip                         : int  NA NA NA NA NA NA NA NA NA NA ...
 # snipped the rest of the output
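
The Factor columns in this output come from the value labels stored in the Stata file, which read.dta() converts to factors by default. As a small sketch of that behaviour (the toy data frame and the temporary file are made up for illustration and are not taken from the survey), convert.factors = FALSE returns the stored numeric codes instead:

```r
library(foreign)

# Hypothetical toy data standing in for the survey file.
toy <- data.frame(
  birthyr = c(1928L, 1947L, 1923L),
  gender  = factor(c("Male", "Male", "Female"), levels = c("Male", "Female"))
)

# Round-trip through a temporary Stata file.
dta_path <- tempfile(fileext = ".dta")
write.dta(toy, dta_path)

# Default: value labels become factor levels.
labelled <- read.dta(dta_path)
# Alternative: keep the stored numeric codes.
codes <- read.dta(dta_path, convert.factors = FALSE)

str(labelled$gender)
str(codes$gender)
```

This may help when comparing a .dta import against a plain-text export of the same data, where labelled variables typically appear as bare codes.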

I then also looked at the file named dataverse.zip, which, when expanded, includes a commoncontent2012.tab file. Reading that with read.delim() gives:

> inp2 <- read.delim("~/Downloads/dataverse_files/commoncontent2012.tab")
> str(inp2)
'data.frame':   30140 obs. of  479 variables:
 $ V101                           : int  162390854 162397903 162377974 164027062 164852532 166088596 162312322 162347328 162138459 162263731 ...
 $ V103                           : num  0.213 0.572 0.371 0.511 0.788 ...
 $ comptype                       : int  2 1 1 1 2 1 1 1 2 2 ...
 $ inputzip                       : int  NA NA 92637 NA NA NA 33914 NA NA NA ...
 $ birthyr                        : Factor w/ 78 levels "__NA__","1918",..: 12 31 7 51 28 40 21 15 40 38 ...
 $ gender                         : int  1 1 2 2 1 1 2 1 1 1 ...
 $ educ                           : int  6 5 6 3 6 5 3 2 3 6 ...
 $ race                           : int  1 1 1 1 3 1 1 1 1 1 ...
# rest of output deleted
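
Two things stand out in this output: birthyr became a factor because the file writes missing values as "__NA__", and the row count is lower than in the .dta version. A frequent cause of lost rows with read.delim() is an unbalanced quote character inside a text field, which makes R treat several lines as one field. Whether that is the cause here is an assumption I have not verified against the real file. A minimal sketch on a made-up file (the column names and values are hypothetical):

```r
# Build a tiny tab-delimited file that mimics two quirks: stray double
# quotes inside text fields and "__NA__" as the missing-value marker.
tmp <- tempfile(fileext = ".tab")
writeLines(c(
  "id\tbirthyr\tcomment",
  "1\t1928\tfine",
  "2\t__NA__\tsaid \"hello",
  "3\t1947\tfine",
  "4\t1923\tbye\" now",
  "5\t1967\tfine"
), tmp)

# Default read.delim() uses quote = "\"", so the text between the two
# stray quotes (including the line breaks) is read as a single field.
default_read <- read.delim(tmp)
nrow(default_read)  # fewer than the 5 data lines in the file

# Count raw lines with quoting and comments disabled, as a sanity check.
n_lines <- length(count.fields(tmp, sep = "\t", quote = "", comment.char = ""))
n_lines - 1  # data lines, excluding the header

# Disable quote handling and declare the missing-value marker.
fixed <- read.delim(tmp, quote = "", na.strings = c("NA", "__NA__"))
nrow(fixed)
str(fixed$birthyr)  # integer rather than factor/character
```

If the row count from count.fields() matches the expected number of observations, rereading the real file with quote = "" and the appropriate na.strings would be the next thing to try.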

How does this compare with what you think should be in these files, or with what you are seeing? You didn't say precisely what your problems were.
