Reading tab delimited file in R missing rows

I'm attempting to read this file into R: https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/21447# (the commoncontent2012.tab file).

When I use read.delim(), everything seems fine at first. However, only about two-thirds of the expected observations are imported. read.table() imports the correct number of rows, but then there are other problems with the column names.

The file (I thought) you mentioned is not a tab-separated file, despite what the website might lead you to assume. It is a Stata-formatted file with the extension '.dta', so use read.dta from the foreign package:

 require(foreign)
 inp <- read.dta("~/Downloads/commoncontent2012.dta")
 str(inp)
# a really "wide" file
'data.frame':   54535 obs. of  479 variables:
 $ V101                           : int  162390854 162397903 162377974 164027062 164852532 166088596 162312322 162347328 162138459 162263731 ...
 $ V103                           : num  0.213 0.572 0.371 0.511 0.788 ...
 $ comptype                       : Factor w/ 13 levels "Windows Desktop",..: 2 1 1 1 2 1 1 1 2 2 ...
 $ inputzip                       : int  NA NA 92637 NA NA NA 33914 NA NA NA ...
 $ birthyr                        : int  1928 1947 1923 1967 1944 1956 1937 1931 1956 1954 ...
 $ gender                         : Factor w/ 4 levels "Male","Female",..: 1 1 2 2 1 1 2 1 1 1 ...
 $ educ                           : Factor w/ 8 levels "No HS","High school graduate",..: 6 5 6 3 6 5 3 2 3 6 ...
 $ race                           : Factor w/ 10 levels "White","Black",..: 1 1 1 1 3 1 1 1 1 1 ...
 $ hispanic                       : Factor w/ 4 levels "Yes","No","Skipped",..: 2 2 2 2 NA 2 2 2 2 2 ...
 $ votereg                        : Factor w/ 5 levels "Yes","No","Don't know",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ regzip                         : int  NA NA NA NA NA NA NA NA NA NA ...
 # snipped the rest of the output
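
The Factor columns in the str() output above come from Stata value labels, which read.dta turns into R factors by default; the same variables appear as plain integers in a tab-delimited export. Here is a minimal sketch of that behaviour. It assumes the foreign package is installed, and the toy data frame and temporary file are made up for illustration (they are not the actual survey data):

```r
library(foreign)  # provides read.dta() and write.dta() for Stata files

# Hypothetical toy data standing in for the survey file.
toy <- data.frame(
  id = 1:4,
  gender = factor(c("Male", "Female", "Female", "Male"),
                  levels = c("Male", "Female"))
)

dta_path <- tempfile(fileext = ".dta")
write.dta(toy, dta_path)   # factors are written as value-labelled integers

# Default: value labels are converted back into factor levels.
labelled <- read.dta(dta_path)
# Keep the underlying integer codes instead of the labels.
codes <- read.dta(dta_path, convert.factors = FALSE)

str(labelled$gender)
str(codes$gender)
```

With convert.factors = FALSE the column should look like the integer-coded gender column in a .tab export, which is one reason the two str() listings for the same data can differ.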

But then I also looked at the file named dataverse.zip, which when expanded included a commoncontent2012.tab file. When read with read.delim I get:

> inp2 <- read.delim("~/Downloads/dataverse_files/commoncontent2012.tab")
> str(inp2)
'data.frame':   30140 obs. of  479 variables:
 $ V101                           : int  162390854 162397903 162377974 164027062 164852532 166088596 162312322 162347328 162138459 162263731 ...
 $ V103                           : num  0.213 0.572 0.371 0.511 0.788 ...
 $ comptype                       : int  2 1 1 1 2 1 1 1 2 2 ...
 $ inputzip                       : int  NA NA 92637 NA NA NA 33914 NA NA NA ...
 $ birthyr                        : Factor w/ 78 levels "__NA__","1918",..: 12 31 7 51 28 40 21 15 40 38 ...
 $ gender                         : int  1 1 2 2 1 1 2 1 1 1 ...
 $ educ                           : int  6 5 6 3 6 5 3 2 3 6 ...
 $ race                           : int  1 1 1 1 3 1 1 1 1 1 ...
# rest of output deleted
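
One common cause of a row count like this (not confirmed for this particular file) is a stray double quote inside a text field: read.delim treats `"` as a quote character by default, so everything up to the next `"` is swallowed into a single field, line breaks included. The `"__NA__"` level in birthyr also suggests the export uses `__NA__` as a missing-value code. The sketch below builds a small hypothetical file with both features and compares the default call with `quote = ""` and an extended `na.strings`; the file contents and column names are invented for illustration:

```r
# Hypothetical miniature of a tab-delimited export: 30 data rows, one
# "__NA__" missing-value code, and two stray double quotes.
path <- tempfile(fileext = ".tab")
rows <- sprintf("%d\t%s\t%d", 1:30, "plain text", 1950:1979)
rows[12] <- "12\tplain text\t__NA__"
rows[20] <- "20\tsaid \"hi\t1969"          # a quote opens here ...
rows[25] <- "25\tand left\" early\t1974"   # ... and only closes here
writeLines(c("id\tcomment\tbirthyr", rows), path)

# Number of data lines in the raw file, excluding the header.
n_lines <- length(readLines(path)) - 1L

# Default read.delim: quote = "\"" is active.
loose <- read.delim(path)

# Quoting disabled, and "__NA__" treated as missing.
strict <- read.delim(path, quote = "", na.strings = c("NA", "__NA__"))

cat("data lines in file:  ", n_lines, "\n")
cat("rows, default quote: ", nrow(loose), "\n")
cat("rows, quote = \"\":    ", nrow(strict), "\n")
str(strict$birthyr)
```

Comparing nrow() against the raw line count is a quick way to check whether rows are being merged. If the counts only agree with `quote = ""`, embedded quote characters are the likely culprit.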

So how does this compare with what you think should be in these files, or with what you are seeing? You didn't say precisely what your problems were.
