How to read a .data file into R
I have tried to load the data from http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data into R using the following piece of code:
hData <- read.table(file.choose(), sep = "\t", dec = ",", fileEncoding = "UTF-16")
but it does not load the data correctly. The data has 76 attributes, and the details are given here: http://archive.ics.uci.edu/ml/datasets/Heart+Disease
Can someone tell me what I am doing wrong?
The file contains extra line breaks that are causing issues: each record is wrapped over several lines. If you chop them out with a regex, you can read it in:
# read file into a single string
x <- readr::read_file('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data')
# or in base, x <- paste(readLines(url('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data')), collapse = '\n')
# gsub out line breaks that follow numbers (not "name") and read data
df <- read.table(text = gsub('(\\d)\\n', '\\1 ', x))
head(df, 2)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
## 1 1254 0 40 1 1 0 0 -9 2 140 0 289 -9 -9 -9 0 -9 -9 0 12 16 84 0 0 0
## 2 1255 0 49 0 1 0 0 -9 3 160 1 180 -9 -9 -9 0 -9 -9 0 11 16 84 0 0 0
## V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48
## 1 0 0 150 18 -9 7 172 86 200 110 140 86 0 0 0 -9 26 20 -9 -9 -9 -9 -9
## 2 0 0 -9 10 9 7 156 100 220 106 160 90 0 0 1 2 14 13 -9 -9 -9 -9 -9
## V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60 V61 V62 V63 V64 V65 V66 V67 V68 V69 V70 V71
## 1 -9 -9 -9 -9 -9 -9 12 20 84 0 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 1 1 1
## 2 -9 -9 -9 -9 -9 -9 11 20 84 1 -9 -9 2 -9 -9 -9 -9 -9 -9 -9 1 1 1
## V72 V73 V74 V75 V76
## 1 1 1 -9 -9 name
## 2 1 1 -9 -9 name
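To see why the substitution works, here is a minimal sketch on a made-up string rather than the real file. The toy layout (two records, each wrapped over three lines and ending in the literal name) is an assumption chosen to mimic the structure described above, and raw, joined and toy are illustrative names only.

```r
# Toy stand-in for the raw file: two records, each wrapped over three lines,
# with the last field of every record being the literal string "name".
raw <- "1 0 40\n1 1 0\n-9 2 name\n2 0 49\n0 1 0\n-9 3 name\n"

# Replace a newline with a space only when it directly follows a digit,
# so the newline after "name" survives as the record separator.
joined <- gsub("(\\d)\\n", "\\1 ", raw)

# Each record is now on one line, which is what read.table() expects.
toy <- read.table(text = joined, stringsAsFactors = FALSE)
dim(toy)  # 2 rows, 9 columns
```

The trick depends entirely on the last field being non-numeric; if a record ended in a digit, its trailing newline would be removed as well.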
If there doesn't happen to be a conveniently different data type at the end, you can use scan to make a vector, then split and reassemble:
# download data and split into a character vector
x <- scan(url('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data'), character())
# split and assemble data.frame
df <- data.frame(split(x, 1:76), stringsAsFactors = FALSE)
# fix types
df[] <- lapply(df, type.convert, as.is = TRUE)
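The same idea can be tried offline on a small made-up string. This is only a sketch: the 9-field layout, and the names raw, n_fields, tokens and toy, are assumptions for illustration (the real file has 76 fields per record).

```r
# Toy stand-in for the raw file: 2 records of 9 fields, wrapped over 3 lines each.
raw <- "1 0 40\n1 1 0\n-9 2 name\n2 0 49\n0 1 0\n-9 3 name\n"
n_fields <- 9  # 76 for the real file

# scan() splits on any whitespace, newlines included, so the wrapping disappears.
tokens <- scan(text = raw, what = character(), quiet = TRUE)

# The grouping vector 1:n_fields is recycled along tokens, so every token is
# labelled with its column position and split() returns one vector per column.
toy <- data.frame(split(tokens, seq_len(n_fields)), stringsAsFactors = FALSE)

# Everything is still character at this point; convert column by column.
toy[] <- lapply(toy, type.convert, as.is = TRUE)
str(toy)
```

This only works when the total number of tokens is an exact multiple of the field count; split() warns otherwise, which is a useful signal that a record is incomplete.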
or pass scan a model of the types a single row should have, so that it reads directly into a list:
x <- scan(url('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data'),
c(replicate(75, numeric()), list(character())))
df <- as.data.frame(x)
names(df) <- paste0('V', 1:76) # replace ugly names
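A small sketch of the row-model approach on made-up data follows. The 9-field toy layout and the names raw, row_model, fields and toy are assumptions for illustration; rep(list(numeric()), 8) is used here as a plain way to build the list of templates.

```r
# Toy stand-in for the raw file: 2 records of 9 fields, wrapped over 3 lines each.
raw <- "1 0 40\n1 1 0\n-9 2 name\n2 0 49\n0 1 0\n-9 3 name\n"

# One template per field: eight numeric fields followed by one character field.
row_model <- c(rep(list(numeric()), 8), list(character()))

# With a list for `what`, scan() fills the fields in turn and, by default
# (multi.line = TRUE), lets a single record continue across line breaks.
fields <- scan(text = raw, what = row_model, quiet = TRUE)

# Name the list before converting so the data frame gets clean column names.
names(fields) <- paste0("V", seq_along(fields))
toy <- as.data.frame(fields, stringsAsFactors = FALSE)
str(toy)
```

Because the types are declared up front, no type.convert pass is needed afterwards, but a value that does not match its template makes scan() stop with an error.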
If getting the type structure correct is too complicated, read everything in as character with replicate(76, character()) and use type.convert like the previous option.
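As a rough illustration of that fallback on made-up data (the 9-field layout and the names raw, all_chr, fields and toy are assumptions, not part of the real file):

```r
# Toy stand-in for the raw file: 2 records of 9 fields, wrapped over 3 lines each.
raw <- "1 0 40\n1 1 0\n-9 2 name\n2 0 49\n0 1 0\n-9 3 name\n"

# A row model in which every field is character, so nothing can fail to parse.
all_chr <- replicate(9, character())  # 76 for the real file

fields <- scan(text = raw, what = all_chr, quiet = TRUE)
names(fields) <- paste0("V", seq_along(fields))
toy <- as.data.frame(fields, stringsAsFactors = FALSE)

# Let R infer each column's type after the fact.
toy[] <- lapply(toy, type.convert, as.is = TRUE)
sapply(toy, class)
```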
Alternatively, use readLines, split to create a list with the lines belonging to each row grouped together, and paste it all back together to use read.table:
x <- readLines(url('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data'))
df <- read.table(text = paste(sapply(split(x,
rep(seq(length(x) / 10), each = 10)),
paste, collapse = ' '), collapse = '\n'))
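The grouping step is easier to follow on a few made-up lines. In this sketch each record spans 3 physical lines instead of the 10 used above; raw_lines, lines_per_record, group, records and toy are illustrative names.

```r
# Toy stand-in for readLines() output: 2 records, each spread over 3 lines.
raw_lines <- c("1 0 40", "1 1 0", "-9 2 name",
               "2 0 49", "0 1 0", "-9 3 name")
lines_per_record <- 3  # 10 in the snippet above

# Guard against a truncated file before building the grouping vector.
stopifnot(length(raw_lines) %% lines_per_record == 0)

# 1 1 1 2 2 2: consecutive lines that belong to the same record share a label.
group <- rep(seq_len(length(raw_lines) / lines_per_record),
             each = lines_per_record)

# Glue the lines of each record into a single space-separated string.
records <- vapply(split(raw_lines, group), paste, character(1), collapse = " ")

toy <- read.table(text = paste(records, collapse = "\n"),
                  stringsAsFactors = FALSE)
dim(toy)  # 2 rows, 9 columns
```

Unlike the regex version, this relies on every record occupying exactly the same number of lines, so it is worth checking the line count first, as the stopifnot call does.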