R阅读日期中的read.csv不同

Question

I have two very similar csv files. 我有两个非常相似的csv文件。 Stock prices for 2 different stocks downloaded from the same source in the same format. 从相同来源以相同格式下载的2种不同股票的股票价格。 However, read.csv in R is reading them differently. 但是，R中的read.csv读取它们的方式有所不同。

> tab1=read.csv(path1)
> tab2=read.csv(path2)

> head(tab1)
        Date   Open   High    Low  Close  Volume Adj.Close
1 2014-12-01 158.35 162.92 157.12 157.12 2719100  156.1488
2 2014-11-03 153.14 160.86 152.98 160.09 2243400  159.1004
3 2014-10-01 141.16 154.44 130.60 153.77 3825900  152.0036
4 2014-09-02 143.30 147.87 140.66 141.68 2592900  140.0525
5 2014-08-01 140.15 145.39 138.43 144.00 2027100  142.3459
6 2014-07-01 143.41 146.43 140.60 140.89 2131100  138.4461

> head(tab2)
       Date  Open  High   Low Close  Volume Adj.Close
1 12/1/2014 73.39 75.20 71.75 72.29 1561400  71.92211
2 11/3/2014 69.28 74.92 67.88 73.74 1421600  72.97650
3 10/1/2014 66.18 74.95 63.42 69.21 1775400  68.49341
4  9/2/2014 68.34 68.57 65.49 66.32 1249200  65.63333
5  8/1/2014 67.45 68.99 65.88 68.26 1655400  67.20743
6  7/1/2014 64.07 69.50 63.09 67.46 1733600  66.41976

If I try to use colClasses in read.csv then the dates for the second table are read incorrectly. 如果我尝试在read.csv中使用colClasses，那么第二张表的日期将被错误地读取。

> tab1=read.csv(path1,colClasses=c("Date",rep("numeric",6)))
> tab2=read.csv(path2,colClasses=c("Date",rep("numeric",6)))

> head(tab1)
        Date   Open   High    Low  Close  Volume Adj.Close
1 2014-12-01 158.35 162.92 157.12 157.12 2719100  156.1488
2 2014-11-03 153.14 160.86 152.98 160.09 2243400  159.1004
3 2014-10-01 141.16 154.44 130.60 153.77 3825900  152.0036
4 2014-09-02 143.30 147.87 140.66 141.68 2592900  140.0525
5 2014-08-01 140.15 145.39 138.43 144.00 2027100  142.3459
6 2014-07-01 143.41 146.43 140.60 140.89 2131100  138.4461

> head(tab2)
        Date  Open  High   Low Close  Volume Adj.Close
1 0012-01-20 73.39 75.20 71.75 72.29 1561400  71.92211
2 0011-03-20 69.28 74.92 67.88 73.74 1421600  72.97650
3 0010-01-20 66.18 74.95 63.42 69.21 1775400  68.49341
4 0009-02-20 68.34 68.57 65.49 66.32 1249200  65.63333
5 0008-01-20 67.45 68.99 65.88 68.26 1655400  67.20743
6 0007-01-20 64.07 69.50 63.09 67.46 1733600  66.41976

Not sure how I can make this issue reproducible without attaching the .csv files. 不知道如何在不附加.csv文件的情况下使此问题可重现。 I'm attaching snapshots of the two files. 我要附加两个文件的快照。 Any help will be appreciated. 任何帮助将不胜感激。 TAB1 TAB2

Thanks 谢谢

Answer 1

This can be solved by reading in the dates as a character vector and then calling strptime() inside transform() : 这可以通过将日期作为字符向量读取，然后在transform()内调用strptime()来解决：

transform(read.csv(path2,colClasses=c('character',rep('numeric',6))),Date=as.Date(strptime(Date,'%m/%d/%Y')));
##         Date  Open  High   Low Close  Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400  71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600  72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400  68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200  65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400  67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600  66.41976

Edit: You can try to "detect" the date format dynamically using your own assumptions, but this will only be as reliable as your assumptions: 编辑：您可以尝试使用自己的假设动态地“检测”日期格式，但这只会与您的假设一样可靠：

readStockData <- function(path) {
    tab <- read.csv(path,colClasses=c('character',rep('numeric',6)));
    tab$Date <- as.Date(tab$Date,if (grepl('^\\d+/\\d+/\\d+$',tab$Date[1])) '%m/%d/%Y' else '%Y-%m-%d');
    tab;
};
readStockData(path1);
##         Date   Open   High    Low  Close  Volume Adj.Close
## 1 2014-12-01 158.35 162.92 157.12 157.12 2719100  156.1488
## 2 2014-11-03 153.14 160.86 152.98 160.09 2243400  159.1004
## 3 2014-10-01 141.16 154.44 130.60 153.77 3825900  152.0036
## 4 2014-09-02 143.30 147.87 140.66 141.68 2592900  140.0525
## 5 2014-08-01 140.15 145.39 138.43 144.00 2027100  142.3459
## 6 2014-07-01 143.41 146.43 140.60 140.89 2131100  138.4461
readStockData(path2);
##         Date  Open  High   Low Close  Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400  71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600  72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400  68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200  65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400  67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600  66.41976

In the above I've made the assumption that there is at least one record in the file and that all records use the same Date format, thus the first Date value ( tab$Date[1] ) can be used for the detection. 在上面的假设中，文件中至少有一条记录，并且所有记录都使用相同的Date格式，因此可以将第一个Date值（ tab$Date[1] ）用于检测。

R阅读日期中的read.csv不同

问题描述

1 个解决方案

解决方案1
1 已采纳 2015-04-28 19:59:08

R阅读日期中的read.csv不同

问题描述

1 个解决方案

解决方案1 1 已采纳 2015-04-28 19:59:08

解决方案1
1 已采纳 2015-04-28 19:59:08