Data analysis using R/python and SSDs

Does anyone have experience using R/Python with data stored on solid-state drives? If you are doing mostly reads, in theory this should significantly improve the load times of large datasets. I want to find out whether this is true, and whether it is worth investing in SSDs to improve the I/O rates of data-intensive applications.

My 2 cents: an SSD only pays off if your applications are stored on it, not your data, and even then only if a lot of disk access is necessary, as for an OS. People are right to point you to profiling. I can tell you without doing it that almost all of the read time goes to processing, not to reading from the disk.

It pays off far more to think about the format of your data than about where it is stored. A speedup in reading your data can be obtained by using the right applications and the right format, like R's internal binary format instead of fumbling around with text files. Make that an exclamation mark: never keep fumbling around with text files. Go binary if speed is what you need.
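As a minimal sketch of what "go binary" means in practice (file names and the toy data frame are placeholders of my own; saveRDS/readRDS are the single-object counterparts of the save/load used in the experiment below):

df <- data.frame(x = runif(1e6), y = sample(letters, 1e6, replace = TRUE))

system.time(write.csv(df, "df.csv", row.names = FALSE))  # text: every value formatted as characters
system.time(saveRDS(df, "df.rds"))                       # binary: R's native serialization, no parsing

system.time(read.csv("df.csv"))                          # text read: parse and convert every field
system.time(readRDS("df.rds"))                           # binary read: restore the object directly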

Due to that overhead, it generally makes no difference whether you read your data from an SSD or a normal disk. I have both, and I use the normal disk for all my data. I juggle big datasets around sometimes and have never had a problem with it. Of course, if I have to go really heavy, I just work on our servers.

So it might make a difference when we're talking gigs and gigs of data, but even then I very much doubt that disk access is the limiting factor. Unless you're continuously reading from and writing to the disk, in which case I'd say you should start thinking again about what exactly you're doing. Instead of spending that money on SSD drives, extra memory could be the better option. Or just convince the boss to get you a decent calculation server.

A timing experiment using a bogus data frame, reading and writing in text format vs. binary format, on an SSD disk vs. a normal disk:

> tt <- 100
> longtext <- paste(rep("dqsdgfmqslkfdjiehsmlsdfkjqsefr",1000),collapse="")
> test <- data.frame(
+     X1=rep(letters,tt),
+     X2=rep(1:26,tt),
+     X3=rep(longtext,26*tt)
+ )

> SSD <- "C:/Temp" # My ssd disk with my 2 operating systems on it.
> normal <- "F:/Temp" # My normal disk, I use for data

> # Write text 
> system.time(write.table(test,file=paste(SSD,"test.txt",sep="/")))
   user  system elapsed 
   5.66    0.50    6.24 

> system.time(write.table(test,file=paste(normal,"test.txt",sep="/")))
   user  system elapsed 
   5.68    0.39    6.08 

> # Write binary
> system.time(save(test,file=paste(SSD,"test.RData",sep="/")))
   user  system elapsed 
      0       0       0 

> system.time(save(test,file=paste(normal,"test.RData",sep="/")))
   user  system elapsed 
      0       0       0 

> # Read text 
> system.time(read.table(file=paste(SSD,"test.txt",sep="/"),header=T))
   user  system elapsed 
   8.57    0.05    8.61 

> system.time(read.table(file=paste(normal,"test.txt",sep="/"),header=T))
   user  system elapsed 
   8.53    0.09    8.63 

> # Read binary
> system.time(load(file=paste(SSD,"test.RData",sep="/")))
   user  system elapsed 
      0       0       0 

> system.time(load(file=paste(normal,"test.RData",sep="/")))
   user  system elapsed 
      0       0       0 

http://www.codinghorror.com/blog/2010/09/revisiting-solid-state-hard-drives.html has a good article on SSDs, and the comments offer a lot of insights.

It depends on the type of analysis you're doing: whether it's CPU-bound or I/O-bound. Personal experience with regression modelling tells me the former is more often the case, and SSDs wouldn't be of much use then.

In short, it's best to profile your application first.
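In R, a minimal way to do that is the built-in sampling profiler Rprof (the read.table call and file names here are placeholders for your actual workload):

Rprof("profile.out")                        # start the sampling profiler
x <- read.table("test.txt", header = TRUE)  # placeholder workload
Rprof(NULL)                                 # stop profiling
summaryRprof("profile.out")$by.self         # shows whether time goes to I/O or to parsing and object creation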

Sorry, but I have to disagree with the top-rated answer by @joris. It's true that if you run that code, the binary version takes almost zero time to be written. But that's because the test set is weird: the big column 'longtext' is the same for every row. Data frames in R are smart enough not to store duplicate values more than once (via factors), and on top of that save() compresses its output by default.

So at the end we finish with a text file of about 700 MB versus a binary file of 335 K (of course the binary one is much faster):

-rw-r--r-- 1 carlos carlos 335K Jun  4 08:46 test.RData
-rw-rw-r-- 1 carlos carlos 745M Jun  4 08:46 test.txt
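You can check the deduplication claim directly in the session; this assumes the test data frame from the transcript above, built with stringsAsFactors = TRUE, which was the default before R 4.0:

str(test$X3)      # Factor w/ 1 level: the long string is stored only once
nlevels(test$X3)  # 1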

However, if we try with random data:

> longtext<-paste(sample(c(0:9, letters, LETTERS),1000*nchar('dqsdgfmqslkfdjiehsmlsdfkjqsefr'), replace=TRUE),collapse="")
> test$X3<-rep(longtext,26*tt)
> 
> system.time(write.table(test,file='test.txt'))
   user  system elapsed 
  2.119   0.476   4.723 
> system.time(save(test,file='test.RData'))
   user  system elapsed 
  0.229   0.879   3.069 

the files are not that different:

-rw-r--r-- 1 carlos carlos 745M Jun  4 08:52 test.RData
-rw-rw-r-- 1 carlos carlos 745M Jun  4 08:52 test.txt

As you can see, the elapsed time is not the sum of user + system... so the disk is the bottleneck in both cases. Yes, binary storage will always be faster, since you don't have to include semicolons, quotes, or stuff like that, but can just dump the memory object to disk.

BUT there is always a point where the disk becomes the bottleneck. My test ran on a research server where, via a NAS solution, we get disk read/write rates over 600 MB/s. If you do the same on your laptop, where it is hard to go over 50 MB/s, you'll notice the difference.
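To see which side of that line your own machine is on, you can estimate effective throughput by dividing file size by elapsed read time; a rough sketch (the file name is a placeholder, and OS caching will flatter repeated runs):

f   <- "test.RData"
mib <- file.info(f)$size / 2^20         # file size in MiB
el  <- system.time(load(f))["elapsed"]  # wall-clock read time in seconds
mib / el                                # effective read rate in MiB/s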

So, if you actually have to deal with real big data (and repeating the same thousand-character string a million times is not big data), when the binary dump of the data is over 1 GB you'll appreciate having a good disk (an SSD is a good choice) for reading input data and writing results back to disk.

I have to second John's suggestion to profile your application. My experience is that it isn't the actual data reads that are the slow part, it's the overhead of creating the programming objects to contain the data, casting from strings, memory allocation, etc.

I would strongly suggest you profile your code first, and consider using alternative libraries (like numpy) to see what improvements you can get before you invest in hardware.
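On the R side, the equivalent of "try a faster library first" would be something like data.table's fread; this is my example, not the answer's, and it assumes the test.txt file from the experiment above:

library(data.table)
system.time(df1 <- read.table("test.txt", header = TRUE))  # base R reader
system.time(df2 <- fread("test.txt"))                      # multi-threaded parser, typically much faster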

Read and write speeds for SSDs are significantly higher than for standard 7200 RPM disks (it's still worth it over a 10k RPM disk; I'm not sure how much of an improvement it is over a 15k one). So, yes, you'd get much faster data access.

The performance improvement is undeniable. Then it's a question of economics: 2 TB 7200 RPM disks are $170 apiece, while 100 GB SSDs cost $210. So if you have a lot of data, you may run into a problem.

If you read/write a lot of data, get an SSD. If the application is CPU-intensive, however, you'd benefit much more from getting a better processor.
