Dplyr speed when connected to Monetdblite/RSQLite database on external hard drive?

A beginner's question.

I'm using R with dplyr to analyse large amounts of data, but I don't have access to a server-based database. In addition, my computer's internal hard drive is too small for the databases that I need to create. So far I have been using MonetDBLite and RSQLite to store the data.

Q: How much does the speed of MonetDBLite / RSQLite decrease if I save the databases on an external hard drive and connect it to the computer via USB? What factors determine how feasible this is?

Or is there a better alternative approach (still relying on dplyr's database connectivity) in my situation?

It's really hard to tell in advance whether the external drive is slower. For example, if the internal drive is an SSD and the external one a classical "spinning disk", a performance drop is more or less to be expected, especially with complex queries. I suggest you simply try a reasonably sized database and your own queries on both disks. There are also various disk performance checking tools (e.g. XBench on OS X) that you could use. The interesting metrics to look for are sequential scan speed and random access speed.
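The "try it on both disks" suggestion can be sketched roughly as below. This is only a minimal illustration using DBI with RSQLite: the table name `measurements`, the helper functions `time_query` and `make_test_db`, and both file paths are made up for the example. Both paths point into `tempdir()` so the code runs as written; to get a meaningful comparison, `external_db` would have to point at a file on the external drive, and the test table would be replaced by your real data and queries.

```r
library(DBI)

# Time a single query against the SQLite file at `db_path` (elapsed seconds).
time_query <- function(db_path, sql) {
  con <- dbConnect(RSQLite::SQLite(), db_path)
  on.exit(dbDisconnect(con), add = TRUE)
  system.time(dbGetQuery(con, sql))[["elapsed"]]
}

# Build a small throwaway table so the sketch runs as written.
# (Hypothetical stand-in for a real, much larger database.)
make_test_db <- function(db_path, n = 1e5) {
  con <- dbConnect(RSQLite::SQLite(), db_path)
  on.exit(dbDisconnect(con), add = TRUE)
  dbWriteTable(con, "measurements",
               data.frame(id = seq_len(n), value = runif(n)),
               overwrite = TRUE)
  invisible(db_path)
}

# Placeholder paths: replace `external_db` with a path on the external drive.
internal_db <- file.path(tempdir(), "internal.sqlite")
external_db <- file.path(tempdir(), "external.sqlite")

make_test_db(internal_db)
make_test_db(external_db)

sql <- "SELECT COUNT(*) AS n, AVG(value) AS mean_value FROM measurements"
timings <- c(internal = time_query(internal_db, sql),
             external = time_query(external_db, sql))
print(timings)
```

Keep in mind that the operating system usually caches recently read file pages, so repeated runs of the same query can look much faster than the first one. The first run after plugging in the drive is probably the more honest number.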

I use MonetDBLite to load large datasets into RStudio. For security reasons, I keep the data on an external SSD connected via USB 3.0; my built-in drive is also an SSD. I've used this setup for a few months, and my experience is summarized by the following query:

SELECT * FROM drug_db WHERE atc='L02BX03' OR atc='L02BB04';

On the built-in drive: < 2 seconds

On the external drive: 6-7 minutes

The query scans through a ~15 GB database and returns ~30,000 rows of 14 variables. In my experience, it's actually much quicker to copy the file to the built-in drive and run the queries there than to run the queries against the external SSD.
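The copy-then-query workflow could look something like the sketch below. It is an illustration under assumptions, not my exact setup: it uses RSQLite (a single-file database) rather than MonetDBLite, which as far as I know keeps its data in a directory and would need a recursive copy instead. The paths are placeholders in `tempdir()`, and the four-row `drug_db` table is a made-up stand-in so the code runs end to end. The dplyr part needs the dbplyr package installed.

```r
library(DBI)
library(dplyr)

# Hypothetical location on the external drive; a temp file stands in here.
external_db <- file.path(tempdir(), "drug_external.sqlite")

# Create a tiny stand-in table so the sketch runs end to end.
con <- dbConnect(RSQLite::SQLite(), external_db)
dbWriteTable(con, "drug_db",
             data.frame(id = 1:4,
                        atc = c("L02BX03", "L02BB04", "A10BA02", "L02BX03"),
                        stringsAsFactors = FALSE),
             overwrite = TRUE)
dbDisconnect(con)

# Copy the database file to fast local storage, then query the copy.
local_db <- file.path(tempdir(), "drug_local.sqlite")
copied <- file.copy(external_db, local_db, overwrite = TRUE)
stopifnot(copied)

con <- dbConnect(RSQLite::SQLite(), local_db)
# dplyr equivalent of: WHERE atc='L02BX03' OR atc='L02BB04'
result <- tbl(con, "drug_db") %>%
  filter(atc %in% c("L02BX03", "L02BB04")) %>%
  collect()
dbDisconnect(con)

print(result)
```

This only helps if the local drive has room for the copy, which may mean copying one database at a time and deleting it afterwards.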
