
R - Efficient data cleansing with custom functions

I am working on cleaning a dataset composed of 1M names. The cleaning is done by a function that includes around 40 gsub calls such as name=gsub("Johnmichael", "John Michael",name,ignore.case=TRUE) and name=gsub("Mihcael", "Michael",name,ignore.case=TRUE).

I am currently using the cleaning function directly, like this:

contacts$first_name=clean_name(contacts$first_name)
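For illustration, a minimal sketch of what such a function might look like. Only the two substitutions quoted above are included; the rest of the real clean_name is not shown here, and the sample data frame is made up.

```r
# Hypothetical sketch of clean_name(); only the two substitutions quoted
# in the question are included, the remaining ones are not shown there.
clean_name <- function(name) {
  # gsub() is vectorized: each call processes the entire character vector.
  name <- gsub("Johnmichael", "John Michael", name, ignore.case = TRUE)
  name <- gsub("Mihcael", "Michael", name, ignore.case = TRUE)
  name
}

# Made-up sample data standing in for the real 1M-row data frame.
contacts <- data.frame(
  first_name = c("Johnmichael", "mihcael", "Anna"),
  stringsAsFactors = FALSE
)
contacts$first_name <- clean_name(contacts$first_name)
print(contacts$first_name)
```

Because every gsub call already works on the full vector, the function makes one complete pass over all names per substitution.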

My issue is that my code is very slow, since it applies each substitution to the whole vector, one at a time. I am trying to find a way to run the function in parallel for each string. I have tried sapply, but I do not see any improvement. Any advice?


Install OpenBLAS for R on Windows x64

Open the URL http://sourceforge.net/projects/openblas/files/

Open the latest version folder.

Download OpenBLAS-v0.2.13-Win64-int32.zip and mingw64_dll.zip.

Unpack "OpenBLAS-v0.2.13-Win64-int32.zip", find "libopenblas.dll", and rename it to "Rblas.dll". Copy the file to a path like "\R\R-3.1.2\bin\x64" (remember to back up the original file first).

Unpack "mingw64_dll.zip" and copy all the DLLs to the same path, "\R\R-3.1.2\bin\x64".
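One way to check whether the swap had any effect is to time a BLAS-heavy operation before and after replacing Rblas.dll. This is a minimal sketch using base R only; the matrix size is an arbitrary choice for illustration and is not part of the steps above.

```r
# Time a dense matrix cross-product, an operation R hands off to BLAS.
set.seed(1)
n <- 500                               # arbitrary size, for illustration only
m <- matrix(rnorm(n * n), nrow = n)

# Run this once before and once after replacing Rblas.dll, then compare.
timing <- system.time(result <- crossprod(m))
print(timing[["elapsed"]])
```

If the elapsed time drops noticeably after restarting R with the new DLLs in place, the replacement BLAS is being used.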
