Market Basket Analysis in R with Hadoop

I'm trying to find a fast way to run an affinity analysis on transactional market basket data with a few million rows.

What I've done so far:

  • Created an R Server on top of Spark & Hadoop in the cloud (Azure HDInsight)
  • Loaded the data onto HDFS
  • Got started with RevoScaleR

However, I got stuck at the last step. As far as I understand, I won't be able to process the data with any function that RevoScaleR itself does not provide.

Here is the code for accessing the data on HDFS:

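# Root directory of the data on HDFS
bigDataDirRoot <- "/basket"
# Run RevoScaleR computations on the Spark cluster
mySparkCluster <- RxSpark(consoleOutput = TRUE)
rxSetComputeContext(mySparkCluster)
# myNameNode and myPort are set elsewhere for this cluster
hdfsFS <- RxHdfsFileSystem(hostName = myNameNode, port = myPort)
inputFile <- file.path(bigDataDirRoot, "gunluk")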

So my inputFile is a CSV in Azure Blob storage, already created at /basket/gunluk

gunluk_data <- RxTextData(file = inputFile, returnDataFrame = TRUE, fileSystem = hdfsFS)

After running this, I am able to see the data using head(gunluk_data).

How can I use gunluk_data with functions from the arules package? Is this possible?

If not, is it possible to process a CSV file stored in HDFS using regular R packages (i.e. arules)?

In arules, you can use read.transactions to read the data from a file and write.PMML to write out the rules/itemsets.
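As a rough illustration of that suggestion, here is a minimal sketch of read.transactions followed by apriori. It assumes the CSV is reachable as an ordinary file and holds one (transaction id, item) pair per row. The sample rows, the column names order_id and item, and the support and confidence thresholds are all made up for the example and are not taken from the question's data.

```r
library(arules)

# Hypothetical sample data in "single" format:
# one (transaction id, item) pair per row, with a header line.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "order_id,item",
  "1,bread", "1,milk",
  "2,bread", "2,butter",
  "3,bread", "3,milk", "3,butter",
  "4,milk"
), csv_path)

# Read the file directly into a sparse transactions object.
# cols gives the positions of the transaction id and item columns.
trans <- read.transactions(csv_path, format = "single", header = TRUE,
                           sep = ",", cols = c(1, 2))
summary(trans)

# Mine association rules; the thresholds are illustrative only.
rules <- apriori(trans,
                 parameter = list(support = 0.25, confidence = 0.6, minlen = 2))
inspect(sort(rules, by = "lift"))

# Optional export (needs the pmml package to be installed):
# write.PMML(rules, file = "rules.xml")
```

For a file that lives in HDFS, this sketch assumes it has first been made available on the local file system of the node running R; how best to do that depends on the cluster setup.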
