Arules Sequence Mining in R

I am looking to use the arulesSequences package in R. However, I have no idea how to coerce my data frame into an object that this package can work with.

Here is a toy dataset that replicates my data structure:

ids <- c(rep("X", 5), rep("Y", 5), rep("Z", 5))
seq <- rep(1:5,3)
val <- sample(LETTERS, 15, replace = TRUE)
df <- data.frame(ids, seq, val)
df

   ids seq val
1    X   1   T
2    X   2   H
3    X   3   V
4    X   4   A
5    X   5   X
6    Y   1   D
7    Y   2   B
8    Y   3   A
9    Y   4   D
10   Y   5   P
11   Z   1   Q
12   Z   2   R
13   Z   3   W
14   Z   4   W
15   Z   5   P

Any help will be greatly appreciated.

Convert every column of the data frame to a factor:

df_fact = data.frame(lapply(df,as.factor))

Build "transaction" data:

df_trans = as(df_fact, 'transactions')

Test it:

itemFrequencyPlot(df_trans, support = 0.1, cex.names=0.8)

By using read_baskets:

    read_baskets(con  = filePath.txt,
      sep = " ",
      info = c("sequenceID","eventID","SIZE"))

In practice this means exporting the created data to a text file and re-importing it through read_baskets. The info argument names the leading columns of the file, which hold the sequenceID, the eventID and an optional event size column.
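The export step might look like the following minimal sketch, which uses base R only. It assumes the toy data from the question (rebuilt here with a fixed seed), exactly one item per event so that SIZE is always 1, and a temporary file as the destination. The names `lines_out` and `basket_file` are invented for illustration.

```r
# Rebuild the toy data from the question (seed added so the run is repeatable)
set.seed(1)
df <- data.frame(
  ids = c(rep("X", 5), rep("Y", 5), rep("Z", 5)),
  seq = rep(1:5, 3),
  val = sample(LETTERS, 15, replace = TRUE),
  stringsAsFactors = FALSE
)

# One line per event, in the order: sequenceID eventID SIZE item
# Each event holds a single item here, so SIZE is the constant 1
lines_out <- paste(df$ids, df$seq, 1, df$val)

# Hypothetical output location; note that no header line is written
basket_file <- tempfile(fileext = ".txt")
writeLines(lines_out, basket_file)

# Inspect the first few lines of the exported file
head(readLines(basket_file), 3)
```

The path in `basket_file` would then take the place of the `con` argument in the read_baskets call shown above.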

It worked for me once I added what is essentially an "order" column, which lists an order ranking rather than a time value. You just have to be very specific about the naming convention. Name the "group" or "ordered basket #" variable sequenceID, and name the ranking or ordering variable eventID.

Another thing that helped me (and had me scratching my head for a long time) was that read_baskets() seemed to need me to specify:

read_baskets(con  = filePath.txt, sep = " ", info = c("sequenceID","eventID","SIZE"))

Even though the help page makes the c() details look like an optional header, they are not optional. I seemed to need to remove the header from my file and specify it in the read_baskets() call instead, or I'd run into problems.
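If the file already exists with a header row, one way to drop that row before calling read_baskets() is sketched below in base R. The file contents and the names `with_header`, `rows` and `no_header` are made up for illustration.

```r
# Hypothetical basket file that still carries a header line
with_header <- tempfile(fileext = ".txt")
writeLines(c("sequenceID eventID SIZE items",
             "1 1 1 A",
             "1 2 2 B C",
             "2 1 1 A"), with_header)

# Drop the first line and write the remaining rows to a new file
rows <- readLines(with_header)[-1]
no_header <- tempfile(fileext = ".txt")
writeLines(rows, no_header)

# The new file now starts directly with data
readLines(no_header)
```

The column names then go into the `info` argument rather than into the file itself.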

Instead of using the data frame, what worked best for me was to split the data by individual and then convert the result to transactions.

 eh$cost<-split(eh$cost$val ,eh$cost$id)
 eh$cost1<- as(eh$cost,"transactions")
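To see what the split step produces, here is a small base R sketch on made-up data. `toy` and `by_id` are illustrative names, not the `eh$cost` object from this answer. The result is a named list with one character vector per id, which is a shape that arules can coerce with `as(..., "transactions")`.

```r
# Made-up data: one row per (id, item) pair
toy <- data.frame(
  id  = c("X", "X", "X", "Y", "Y", "Z"),
  val = c("A", "B", "C", "A", "D", "B"),
  stringsAsFactors = FALSE
)

# split() groups the item column by id and returns a named list
by_id <- split(toy$val, toy$id)

# Show the structure: one element per id, each holding that id's items
str(by_id)
```

Note that this gives one basket per id, so the order of items within an id is not recorded as an eventID.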

You first have to turn your items into transactions, so just coerce the column of items:
trans = as(df[,'val'], "transactions")

Then you can add the sequence information to your transactions object:

trans@itemsetInfo$transactionID = NULL
trans@itemsetInfo$sequenceID = df$ids
trans@itemsetInfo$eventID = df$seq

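library(dplyr)
# the toy data names the id column "ids"; group by sequence and event so each row becomes one basket
df <- df %>% arrange(ids, seq) %>% group_by(ids, seq) %>% summarise(size = n(), items = list(val), .groups = "drop")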

Then write the result to a text file (this tutorial also suggests writing the data out after wrangling and then reading it back with the read_baskets function):

df$items <- as.character(df$items)
# quote = FALSE keeps quotation marks out of the IDs and item labels
write.table(df, file = "trans.txt", sep = " ", row.names = FALSE, col.names = FALSE, quote = FALSE)

Read the file back and check it:

x <- read_baskets("trans.txt", sep = " ", info = c("sequenceID","eventID","SIZE"))
as(x, "data.frame")
