Matching multiple patterns across multiple files, columns and rows in R

Question

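I have a list of CSV files that I need to read, several of which contain columns such as title and description. Starting from these columns across the files, I need to write a retrieval step that matches them against another CSV of popular keywords (~10k) generated by a tool similar to WordStream SEO.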

What I have been able to do

# Not sure if this is the correct approach
Source1 <- read.csv("path to csv file")
Keywords_tomatch <- read.csv("path to csv file")

# Can't really take both columns into a single vector and iterate over them
subColdesc <- Source1[, c(3)]
subcolTitle <- Source1[, c(2)]
keywordget <- subset(Keywords_tomatch, grepl("*", Keywords_tomatch$col1))

# Two individual vectors, since I'm not sure whether sapply() can be
# applied over multiple lists. Definition: sapply(list, function)
descBoolean <- sapply(keywordget,
                      function(y)
                        sapply(subColdesc,
                               function(x)
                                 any(grepl(y, x))))

TitleBoolean <- sapply(keywordget,
                       function(y)
                         sapply(subcolTitle,
                                function(x)
                                  any(grepl(y, x))))

# This matches just the first element of the keywordget column against the
# (~4k) elements in the description and title columns, i.e. it returns a warning:

In grepl(y, x) : argument 'pattern' has length > 1 and only the first element will be used

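I have already tried Akrun's grep version, and it did not work for me.

Question:

How do I match every element of the keyword vector against the description and title columns, and retrieve each matching row together with which column matched on it (description, title, or both)?

In short, how can I use Keywords_tomatch to retrieve all gaming-related products in Source1?

As an example, I am posting two of the files I collected. Source1 contains only about 4k rows.

Source1 = 1.csv, Keywords_tomatch = Gaming.csv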

Answer 1

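First, let me point out some possible reasons why your code is not working (and correct me if I am wrong):

1. Your files were read with stringsAsFactors = TRUE, and as far as I know grepl does not recognise factor variables for its pattern = argument. Since you did not get an error about grepl not recognising factors, I assume you converted them to character before matching.
2. You need the fixed = TRUE argument of grepl; otherwise it treats each keyword as a regular expression.
3. Your keywordget is a data frame, and R treats a data frame as a list when it is used as one. Since sapply iterates over its first argument as a list, it sees keywordget as a list with 1 element. When that element (which is actually the entire keyword vector) is passed to the pattern argument of grepl, you get the warning: In grepl(y, x) : argument 'pattern' has length > 1 and only the first element will be used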
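
The third point can be checked on a toy data frame. This is a minimal sketch, assuming a single keyword column named GAMING as in the code further down; the keyword values themselves are made up for illustration.

```r
# Hypothetical keyword table with one column, mirroring keywordget
keywordget <- data.frame(
  GAMING = c("xbox", "playstation", "nintendo"),
  stringsAsFactors = FALSE
)

# A data frame is a list of columns, so this one has a single element
n_elements <- length(keywordget)

# sapply() over the data frame calls the function once,
# and y is the whole column (all three keywords at once)
calls_on_df <- sapply(keywordget, function(y) length(y))

# sapply() over the column calls the function once per keyword,
# and y is a single string each time
calls_on_col <- sapply(keywordget$GAMING, function(y) length(y))

print(n_elements)
print(calls_on_df)
print(calls_on_col)
```

Iterating over keywordget$GAMING rather than keywordget is what gives grepl one pattern per call.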

For example, this should work:

sapply(keywordget$GAMING, function(y) {
  sapply(source1$title, function(x) {
    any(grepl(y,x, fixed = TRUE))
  })
})
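
To see why fixed = TRUE matters, here is a small sketch with a made-up keyword that happens to contain a regex metacharacter. The keyword and texts are illustrative only and do not come from the posted files.

```r
# Hypothetical keyword containing the regex metacharacter "."
keyword <- "a.c"
texts <- c("abc", "a.c")

# As a regular expression, "." matches any single character,
# so both texts match
regex_hits <- grepl(keyword, texts)

# With fixed = TRUE the keyword is matched literally,
# so only the text containing "a.c" matches
fixed_hits <- grepl(keyword, texts, fixed = TRUE)

print(regex_hits)  # TRUE TRUE
print(fixed_hits)  # FALSE TRUE
```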

Here is my solution:

# Read files
source1 = read.csv("source1.csv", stringsAsFactors = FALSE)
keys = read.csv("gaming.csv", stringsAsFactors = FALSE)

# For each column of source1, find the indices of the elements
# that match any of the keys
matchIndex = lapply(source1, function(x){
  which(Reduce(`|`, lapply(keys$GAMING, grepl, x, fixed = TRUE)))
})

> matchIndex
$title
integer(0)

$description
[1] 189 293 382 402 456

title has no matches; description has 5.
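
The Reduce(`|`, lapply(...)) step can be unpacked on toy data. This is a sketch with invented rows and keys (only the column names title, description and GAMING follow the code above), so the indices differ from the real output.

```r
# Toy stand-ins for source1.csv and gaming.csv (values are made up)
source1 <- data.frame(
  title = c("tomb raider: legend", "office suite", "chess master"),
  description = c("action adventure video game",
                  "spreadsheet software",
                  "classic board game"),
  stringsAsFactors = FALSE
)
keys <- data.frame(GAMING = c("video game", "board game"),
                   stringsAsFactors = FALSE)

# One logical vector per key: does each description contain that key?
per_key <- lapply(keys$GAMING, grepl, source1$description, fixed = TRUE)

# Element-wise OR across keys: does each description contain any key?
any_key <- Reduce(`|`, per_key)

# Row numbers of the matching descriptions
desc_index <- which(any_key)

# The same steps applied to every column of source1
matchIndex <- lapply(source1, function(x) {
  which(Reduce(`|`, lapply(keys$GAMING, grepl, x, fixed = TRUE)))
})

print(per_key)
print(any_key)
print(matchIndex)
```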

# Returns the descriptions that match
source1$description[matchIndex$description]

# Returns the title corresponding to the descriptions that match
source1$title[matchIndex$description]

> source1$title[matchIndex$description]
[1] "tomb raider: legend"                     
[2] "namco museum 50th anniversary collection"
[3] "restricted area"                         
[4] "south park chef's luv shack"             
[5] "brainfood games cranium collection 2006"
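
The question also asks which column matched on each row. One possible way to get that, sketched here on made-up data, is to keep the logical vectors instead of the indices and bind them to the rows. The helper matches_any_key and the *_match column names are hypothetical, not part of the original answer.

```r
# Toy stand-ins for source1.csv and gaming.csv (values are made up)
source1 <- data.frame(
  title = c("tomb raider: legend", "office suite", "board game night"),
  description = c("action adventure video game",
                  "spreadsheet software",
                  "party supplies"),
  stringsAsFactors = FALSE
)
keys <- data.frame(GAMING = c("video game", "board game"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: TRUE where a text contains at least one key,
# using literal (fixed) matching
matches_any_key <- function(x, patterns) {
  Reduce(`|`, lapply(patterns, grepl, x, fixed = TRUE))
}

# One logical column per searched column of source1
hits <- as.data.frame(
  lapply(source1[c("title", "description")], matches_any_key,
         patterns = keys$GAMING)
)
names(hits) <- paste0(names(hits), "_match")

# Keep the rows where at least one column matched, with the flags attached
result <- cbind(source1, hits)[rowSums(hits) > 0, ]
print(result)
```

Each row of result then carries title_match and description_match flags, so rows that matched in both columns can be told apart from rows that matched in only one.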

Solution 1
0 2016-10-21 19:31:45