
Multiple pattern matching in R over multiple files, multiple columns and rows

I have a list of CSV files to read, several of which have columns such as Title and Description. I need to write a retrieval operation over these columns, across multiple files, and match them against another CSV of popular keywords (~10k) generated by a tool similar to WordStream SEO.

What I was able to do:

# Not sure if this is the correct approach
Source1 <- read.csv(path to csv file)
Keywords_tomatch <- read.csv(path to csv file)

# Can't really take both columns into a single vector and iterate over them
subColdesc <- Source1[, c(3)]
subcolTitle <- Source1[, c(2)]
keywordget <- subset(Keywords_tomatch, grepl("*", Keywords_tomatch$col1))

# Two individual vectors, since I'm not sure whether sapply() can be
# applied over multiple lists. Definition: sapply(list, function)
descBoolean <- sapply(keywordget,
                      function(y)
                        sapply(subColdesc,
                               function(x)
                                 any(grepl(y, x))))

TitleBoolean <- sapply(keywordget,
                       function(y)
                         sapply(subcolTitle,
                                function(x)
                                  any(grepl(y, x))))

This matches just the first element in the keywordget column against the (~4k) elements in the description and title columns, and it returns a warning:

In grepl(y, x) : argument 'pattern' has length > 1 and only the first element will be used

I've tried Akrun's version of grep, and it did not work for me.

Question:

How can I match all the elements in the keywordget vector and retrieve which columns (Description, Title) matched on each row, and which rows of Description and Title matched?

In short, how can I retrieve all the game-related products in Source1 using Keywords_tomatch?

As a sample, I'm posting the two files I've gathered. Source1 contains only a few of the 4k rows.

Source1 = 1.csv, Keywords_tomatch = Gaming.csv

First, let me point out some possible reasons why your code is not working (and correct me if I am wrong):

  • Your files are probably read with stringsAsFactors = TRUE (the default before R 4.0.0), so the text columns are factors. grepl coerces both its pattern and its x argument with as.character, which would explain why you did not get an error about factors. Even so, reading the files with stringsAsFactors = FALSE keeps the columns as plain character vectors.

  • You need the fixed = TRUE argument for grepl, or else it will treat each keyword as a regular expression rather than as literal text.

  • Your keywordget is a data frame, and R treats a data frame as a list of columns when it is used as one. Since the first argument of sapply takes a list, it treats keywordget as a list with 1 element. When this element (which is essentially the entire vector of keywords) is supplied to the pattern argument of grepl, you get the warning: In grepl(y, x) : argument 'pattern' has length > 1 and only the first element will be used

For example, this should work:

sapply(keywordget$GAMING, function(y) {
  sapply(source1$title, function(x) {
    any(grepl(y,x, fixed = TRUE))
  })
})
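The points about data frames and fixed = TRUE can be checked on toy data. This is only a minimal sketch: the `kw` data frame and the `titles` vector are invented for illustration and are not taken from the posted files.

```r
# Toy data, made up for illustration (not the posted CSV files)
kw <- data.frame(GAMING = c("xbox", "3.0"), stringsAsFactors = FALSE)
titles <- c("xbox 360 controller", "nikon d300 camera", "usb 3.0 hub")

# Iterating over the data frame visits its columns, so the function is
# called once and receives the whole GAMING column (length 2)
per_column <- sapply(kw, length)

# Iterating over the column visits one keyword at a time (length 1 each)
per_keyword <- sapply(kw$GAMING, length)

# As a regular expression, "." matches any character, so "3.0" also
# matches "360" and "300"
regex_hits <- grepl("3.0", titles)

# With fixed = TRUE the keyword is matched literally
literal_hits <- grepl("3.0", titles, fixed = TRUE)

print(per_column)
print(per_keyword)
print(regex_hits)
print(literal_hits)
```

Here only the last title contains the literal text "3.0", which is why fixed = TRUE is usually the safer choice for keyword lists that were not written as regular expressions.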

Below is my solution:

# Read files
source1 = read.csv("source1.csv", stringsAsFactors = FALSE)
keys = read.csv("gaming.csv", stringsAsFactors = FALSE)

# Finds the index of elements in source1 that matches 
# with any of the keys
matchIndex = lapply(source1, function(x){
  which(Reduce(`|`, lapply(keys$GAMING, grepl, x, fixed = TRUE)))
})

> matchIndex
$title
integer(0)

$description
[1] 189 293 382 402 456

title has zero matches and description has 5.
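To see how the Reduce(`|`, lapply(...)) step arrives at those indices, here is a small sketch on made-up data. The names `keys_demo` and `desc_demo` are hypothetical and stand in for keys$GAMING and one column of source1.

```r
# Made-up stand-ins for keys$GAMING and a single text column
keys_demo <- c("game", "xbox")
desc_demo <- c("a board game for kids", "office chair",
               "xbox live card", "xbox game bundle")

# One logical vector per keyword, each as long as desc_demo.
# lapply passes each keyword as the first argument (pattern) of grepl.
per_key <- lapply(keys_demo, grepl, desc_demo, fixed = TRUE)

# Element-wise OR across the keywords: TRUE where any keyword matched
any_key <- Reduce(`|`, per_key)

# Row indices of the matching elements
matched <- which(any_key)
print(matched)
```

Rows 1, 3 and 4 contain at least one keyword, so those are the indices returned. The same computation runs once per column when it is wrapped in lapply(source1, ...).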

# Returns the descriptions that match
source1$description[matchIndex$description]

# Returns the title corresponding to the descriptions that match
source1$title[matchIndex$description]

> source1$title[matchIndex$description]
[1] "tomb raider: legend"                     
[2] "namco museum 50th anniversary collection"
[3] "restricted area"                         
[4] "south park chef's luv shack"             
[5] "brainfood games cranium collection 2006" 
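The question also asks which columns and which keywords matched on each row. One possible way to get that, sketched under assumptions, is to keep a rows-by-keywords logical matrix per column instead of collapsing it right away. The data and the helper `match_matrix` below are hypothetical and only illustrate the idea.

```r
# Made-up data with the same column names as source1
src_demo <- data.frame(
  title = c("tomb raider", "desk lamp", "chess set"),
  description = c("action game for ps2", "led lamp", "classic board game"),
  stringsAsFactors = FALSE
)
keys_demo <- c("game", "ps2", "raider")

# Hypothetical helper: rows x keywords logical matrix for one text column
match_matrix <- function(text, keys) {
  m <- vapply(keys, function(k) grepl(k, text, fixed = TRUE),
              logical(length(text)))
  # Force a matrix even when text has a single element
  matrix(m, nrow = length(text), dimnames = list(NULL, keys))
}

by_column <- lapply(src_demo[c("title", "description")],
                    match_matrix, keys = keys_demo)

# Rows x columns: did any keyword match in this column?
cols_hit <- sapply(by_column, function(m) rowSums(m) > 0)

# Rows where at least one column matched
matched_rows <- which(rowSums(cols_hit) > 0)

# Keywords found in the description of each row
desc_keys <- lapply(seq_len(nrow(src_demo)),
                    function(i) keys_demo[by_column$description[i, ]])

print(cols_hit)
print(matched_rows)
print(desc_keys)
```

With roughly 4k rows and 10k keywords each matrix holds about 40 million logical values, so on the full data it may be preferable to process one column or one file at a time.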
