
Iterating through an array and searching for each item in the array in a file

I don't know if I'm even asking this question the right way, but I want to search through a log file and look for each word in an array. At this point, I've asked the user to drag the file in question into the terminal, then built an array out of their inputs. The program should print out every line in which a word is found.

Once I get that working I'll format the output, add a counter, or make a little summary of what I found in the file, etc.

Here's what I've got so far; only when I run it, it doesn't actually find any words. I've been looking through re usage examples, but I think that may be overly complicated for what I have in mind:

def wordsToFind():
    needsWords = True
    searchArray = []
    print "Add words to search ('done') to save/continue."
    while needsWords == True:
        word = raw_input("Enter a search word: ")
        if word.lower() == "done":
            needsWords = False
            break
        else:
            searchArray.append(word)
            print word + " added"
    return searchArray

def getFile():
    file_to_read = raw_input("Drag file here:").strip()
    return file_to_read

def main():
    filePath = getFile()
    searchArray = wordsToFind()
    print "Words searched for: ", searchArray
    searchCount = []

    with open(filePath, "r") as inFile:
        for line in inFile:
            for item in searchArray:
                if item in line:
                    print item


main()

Obviously, any suggestions for optimization or recommendations for better Python coding are strongly welcomed here. I only know what I know, and appreciate all the help!

This is exactly the kind of problem that map-reduce is intended to solve. In case you are not familiar, map-reduce is a simple, two-step process. Suppose you have a list storing the words you are interested in finding in a text. Your mapper function can iterate through this list of words for each line of the text, and if a word appears in the line, it returns a value, say, ['word', lineNum], which is stored in a results list. The mapper is essentially a wrapper over a for loop. You can then take your results list and "reduce" it by writing a reducer function, which in this case could turn a results list that looks like [['word1', 1] ... ['word1', n] ...] into an object that looks like {'word1': [1, 2, 5], 'word3': [7], ...}.

This approach is advantageous because you abstract the process of iterating over lists while performing a common action on each item, and should your analysis needs change (as they often do), you only need to change your mapper/reducer functions without touching the rest of the code. Additionally, this method is highly parallelizable, should that ever become an issue (just ask Google!).

Python 3.x provides map() as a built-in and reduce() in the functools module; look them up in the Python docs. So you can see how they work, I implemented a version of map/reduce based on your problem without using the built-in functions. Since you didn't specify how your data was stored, I made a couple of assumptions about it, namely that the list of words of interest was to be given as a comma-separated file. To read the text files, I used readlines() to get an array of lines, and a regular expression pattern to split the lines into words (namely, split on anything that isn't an alphanumeric character). Of course, this might not suit your needs, so you can change it to whatever makes sense for the files you're looking at.
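As a quick aside, here is a minimal sketch (using made-up sample data) of the same word-to-line-numbers pipeline expressed with the built-in map() and functools.reduce(), so you can see the shape of the two steps before reading the hand-rolled version below:

```python
from functools import reduce  # in Python 3, reduce() lives in functools

words = ["error", "warning"]
lines = ["fatal error in module", "all clear", "warning: low disk"]

def find_lines(word):
    # Map step: pair each word with the 1-based line numbers containing it
    return (word, [i + 1 for i, line in enumerate(lines) if word in line])

def collect(acc, pair):
    # Reduce step: fold the (word, line-numbers) pairs into one dict
    acc[pair[0]] = pair[1]
    return acc

summary = reduce(collect, map(find_lines, words), {})
print(summary)  # {'error': [1], 'warning': [3]}
```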

I tried to stay away from the esoteric Python features (no lambdas!), so hopefully the implementation is clear. One last note: I used a loop to iterate over the lines of the text file, and a map function to iterate over the list of words of interest. You could use nested map functions instead, but I wanted to keep track of the loop index (since you care about line numbers). If you really want to nest the map functions, you could store your lines as (line, line number) tuples when you read the file, or you can modify the map function to return the index; your choice.

I hope this helps!

    #!/usr/bin/env python

    #Regexp library
    import re

    #Map
    #This function returns a new array containing
    #the elements after that have been modified by whatever function we passed in.
    def mapper(function, sequence):

        #List to store the results of the map operation
        result = []

        #Iterate over each item in sequence, append the values to the results list
        #after they have been modified by the "function" supplied as an argument in the
        #mapper function call.
        for item in sequence:
            result.append(function(item))

        return result

    #Reduce
    #The purpose of the reduce function is to go through an array, and combine the items 
    #according to a specified function - this specified function should combine an element 
    #with a base value
    def reducer(function, sequence, base_value):

        #Need to get an base value to serve as the starting point for the construction of 
        #the result
        #I will assume one is given, but in most cases you should include extra validation 
        #here to either ensure one is given, or some sensible default is chosen

        #Initialize our accumulative value object with the base value
        accum_value = base_value

        #Iterate through the sequence items, applying the "function" provided, and 
        #storing the results in the accum_value object
        for item in sequence:
            accum_value = function(item, accum_value)

        return accum_value

    #With these functions it should be sufficient to address your problem, what remains 
    #is simply to get the data from the text files, and keep track of the lines in 
    #which words appear
    if __name__ == '__main__':

        word_list_file = 'FILEPATH GOES HERE'

        #Read in a file containing the words that will be searched in the text file 
        #(assumes words are given as a comma separated list)
        infile = open(word_list_file, 'rt')    #Open file
        content = infile.read()     #read the whole file as a single string
        word_list = content.split(',')  #split the string into an array of words
        infile.close()

        target_text_file = 'FILEPATH GOES HERE'

        #Read in the text to analyze
        infile = open(target_text_file, 'rt')   #Open file
        target_text_lines = infile.readlines()    #Read the whole file as an array of lines
        infile.close()

        #With the data loaded, the overall strategy will be to loop over the text lines, and 
        #we will use the map function to loop over the word_list and see if those words are in 
        #the current text file line

        #First, define the my_mapper function that will process your data, and will be passed to
        #the map function
        def my_mapper(item):

            #Split the current sentence into words
            #Will split on any non alpha-numeric character. This strategy can be revised 
            #to find matches to a regular expression pattern based on the words in the 
            #words list. Either way, make sure you choose a sensible strategy to do this.
            current_line_words = re.split(r'\W+', target_text_lines[k])

            #lowercase the words
            current_line_words = [word.lower() for word in current_line_words]

            #Check if the current item (word) is in the current_line_words list, and if so,
            #return the word and the line number
            if item in current_line_words:
                return [item, k+1]    #Return k+1 because k begins at 0, but I assume line
                                      #counting begins with 1?
            else:
                return []   #Technically, this does not need to be added, it can simply 
                            #return None by default, but that requires manually handling iterator 
                            #objects so the loop doesn't crash when seeing the None values, 
                            #and I am being lazy :D

        #With the mapper function established, we can proceed to loop over the text lines of the 
        #array, and use our map function to process the lines against the list of words.

        #This array will store the results of the map operation
        map_output = []

        #Loop over text file lines, use mapper to find which words are in which lines, store 
        #in map_output list. This is the exciting stuff!
        for k in range(len(target_text_lines)):
            map_output.extend(mapper(my_mapper, word_list))

        #At this point, we should have a list of lists containing the words and the lines they 
        #appeared in, and it should look like [['word1', 1] ... ['word25', 5] ... [] ...]
        #As you can see, the post-map array will have an entry for each word that appeared in 
        #each line, and if a particular word did not appear in a particular line, there will be a
        #empty list instead.

        #Now all that remains is to summarize our data, and that is what the reduce function is 
        #for. We will iterate over the map_output list, and collect the words and which lines 
        #they appear at in an object that will have the format { 'word': [n1, n2, ...] }, where 
        #n1, n2, ... are the lines the word appears in. As in the case for the mapper
        #function, the output of the reduce function can be modified in the my_reducer function 
        #you supply to it. If you'd rather it return something else (like say, word count), this
        #is the function to modify.

        def my_reducer(item, accum_value):
            #First, verify item is not empty
            if item != []:
                #If the word already exists in the output object, append the current line 
                #number to its list (avoiding duplicates); if not, add it to the object with 
                #a new list holding the current line value
                if item[0] in accum_value:
                    if item[1] not in accum_value[item[0]]:
                        accum_value[item[0]].append(item[1])
                else:
                    accum_value[item[0]] = [item[1]]

            return accum_value

        #Now we can call the reduce function, save its output, print it to screen, and we're 
        #done!
        #(Note that for base value we are just passing in an empty object, {})
        reduce_results = reducer(my_reducer, map_output, {})

        #Print results to screen
        for result in reduce_results:
            print('word: {}, lines: {}'.format(result, reduce_results[result]))

You can do it this way:

a = ['foo', 'bar', 'cox', 'less', 'more']
b = ['foo', 'cox', 'complex', 'list']
c = list(set(a).intersection(set(b)))

This way, c will be:

['cox', 'foo']

Another way to accomplish this is using a Python list comprehension:

c = [x for x in a if x in b]

I didn't test which way is fastest, but I think it's the one using sets...
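If you want to check, an informal comparison with timeit looks like the sketch below (sample data made up here). Note the two approaches are not strictly equivalent: the set version deduplicates the result and does not preserve order, while the comprehension keeps every matching occurrence in a's order.

```python
import timeit

a = ['foo', 'bar', 'cox', 'less', 'more'] * 20
b = ['foo', 'cox', 'complex', 'list'] * 20

# Time each approach; membership tests in a list are linear, so the
# set version tends to scale better as the lists grow
t_set = timeit.timeit(lambda: list(set(a).intersection(b)), number=200)
t_comp = timeit.timeit(lambda: [x for x in a if x in b], number=200)
print('set: %.5fs  comprehension: %.5fs' % (t_set, t_comp))
```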
