
Multiple Word Search not Working Correctly (Python)

I am working on a project that requires searching a file for multiple keywords. For example, if a file had 100 occurrences of the word "Tomato", 500 of "Bread", and 20 of "Pickle", I would want to search the file for "Tomato" and "Bread" and get the number of times each one occurs. I found people on this site with the same question, but only for other languages.

I have a working program that lets me search by column name and tally how many times something shows up in that column, but I want to make something a bit more precise. Here is my code:

import os

def start():
    location = raw_input("What is the folder containing the data you like processed located? ")
    #location = "C:/Code/Samples/Dates/2015-06-07/Large-Scale Data Parsing/Data Files"
    if os.path.exists(location) == True: #Tests to see if user entered a valid path
        file_extension = raw_input("What is the file type (.txt for example)? ")
        search_for(location,file_extension)
    else:
        print "I'm sorry, but the file location you have entered does not exist. Please try again."
        start()

def search_for(location,file_extension):
    querylist = []
    n = 5
    while n == 5:
        search_query = raw_input("What would you like to search for in each file? Use 'Done' to indicate that you have finished your request. ")
        #list = ["CD90-N5722-15C", "CD90-NB810-4C", "CP90-N2475-8", "CD90-VN530-22B"]
        if search_query == "Done":
            print "Your queries are:",querylist
            print ""
            content = os.listdir(location)
            run(content,file_extension,location,querylist)
            n = 0
        else:
            querylist.append(search_query)
            continue


def run(content,file_extension,location,querylist):
    for item in content:
        if item.endswith(file_extension):
            search(location,item,querylist)
    quit()

def search(location,item,querylist):
    with open(os.path.join(location,item), 'r') as f:
        countlist = []
        for search in querylist: #any search value after the first one is incorrectly reporting "0"
            countsearch = 0
            for line in f:
                if search in line:
                    countsearch = countsearch + 1
            countlist.append(search)
            countlist.append(countsearch) #mechanism to update countsearch is not working for any value after the first
        print item, countlist

start()

If I use that code, the last part (def search) does not work correctly. Every search term after the first one I enter returns "0", even though a file may contain up to 500,000 occurrences of that term.

I was also wondering, since I have to index 5 files with 1,000,000 lines each, whether I could write an additional function to count how many times "Lettuce" occurs across all the files.

I cannot post the files here due to their size and content. Any help would be greatly appreciated.

Edit

I also have this piece of code. If I use it, I get the correct count for each term, but it would be much better to let the user enter as many searches as they want:

def check_start():
    #location = raw_input("What is the folder containing the data you like processed located? ")
    location = "C:/Code/Samples/Dates/2015-06-07/Large-Scale Data Parsing/Data Files"
    content = os.listdir(location)
    for item in content:
        if item.endswith("processed"):
            countcol1 = 0
            countcol2 = 0
            countcol3 = 0
            countcol4 = 0
            #print os.path.join(currentdir,item)
            with open(os.path.join(location,item), 'r') as f:
                for line in f:
                    if "CD90-N5722-15C" in line:
                        countcol1 = countcol1 + 1
                    if "CD90-NB810-4C" in line:
                        countcol2 = countcol2 + 1
                    if "CP90-N2475-8" in line:
                        countcol3 = countcol3 + 1
                    if "CD90-VN530-22B" in line:
                        countcol4 = countcol4 + 1
            print item, "CD90-N5722-15C", countcol1, "CD90-NB810-4C", countcol2, "CP90-N2475-8", countcol3, "CD90-VN530-22B", countcol4

You are trying to iterate over your file more than once. After the first time, the file pointer is at the end so subsequent searches will fail because there's nothing left to read.
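
To see the effect in isolation, here is a minimal sketch. The helper names `count_lines_twice` and `count_lines_twice_with_seek` and the temporary demo file are made up for illustration and are not part of the question's code. The first helper reads the same open file twice without rewinding; the second rewinds with `f.seek(0)` for comparison. It is written so that it should run on both Python 2.7 and Python 3.

```python
import os
import tempfile

def count_lines_twice(path, term):
    """Count lines containing term in two passes over one open file, no rewind."""
    with open(path, 'r') as f:
        first = sum(1 for line in f if term in line)
        # The file object is exhausted now, so this loop sees no lines at all.
        second = sum(1 for line in f if term in line)
    return first, second

def count_lines_twice_with_seek(path, term):
    """Same two passes, but rewind to the start of the file in between."""
    with open(path, 'r') as f:
        first = sum(1 for line in f if term in line)
        f.seek(0)  # move the file pointer back to the beginning
        second = sum(1 for line in f if term in line)
    return first, second

# Build a tiny throwaway file to run both helpers against.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, 'w') as out:
    out.write("Tomato\nBread\nTomato Bread\n")

no_seek = count_lines_twice(demo_path, "Tomato")
with_seek = count_lines_twice_with_seek(demo_path, "Tomato")
os.remove(demo_path)

print(no_seek)    # (2, 0): the second pass found nothing
print(with_seek)  # (2, 2): the second pass re-read the file
```

The `(2, 0)` result is the same symptom as in the question: the first term is counted correctly and every later one reports 0.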

If you add the line f.seek(0) at the end of each search, it resets the pointer to the start of the file before the next read:

def search(location,item,querylist):
    with open(os.path.join(location,item), 'r') as f:
        countlist = []
        for search in querylist:
            countsearch = 0
            for line in f: #counts the lines that contain the search term
                if search in line:
                    countsearch = countsearch + 1
            countlist.append(search)
            countlist.append(countsearch)
            f.seek(0) #rewind so the next search term reads the file from the start
    print item, countlist

PS. I've guessed at the indentation... You really shouldn't use tabs.

I'm not sure I get your question completely, but how about something like this?

import os

def check_start():

    raw_search_terms = raw_input('Enter search terms separated by a comma: ')
    # strip surrounding whitespace so "Tomato, Bread" gives "Tomato" and "Bread"
    search_term_list = [s.strip() for s in raw_search_terms.split(',')]

    #location = raw_input("What is the folder containing the data you like processed located? ")
    location = "C:/Code/Samples/Dates/2015-06-07/Large-Scale Data Parsing/Data Files"
    content = os.listdir(location)

    for item in content:
        if item.endswith("processed"):
            # create a dictionary of search terms with their counts (initialized to 0)
            search_term_count_dict = dict(zip(search_term_list, [0 for s in search_term_list]))

            # read the file once, checking every search term against each line
            with open(os.path.join(location, item), 'r') as f:
                for line in f:
                    for s in search_term_list:
                        if s in line:
                            search_term_count_dict[s] += 1

            # report the counts for this file only
            print item
            for key, value in search_term_count_dict.iteritems():
                print key, value
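
For the second part of the question (one total per term across all the files), a possible sketch is to keep a per-file count and add it into a running total. The function names `count_terms` and `count_terms_in_folder`, the sample file names, and the temporary folder are assumptions made for illustration. Like the code in the question, it counts lines that contain a term, not individual occurrences within a line. It avoids Python 2 print statements so that it should run on both Python 2.7 and Python 3.

```python
import os
import shutil
import tempfile
from collections import Counter

def count_terms(path, terms):
    """Return a Counter of how many lines contain each term, reading the file once."""
    counts = Counter(dict.fromkeys(terms, 0))
    with open(path, 'r') as f:
        for line in f:
            for term in terms:
                if term in line:
                    counts[term] += 1
    return counts

def count_terms_in_folder(location, file_extension, terms):
    """Return (per_file, totals) for every file in location with the given extension."""
    per_file = {}
    totals = Counter(dict.fromkeys(terms, 0))
    for item in sorted(os.listdir(location)):
        if item.endswith(file_extension):
            per_file[item] = count_terms(os.path.join(location, item), terms)
            totals.update(per_file[item])  # Counter.update adds the counts together
    return per_file, totals

# Small throwaway folder standing in for the real data files.
demo_dir = tempfile.mkdtemp()
samples = {
    "a.txt": "Lettuce\nTomato\nLettuce Tomato\n",
    "b.txt": "Bread\nLettuce\n",
    "notes.log": "Lettuce\n",  # wrong extension, should be skipped
}
for name, text in samples.items():
    with open(os.path.join(demo_dir, name), 'w') as out:
        out.write(text)

per_file, totals = count_terms_in_folder(demo_dir, ".txt", ["Lettuce", "Tomato"])
shutil.rmtree(demo_dir)

for name in sorted(per_file):
    print("%s %s" % (name, dict(per_file[name])))
print("total %s" % dict(totals))
```

If the number of occurrences matters more than the number of matching lines, replacing the `if term in line` test with `counts[term] += line.count(term)` would count repeated hits on the same line as well.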


 