
Python 3 - Splitting a text file by word, counting occurrences and returning a list of sorted tuples

I've already made a post about this, but since then I've managed to solve the issues I initially had, and editing the old question would only make things more complicated.

I have a text file with about 10,000 words. The function should output a list of tuples, each containing a word and the number of occurrences of that word, in descending order. For example: out = [("word1",10),("word3",8),("word2",5)...]

So this is my code so far. (Keep in mind that it does currently work to a certain extent; it is just extremely inefficient.)

def text(inp):
    with open(inp,"r") as file:
        content = file.readlines()
        delimiters = ["\n"," ",",",".","?","!",":",";","-"]
        words = content
        spaces = ["","'",'']
        out = []
       
        for delimiter in delimiters:
            new_words = []
            
            for word in words:
                if word in spaces:
                    continue
                new_words += word.split(delimiter)
            words = new_words
  
        for word in words:
            x = (words.count(word),word)
            out.append(x)
            
    return out

I found some help from older posts on Stack Overflow for the first few lines. The input should be the file path, and this works in my case. The first part (the lines I found here) works nicely, although the resulting list contains elements such as empty strings. My questions now are:

How can I sort the output so that the word with the most occurrences comes first, followed by the rest in descending order? Currently the order is random. Also, I'm not sure whether the same word appears multiple times in this list; if it does, how can I make it appear only once in the output?

Also, how can I make this code more efficient? I timed it with time.time() and it took almost 419 seconds, which is terribly inefficient, since the task states it should take less than 30 seconds.
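For the sorting part specifically, Python's built-in sorted with a key function handles descending order directly. A minimal sketch, assuming the (word, count) tuple shape described above (the sample values are hypothetical):

```python
# hypothetical sample data in the (word, count) shape the question describes
pairs = [("word2", 5), ("word1", 10), ("word3", 8)]

# sort by the count (second tuple element), largest first
ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
print(ranked)  # [('word1', 10), ('word3', 8), ('word2', 5)]
```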

I apologize in advance for any mistakes and for my lack of knowledge on this topic.

Instead of running so many loops and conditions, you can use re.split:

import re

def text(inp):
    with open(inp, "r") as file:
        content = file.readlines()
        out = []
        temp_list = []
        for line in content:
            # split each line on any delimiter character in a single pass
            temp_list.extend(re.split(r"[\n ,.?!:;-]", line))
        # drop the empty strings re.split leaves between adjacent delimiters
        temp_list = [word for word in temp_list if word]
        # iterating over a set guarantees each word appears only once in out
        for word in set(temp_list):
            out.append((word, temp_list.count(word)))
    return out
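The count call above still rescans the whole list once per unique word. The standard library's collections.Counter tallies everything in one pass and its most_common() already returns the tuples sorted in descending order, which addresses both the sorting and the efficiency questions. A sketch of this alternative, keeping the same delimiter set:

```python
import re
from collections import Counter

def text(inp):
    with open(inp, "r") as file:
        content = file.read()
    # split on any run of delimiter characters; filter out empty strings
    words = [w for w in re.split(r"[\n ,.?!:;'-]+", content) if w]
    # Counter tallies each word in one pass;
    # most_common() returns (word, count) tuples sorted by count, descending
    return Counter(words).most_common()
```

Since Counter is a single O(n) pass followed by one sort, this should stay well under the 30-second limit for a 10,000-word file.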
