

How do I remove the uninteresting words and characters with my script?

I can't figure out what I'm doing wrong here. This is just one part of my project; in this last part I'm trying to exclude punctuation and the uninteresting_words. I can run my script all the way through, but it doesn't remove the punctuation or the uninteresting_words. I've tried turning the punctuation string into a list, but instead of a list with each character as a separate item, I just get a list with all of the characters as a single item. As you can see in the code below, I tried saving punctuations.split() as a new variable named char, and tried several kinds of if statements and iteration over the words in file_contents.
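For reference, a minimal sketch (separate from the project code) that reproduces the behaviour described above, using the same punctuations string:

punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''

# str.split() with no argument splits on whitespace; the punctuation
# string contains no whitespace, so the whole string comes back as a
# single list item.
char = punctuations.split()
print(char)       # a one-item list containing the entire punctuation string
print(len(char))  # 1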


def calculate_frequencies(file_contents):   # file_contents is being passed in through another 
                                            # part of the code that comes before this def
    # Here is a list of punctuations and uninteresting words you can use to process your text
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    uninteresting_words = ["the", "a", "to", "if", "is", "it", "of", "and", "or", "an", "as", "i", "me", "my", \
    "we", "our", "ours", "you", "your", "yours", "he", "she", "him", "his", "her", "hers", "its", "they", "them", \
    "their", "what", "which", "who", "whom", "this", "that", "am", "are", "was", "were", "be", "been", "being", \
    "have", "has", "had", "do", "does", "did", "but", "at", "by", "with", "from", "here", "when", "where", "how", \
    "all", "any", "both", "each", "few", "more", "some", "such", "no", "nor", "too", "very", "can", "will", "just"]
    
    # LEARNER CODE START HERE
    char = punctuations.split()
    result = {}
    for words in file_contents.split():
      if words == uninteresting_words:
        pass
      if words.isalnum() and words != uninteresting_words:
        if words not in result:
            result[words]=1
        else:
            result[words]+=1
            
    print(result) # this line and the following 2 are just so I can see how they show up
    print(char)
    print(uninteresting_words)
    
    
    #wordcloud-this part and after is ok and is working as expected with the code that follows 
    cloud = wordcloud.WordCloud()
    cloud.generate_from_frequencies(result)
    return cloud.to_array()

As the comments say, you should use if words in uninteresting_words: instead.
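A small standalone sketch of the difference: comparing a string to the whole list with == is always False, while in tests membership:

uninteresting_words = ["the", "a", "to"]

word = "the"
print(word == uninteresting_words)  # False: a string never equals a list
print(word in uninteresting_words)  # True: 'in' checks each element for membership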

In any case, I don't think your input text is being split on the special characters in punctuations. str.split() splits on whitespace by default. Use words.strip(punctuations) to strip the punctuation off instead.
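A minimal sketch of both points, reusing the punctuation characters from the question:

punctuations = "!()-[]{};:'\"\\,<>./?@#$%^&*_~"
text = "Hello, world! (This is a test.)"

# split() with no argument splits on runs of whitespace only,
# so punctuation stays attached to the words.
print(text.split())
# ['Hello,', 'world!', '(This', 'is', 'a', 'test.)']

# strip(punctuations) removes those characters from both ends of a
# word (but not from the middle).
print([word.strip(punctuations) for word in text.split()])
# ['Hello', 'world', 'This', 'is', 'a', 'test']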

You also shouldn't use docstring-style triple quotes (''') for ordinary strings. Use ' or " and escape the other characters as needed.
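For example, these two literals produce the same string; each just escapes the embedded quote characters instead of relying on triple quotes:

# Escaping inside a double-quoted string:
punctuations = "!()-[]{};:'\"\\,<>./?@#$%^&*_~"

# Escaping inside a single-quoted string:
also_punctuations = '!()-[]{};:\'"\\,<>./?@#$%^&*_~'

print(punctuations == also_punctuations)  # True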


def calculate_frequencies(file_contents):   # file_contents is being passed in through another 
                                            # part of the code that comes before this def
    # Here is a list of punctuations and uninteresting words you can use to process your text
    punctuations = "!()-[]{};:'\"\\,<>./?@#$%^&*_~"
    uninteresting_words = ["the", "a", "to", "if", "is", "it", "of", "and", "or", "an", "as", "i", "me", "my", \
    "we", "our", "ours", "you", "your", "yours", "he", "she", "him", "his", "her", "hers", "its", "they", "them", \
    "their", "what", "which", "who", "whom", "this", "that", "am", "are", "was", "were", "be", "been", "being", \
    "have", "has", "had", "do", "does", "did", "but", "at", "by", "with", "from", "here", "when", "where", "how", \
    "all", "any", "both", "each", "few", "more", "some", "such", "no", "nor", "too", "very", "can", "will", "just"]
    
    # LEARNER CODE START HERE
    result = {}
    for words in file_contents.split():
        words = words.strip(punctuations)
        if words in uninteresting_words:
            pass
        else:
            if words not in result:
                result[words] = 1
            else:
                result[words] += 1

    print(result) # this line and the following 2 are just so I can see how they show up
    print(punctuations)
    print(uninteresting_words)
    
    cloud = wordcloud.WordCloud()
    cloud.generate_from_frequencies(result)
    return cloud.to_array()

That should do it. This is the solution I needed.

https://www.python.org/dev/peps/pep-0257/
