
Get Significant Text aggregation on text field with stop words filtering

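I'm trying to find the most commonly used words in a text field (called "text") of my index. I've managed to do this with the Significant Text aggregation, but some of the returned buckets contain words like "the", "a", "they", etc. How can I filter them out? I tried using a stop words analyzer, but it didn't help. I also tried using "gnd", which is supposed to help with this problem, but I still got roughly the same results.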

My query:

GET feed/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "by_sentiment": {
        "terms": {
            "field": "sentiment.Sentiment.keyword",
            "size": 50
        },
        "aggs": {
            "trending_topics": {
                "significant_text": {
                    "field": "text",
                    "filter_duplicate_text": true
                }
            }
        }
    },
    "by_level": {
        "terms": {
            "field": "level",
            "size": 50
        },
        "aggs": {
            "trending_topics": {
                "significant_text": {
                    "field": "text",
                    "filter_duplicate_text": true
                }
            }
        }
    },
    "by_asset": {
        "terms": {
            "field": "asset_id",
            "size": 50
        },
        "aggs": {
            "trending_topics": {
                "significant_text": {
                    "field": "text",
                    "filter_duplicate_text": true
                }
            }
        }
    }
  }
}

I managed to solve this by adding an

"exclude": ["list","of","stop","words"]

to each "significant_text" aggregation. For anyone interested, this is the exact list I used:

"exclude": ["t.co", "https", "rt", "l", "they", "i", "I", "you", "this", "that", "but", "its", "s", "for", "there", "going", "try", "into", "me", "don’t", "every", "because", "got", "thank", "thanks", "looks", "cha", "been", "would", "my", "from", "now", "and", "im", "mine", "u", "the", "to", "can't", "than", "cant", "in", "self", "of", "with", "your", "is", "do", "not", "ii", "despite", "however", "there's", "isn't", "seems", "though", "a", "via", "will", "also", "that's", "even", "we", "anymore", "anyone", "all", "have", "on", "if", "sure", "as", "at", "are", "it", "so", "be", "are", "everyone", "just", "can", "by", "what", "does", "please", "an", "these", "de", "how", "he", "haha", "were", "us", "should", "when", "or", "o", "another", "those", "am", "yourselves", "don't", "without", "then", "gotta", "myself", "we'll", "our", "we've", "www.reddit.com", "know", "number", "which", "while", "name", "comments", "up", "you're", "seem", "isn't", "being", "them", "ha", "perhaps", "about", "has", "each", "something", "haven't", "their", "t.me", "r", "est", "la", "le", "vous", "et", "à", "les", "pour", "avec", "el", "en", "que", "para", "no"]
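
Since the same "exclude" list has to be repeated in all three "significant_text" aggregations, it may be easier to generate the request body than to maintain it by hand. The following is only a sketch of that idea, not the exact code used above: the helper names `significant_text_agg` and `build_body` are made up for illustration, and `STOP_WORDS` is a shortened stand-in for the full list. It builds the body as a plain Python dict and prints it as JSON; sending it to `feed/_search` is left to whichever client you use.

```python
import json

# Shortened stand-in for the full stop-word list shown above (assumption:
# replace it with your own list).
STOP_WORDS = ["the", "a", "they", "https", "rt"]


def significant_text_agg(field, exclude):
    """Hypothetical helper: a significant_text aggregation that skips `exclude` terms."""
    return {
        "significant_text": {
            "field": field,
            "filter_duplicate_text": True,
            "exclude": list(exclude),
        }
    }


def build_body(group_fields, text_field="text", exclude=STOP_WORDS, size=50):
    """Hypothetical helper: one terms aggregation per grouping field,
    each with the same significant_text sub-aggregation."""
    aggs = {}
    for name, field in group_fields.items():
        aggs[name] = {
            "terms": {"field": field, "size": size},
            "aggs": {"trending_topics": significant_text_agg(text_field, exclude)},
        }
    return {"size": 0, "query": {"match_all": {}}, "aggs": aggs}


# Same three groupings as in the query from the question.
body = build_body({
    "by_sentiment": "sentiment.Sentiment.keyword",
    "by_level": "level",
    "by_asset": "asset_id",
})

# Serializing through json.dumps also guarantees valid JSON (no trailing commas).
print(json.dumps(body, indent=2))
```

With this layout the stop-word list lives in one place, so adding or removing a word updates all three aggregations at once.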
