Most efficient way to check whether a pair exists in a text

Question

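Introduction:

One of the features used by many sentiment analysis programs is computed by assigning specific scores to the relevant unigrams, bigrams or pairs according to a lexicon. In more detail:

An example lexicon could be: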

//unigrams
good 1
bad -1
great 2
//bigrams
good idea 1
bad idea -1
//pairs (--- stands for whatever):
hold---up   -0.62
how---i still -0.62

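Given an example text T, for each unigram, bigram or pair in T I want to check whether a corresponding entry exists in the lexicon.

The unigram/bigram part is easy: I load the lexicon into a Map and then iterate over my text, checking whether each word exists in the dictionary. My problem is detecting pairs.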
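For reference, the unigram/bigram lookup described above could look roughly like the following. This is only a sketch: the class name `NgramLexiconScorer`, the whitespace tokenization, the lower-casing and the idea of summing the scores are assumptions, not something stated in the question.

```java
import java.util.Locale;
import java.util.Map;

final class NgramLexiconScorer {
    // Hypothetical helper: sums the scores of every unigram and bigram of the
    // text that appears in the lexicon (keys are "word" or "word word").
    static double score(String text, Map<String, Double> lexicon) {
        // Assumption: tokens are separated by whitespace and the lexicon is lower-case.
        String[] tokens = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        double total = 0.0;
        String previous = null;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // happens only for an empty input text
            }
            total += lexicon.getOrDefault(token, 0.0);
            if (previous != null) {
                total += lexicon.getOrDefault(previous + " " + token, 0.0);
            }
            previous = token;
        }
        return total;
    }
}
```

Each token costs at most two hash lookups, so the scan is linear in the length of the text and independent of the lexicon size.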

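My question:

One way to check whether a particular pair exists in my text is to iterate over the whole pair lexicon and run a regular expression against the text, checking for each entry whether "start_of_pair.*end_of_pair" occurs in the text. This seems very wasteful, because I have to traverse the entire lexicon again for every text. Any ideas on how to do this in a smarter way?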
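To make the cost concrete, the regex approach described above might be sketched as follows. The class name `NaivePairMatcher`, the `---` key format taken from the example lexicon, the word-boundary anchors and the summing of scores are assumptions made for illustration.

```java
import java.util.Map;
import java.util.regex.Pattern;

final class NaivePairMatcher {
    // Hypothetical baseline: one regex search over the text per lexicon entry,
    // so the cost grows with (lexicon size) x (text length).
    static double score(String text, Map<String, Double> pairLexicon) {
        double total = 0.0;
        for (Map.Entry<String, Double> entry : pairLexicon.entrySet()) {
            String[] halves = entry.getKey().split("---", 2);
            if (halves.length != 2) {
                continue; // not a pair entry
            }
            // \b keeps "hold" from matching inside "holdup"; DOTALL lets ".*" span line breaks.
            Pattern pattern = Pattern.compile(
                    "\\b" + Pattern.quote(halves[0]) + "\\b.*\\b" + Pattern.quote(halves[1]) + "\\b",
                    Pattern.DOTALL);
            if (pattern.matcher(text).find()) {
                total += entry.getValue();
            }
        }
        return total;
    }
}
```

Precompiling the patterns once would save the compilation cost, but every text would still be searched once per lexicon entry, which is the waste the question is about.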

Related questions: "Most efficient way to check a file for a list of words" and "Java: most efficient way to check if a String is in a word list"

Answer 1

A frequency map of the bigrams can be implemented as:

Map<String, Map<String, Integer>> bigramFrequencyMap = new TreeMap<>();

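Fill the map with the desired bigrams, each with an initial frequency of 0. The outer key is the first lexeme, the inner key is the second lexeme, and the value is the frequency count.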

static final int MAX_DISTANCE = 5;

The lexical scan then keeps the last #MAX_DISTANCE lexemes.

List<Map<String, Integer>> lastLexemesSecondFrequencies = new ArrayList<>();

void processLexeme() {
    // readLexeme() is not shown here; it is assumed to return the next lexeme of the text.
    String lexeme = readLexeme();

    // Check whether the lexeme completes a bigram started by a recent lexeme:
    for (Map<String, Integer> prior : lastLexemesSecondFrequencies) {
        Integer freq = prior.get(lexeme);
        if (freq != null) {
            prior.put(lexeme, 1 + freq);
        }
    }

    // If the lexeme is itself the first half of a bigram, remember its second halves:
    Map<String, Integer> lexemeSecondFrequencies =
            bigramFrequencyMap.get(lexeme);
    if (lexemeSecondFrequencies != null) {
        // Could remove lexemeSecondFrequencies if already present in the list.
        lastLexemesSecondFrequencies.add(0, lexemeSecondFrequencies); // addFirst
        if (lastLexemesSecondFrequencies.size() > MAX_DISTANCE) {
            lastLexemesSecondFrequencies.remove(lastLexemesSecondFrequencies.size() - 1); // removeLast
        }
    }
}

The optimization is to keep only the second halves of the bigrams, and to process only registered bigrams.
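A self-contained variant of this windowed scan might look like the sketch below. It is not the answer's exact code: the names `WindowedPairCounter`, `register`, `accept` and `count` are hypothetical, the window stores the last `MAX_DISTANCE` lexemes themselves (so the distance is measured in tokens), and a pair is counted at most once per closing lexeme even if its first half occurs several times in the window.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class WindowedPairCounter {
    static final int MAX_DISTANCE = 5;

    // first lexeme -> (second lexeme -> frequency)
    private final Map<String, Map<String, Integer>> pairCounts = new HashMap<>();
    // the most recent lexemes, newest first
    private final Deque<String> window = new ArrayDeque<>();

    void register(String first, String second) {
        pairCounts.computeIfAbsent(first, k -> new HashMap<>()).put(second, 0);
    }

    void accept(String lexeme) {
        Set<String> seenFirsts = new HashSet<>();
        for (String prior : window) {
            Map<String, Integer> seconds = pairCounts.get(prior);
            // Count only registered pairs, once per distinct first half.
            if (seconds != null && seconds.containsKey(lexeme) && seenFirsts.add(prior)) {
                seconds.merge(lexeme, 1, Integer::sum);
            }
        }
        window.addFirst(lexeme);
        if (window.size() > MAX_DISTANCE) {
            window.removeLast();
        }
    }

    int count(String first, String second) {
        Map<String, Integer> seconds = pairCounts.get(first);
        return seconds == null ? 0 : seconds.getOrDefault(second, 0);
    }
}
```

Each incoming lexeme costs at most `MAX_DISTANCE` hash lookups, so the work per text no longer depends on the number of pairs in the lexicon.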

Answer 2

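In the end I solved it this way: I loaded the pair lexicon as a Map<String, Map<String, Float>>, where the first key is the first half of a pair and the inner map holds all possible endings for that start together with the corresponding sentiment value.

Basically, I keep a list of possible endings (enabledTokens) that grows every time I read a new token; I then search this list to see whether the current token is the ending of some earlier pair.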
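Stripped of the bigram handling, the basic idea can be sketched as follows. This is a simplified illustration, not the answer's code: `PairLexiconScanner` and `addPair` are hypothetical names, only single-word halves are handled, and the enabled endings are kept in a map keyed by ending text instead of a list that is scanned linearly.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PairLexiconScanner {
    // first half of a pair -> (ending -> sentiment value)
    private final Map<String, Map<String, Float>> firstPartMap = new HashMap<>();
    // endings enabled so far -> values of the pairs waiting for that ending
    private final Map<String, List<Float>> enabledEndings = new HashMap<>();

    void addPair(String first, String ending, float value) {
        firstPartMap.computeIfAbsent(first, k -> new HashMap<>()).put(ending, value);
    }

    // Returns the values of all pairs that the given token completes.
    List<Float> parseToken(String token) {
        // Check first, enable afterwards, so a token never completes a pair it has just started.
        List<Float> matched =
                new ArrayList<>(enabledEndings.getOrDefault(token, Collections.emptyList()));
        Map<String, Float> endings = firstPartMap.get(token);
        if (endings != null) {
            endings.forEach((ending, value) ->
                    enabledEndings.computeIfAbsent(ending, k -> new ArrayList<>()).add(value));
        }
        return matched;
    }
}
```

As in the answer, matched endings are not removed, so an ending keeps matching for the rest of the text, and a first half that occurs twice enables its endings twice.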

After a few modifications to prevent the previous token from being used immediately as an ending, this is my code:

private Map<String, Map<String, Float>> firstPartMap;
private List<LexiconPair> enabledTokensForUnigrams, enabledTokensForBigrams;
private Queue<List<LexiconPair>> pairsForBigrams; //is initialized with two empty lists
private Token oldToken;

public void parseToken(Token token) {
    String unigram = token.getText();
    String bigram = null;
    if (oldToken != null) {
        bigram = oldToken.getText() + " " + token.getText();
    }

    checkIfPairMatchesAndUpdateFeatures(unigram, enabledTokensForUnigrams);
    checkIfPairMatchesAndUpdateFeatures(bigram, enabledTokensForBigrams);

    List<LexiconPair> pairEndings = toPairs(firstPartMap.get(unigram));
    if (bigram != null) {
        pairEndings.addAll(toPairs(firstPartMap.get(bigram)));
    }
    pairsForBigrams.add(pairEndings);

    enabledTokensForUnigrams.addAll(pairEndings);
    enabledTokensForBigrams.addAll(pairsForBigrams.poll());

    oldToken = token;
}
private void checkIfPairMatchesAndUpdateFeatures(String text, List<LexiconPair> listToCheck) {
    Iterator<LexiconPair> iter = listToCheck.iterator();
    while (iter.hasNext()) {
        LexiconPair next = iter.next();
        if (next.getText().equals(text)) {
            float val = next.getValue();
            POLARITY polarity = getPolarity(val);
            for (LexiconFeatureSubset lfs : lexiconsFeatures) {
                lfs.handleNewValue(Math.abs(val), polarity);
            }
            //iter.remove();
            //return; //remove only 1 occurrence
        }
    }
}

2 answers

Answer 1
0 2014-01-20 19:25:48

Answer 2
0 2014-01-22 19:07:27
