Isolating unique values in an algorithm

Question

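I am breaking a series of 90,000+ strings down into discrete lists of the individual, non-duplicated pairs of words contained in the strings, together with the rxcui id value associated with each string. I have developed a method that tries to accomplish this, but it produces a lot of redundancy. Analysis of the data shows that there are about 12,000 unique words in the 90,000+ source strings, after I clean and format the contents of the strings.

How can I change the code below so that it avoids creating redundant rows in the destination 2D ArrayList (shown below the code)?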

    public static ArrayList<ArrayList<String>> getAllWords(String[] tempsArray){//int count = tempsArray.length;
        int fieldslenlessthan2 = 0;//ArrayList<String> outputarr = new ArrayList<String>();
        ArrayList<ArrayList<String>> twoDimArrayList= new ArrayList<ArrayList<String>>();
        int idx = 0;
        for (String s : tempsArray) {
            String[] fields = s.split("\t");//System.out.println(" --- fields.length is: "+fields.length);
            if(fields.length>1){
                ArrayList<String> row = new ArrayList<String>();
                System.out.println("fields[0] is: "+fields[0]);
                String cleanedTerms = cleanTerms(fields[1]);
                String[] words = cleanedTerms.split(" ");
                for(int j=0;j<words.length;j++){
                    String word=words[j].trim();
                    word = word.toLowerCase();
                    if(isValidWord(word)){//outputarr.add(word);
                        System.out.println("words["+j+"] is: "+word);
                        row.add(word_id);//WORD_ID NEEDS TO BE CREATED BY SOME METHOD.
                        row.add(fields[0]);
                        row.add(word);
                        twoDimArrayList.add(row);
                        idx += 1;
                    }
                }
            }else{fieldslenlessthan2 += 1;}
        }
        System.out.println("........... fieldslenlessthan2 is: "+fieldslenlessthan2);
        return twoDimArrayList;
    }

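Currently, the output of the method above looks like the following, where some name values have many rxcui values, and some rxcui values have many name values:

How do I change the code above so that the output is a list of unique name/rxcui pairs, summarizing all of the relevant data in the current output while removing only the redundancies?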

Answer 1

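If you just need a collection of all the words, use a HashSet. Sets are mainly used for containment logic. If you need to associate a value with your string, use a HashMap.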

public HashSet<String> getUniqueWords(String[] stringArray) {
  HashSet<String> uniqueWords = new HashSet<String>();
  for (String str : stringArray) {
    uniqueWords.add(str);
  }
  return uniqueWords;
}

This will give you a set of all the unique strings in your array. If you need an ID for them, use a HashMap.

String[] strList; // your String array
int idCounter = 0;
HashMap<String, Integer> stringIDMap = new HashMap<String, Integer>();

for (String str : strList) {
  // Only assign an ID the first time a string is seen
  if (!stringIDMap.containsKey(str)) {
    stringIDMap.put(str, idCounter); // the int is autoboxed to Integer
    idCounter++;
  }
}

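This will give you a HashMap with unique String keys and unique Integer values. To get the ID for a String, do this: stringIDMap.get("myString"); // returns the Integer ID associated with the String "myString"

Update: Based on the question update in the OP, I would suggest creating an object that contains the String value and the rxcui. You can then place these in a Set or HashMap using an implementation similar to the one provided above.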

public MyObject(String str, int rxcui); // The constructor for your new object
MyObject mo1 = new MyObject("hello", 5);

Then either

mySet.add(mo1);

will work, or

myMap.put(mo1.getStr(), mo1.getRxcui());
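One caveat with the map route: a plain put keyed on the word keeps only the last rxcui stored for that word, while the question says a single name can have many rxcui values. A minimal sketch of one way around that is to map each word to a set of rxcui values. The class name RxcuiIndex, the method groupByWord, and the parallel-array input are assumptions made for illustration; they do not appear in the question.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class RxcuiIndex {
    // Hypothetical helper: words[i] was found in the string whose rxcui is rxcuis[i].
    static Map<String, Set<Integer>> groupByWord(String[] words, int[] rxcuis) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int i = 0; i < words.length; i++) {
            // Create the set on first sight of the word; the set drops repeated rxcuis.
            index.computeIfAbsent(words[i], k -> new TreeSet<>()).add(rxcuis[i]);
        }
        return index;
    }

    public static void main(String[] args) {
        String[] words = {"aspirin", "tablet", "aspirin", "aspirin"};
        int[] rxcuis = {5, 5, 7, 5};
        Map<String, Set<Integer>> index = groupByWord(words, rxcuis);
        System.out.println(index.get("aspirin")); // [5, 7]
        System.out.println(index.get("tablet"));  // [5]
    }
}
```

Each word appears once as a key, and each (word, rxcui) combination is stored at most once.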

Answer 2

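What is the purpose of the unique word ID? Isn't the word itself unique enough, since you are not keeping duplicates?

A very basic approach would be to keep a counter running as you check for new words. For every word that does not already exist, you increment the counter and use the new value as the unique ID.

Finally, may I suggest that you use a HashMap instead. It lets you insert and retrieve words in O(1) time on average. I'm not sure exactly what you are trying to do, but I think a HashMap may give you more scope.

Edit 2: Perhaps something more along these lines. This should help you out.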
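The counter idea can be sketched as follows. This is only an illustration: the class name WordIdCounter and the method assignIds are made up here, and the input is assumed to be an array of already-cleaned words.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WordIdCounter {
    // Assigns 1, 2, 3, ... to words in order of first appearance.
    static Map<String, Integer> assignIds(String[] words) {
        Map<String, Integer> ids = new LinkedHashMap<>(); // keeps insertion order
        int counter = 0;
        for (String word : words) {
            if (!ids.containsKey(word)) {
                counter++;              // increment first, then use the new value as the ID
                ids.put(word, counter);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String[] words = {"aspirin", "tablet", "aspirin", "oral", "tablet"};
        System.out.println(assignIds(words)); // {aspirin=1, tablet=2, oral=3}
    }
}
```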

public static Set<DataPair> getAllWords(String[] tempsArray) {
    Set<DataPair> set = new HashSet<>();
    for (String row : tempsArray) {
        // PARSE YOUR STRING DATA
        // the way you were doing it seemed fine but something like this
        String[] rowArray = row.split(" ");
        String word = rowArray[1];
        int id = Integer.parseInt(rowArray[0]);
        DataPair pair = new DataPair(word, id);
        // A HashSet uses equals() and hashCode() to detect duplicates
        set.add(pair);
    }
    return set;
}

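class DataPair {
    private String word;
    private int id;

    public DataPair(String word, int id) {
        this.word = word;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof DataPair) {
            return ((DataPair) o).word.equals(word) && ((DataPair) o).id == id;
        }
        return false;
    }

    // Must be consistent with equals(), otherwise a HashSet will not
    // recognize two equal pairs as duplicates.
    @Override
    public int hashCode() {
        return java.util.Objects.hash(word, id);
    }
}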
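For reference, here is a self-contained sketch of the same idea applied to the question's tab-separated input, where the first field is the rxcui and the second is the name. Several things here are assumptions: the class names UniquePairsSketch and WordRxcui are made up, the rxcui is kept as a String, and plain lowercasing plus an empty-string check stand in for the question's cleanTerms and isValidWord, which are not shown.

```java
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

class UniquePairsSketch {
    // Hypothetical value class holding one (word, rxcui) pair.
    static final class WordRxcui {
        final String word;
        final String rxcui;

        WordRxcui(String word, String rxcui) {
            this.word = word;
            this.rxcui = rxcui;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof WordRxcui)) return false;
            WordRxcui other = (WordRxcui) o;
            return word.equals(other.word) && rxcui.equals(other.rxcui);
        }

        @Override
        public int hashCode() {
            return Objects.hash(word, rxcui); // consistent with equals()
        }

        @Override
        public String toString() {
            return rxcui + ":" + word;
        }
    }

    static Set<WordRxcui> uniquePairs(String[] lines) {
        Set<WordRxcui> pairs = new LinkedHashSet<>(); // keeps first-seen order
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length < 2) continue; // skip rows without a name column
            for (String raw : fields[1].split("\\s+")) {
                String word = raw.trim().toLowerCase();
                // A new object is created per pair; the set silently drops repeats.
                if (!word.isEmpty()) pairs.add(new WordRxcui(word, fields[0]));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String[] lines = {"101\tAspirin Tablet", "101\taspirin oral", "202\tAspirin", "bad-row"};
        System.out.println(uniquePairs(lines));
        // [101:aspirin, 101:tablet, 101:oral, 202:aspirin]
    }
}
```

The duplicate "aspirin" for rxcui 101 is stored once, while "aspirin" under rxcui 202 is kept because it is a different pair.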

2 solutions

Solution 1: 2, 2014-03-20 20:21:51

Solution 2: 1, accepted, 2014-03-20 20:07:48
