How to create an inverted index in Java
I want to build an inverted index in Java. I have the cran data set of 1400 text files. I can count the frequency of each term/word, and I can return the number of times a word appears in the entire collection, but I have not been able to return which documents the word appears in.
I want the output in the following form:
term1: doc1:2, doc2:3
term2: doc1:3, doc4:1
... and so on
Here a term is a word in a document file, and doc1:2 means that term1 appears in doc1 2 times.
This is the code I have so far:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

// The class declaration is not shown in the original post; the name is assumed.
public class WordCount {
    public static void main(String[] args) throws FileNotFoundException {
        // word -> frequency within the document currently being read
        Map<String, Integer> m = new HashMap<>();
        String wrd;
        for (int i = 1; i <= 2; i++) {
            Scanner tdsc = new Scanner(new File("D:\\logs\\steem" + i + ".txt"));
            while (tdsc.hasNext()) {
                wrd = tdsc.next();
                Integer freq = m.get(wrd);
                m.put(wrd, (freq == null) ? 1 : freq + 1);
            }
            System.out.println(m.size() + " distinct words" + " steem" + i);
            System.out.println("Doc" + i + "" + m);
            // The counts are discarded here, so nothing links a word to its documents.
            m.clear();
            tdsc.close();
        }
    }
}
Keep a map from each word to the set of documents that contain it, and let each document keep its own per-word counts:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;

// The enclosing class is not shown in the original answer; the name is assumed.
public class InvertedIndex {
    public static void main(String[] args) throws FileNotFoundException {
        // word -> documents that contain it
        Map<String, Set<Doc>> wordDocMap = new HashMap<>();
        for (int i = 1; i <= 2; i++) {
            Doc document = new Doc("doc" + i);
            try (Scanner tdsc = new Scanner(new File("D:\\logs\\steem" + i + ".txt"))) {
                while (tdsc.hasNext()) {
                    String word = tdsc.next();
                    document.put(word);
                    // LinkedHashSet keeps documents in the order they were read
                    wordDocMap.computeIfAbsent(word, w -> new LinkedHashSet<>()).add(document);
                }
            }
        }
        StringBuilder builder = new StringBuilder();
        for (Map.Entry<String, Set<Doc>> entry : wordDocMap.entrySet()) {
            String word = entry.getKey();
            builder.append(word).append(": ");
            for (Doc document : entry.getValue()) {
                builder.append(document.getDocName()).append(":").append(document.getCount(word));
                builder.append(", ");
            }
            // drop the trailing ", " (every word has at least one document)
            builder.setLength(builder.length() - 2);
            builder.append("\n");
        }
        System.out.println(builder);
    }

    // One document: its name plus the frequency of each word in it.
    static class Doc {
        private final String docName;
        private final Map<String, Integer> m = new HashMap<>();

        public Doc(String docName) {
            this.docName = docName;
        }

        public void put(String word) {
            m.merge(word, 1, Integer::sum);
        }

        public Integer getCount(String word) {
            return m.get(word);
        }

        public String getDocName() {
            return this.docName;
        }
    }
}
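For comparison, the same index can probably be kept in a single nested map (term -> document name -> count) without a separate Doc class. The following is only a minimal sketch of that idea, not the answer's method: the class name NestedMapIndex, the helpers buildIndex and format, and the in-memory sample documents are all made up for illustration and stand in for the files read with Scanner above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical class and method names, used only for this sketch.
public class NestedMapIndex {

    // Builds term -> (document name -> occurrences of the term in that document).
    static Map<String, Map<String, Integer>> buildIndex(Map<String, List<String>> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            for (String word : doc.getValue()) {
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    // Renders one line per term, e.g. "term: doc1:2, doc2:3".
    static String format(Map<String, Map<String, Integer>> index) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Map<String, Integer>> e : index.entrySet()) {
            StringJoiner postings = new StringJoiner(", ");
            e.getValue().forEach((doc, count) -> postings.add(doc + ":" + count));
            out.append(e.getKey()).append(": ").append(postings).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data; in the question these words would come from the text files.
        Map<String, List<String>> docs = new TreeMap<>();
        docs.put("doc1", Arrays.asList("flow", "wing", "flow"));
        docs.put("doc2", Arrays.asList("wing", "shock"));
        System.out.print(format(buildIndex(docs)));
    }
}
```

With the sample data this prints:

flow: doc1:2
shock: doc2:1
wing: doc1:1, doc2:1

TreeMap is used here only so that terms and documents come out sorted; a HashMap would work too if order does not matter.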