
How to correctly use ZipfDistribution from the Apache Commons Math library in Java?

I want to create a data source (in Java) that emits words from a dictionary following a Zipf distribution. That led me to the ZipfDistribution and NormalDistribution classes of the Apache Commons Math library. Unfortunately, there is very little information about how to use these classes. I ran some tests, but I am not sure whether I am using them correctly; I am only following what is written in the documentation of each constructor. The results, however, don't seem to be "well-distributed".

import org.apache.commons.math3.distribution.NormalDistribution;
import org.apache.commons.math3.distribution.ZipfDistribution;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

public class ZipfDistributionDataSource {
    private static final String DISTINCT_WORDS_URL = "https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt";

    public static void main(String[] args) throws Exception {
        ZipfDistributionDataSource zipfDistributionDataSource = new ZipfDistributionDataSource();
        StringBuffer stringBuffer = new StringBuffer(zipfDistributionDataSource.readDataFromResource());
        String[] words = stringBuffer.toString().split("\n");
        System.out.println("size: " + words.length);

        System.out.println("Normal Distribution");
        NormalDistribution normalDistribution = new NormalDistribution(words.length / 2, 1);
        for (int i = 0; i < 10; i++) {
            int sample = (int) normalDistribution.sample();
            System.out.print("sample[" + sample + "]: ");
            System.out.println(words[sample]);
        }

        System.out.println();
        System.out.println("Zipf Distribution");
        ZipfDistribution zipfDistribution = new ZipfDistribution(words.length - 1, 1);
        for (int i = 0; i < 10; i++) {
            int sample = zipfDistribution.sample();
            System.out.print("sample[" + sample + "]: ");
            System.out.println(words[sample]);
        }
    }

    // Downloads the word list and returns it as one newline-separated string.
    private String readDataFromResource() throws Exception {
        URL url = new URL(DISTINCT_WORDS_URL);
        StringBuilder builder = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream) even if reading fails
        try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                builder.append(line).append('\n');
            }
        }
        return builder.toString();
    }
}

Output:

size: 370103
Normal Distribution
sample[185049]: metathesize
sample[185052]: metathetically
sample[185051]: metathetical
sample[185050]: metathetic
sample[185049]: metathesize
sample[185050]: metathetic
sample[185052]: metathetically
sample[185050]: metathetic
sample[185052]: metathetically
sample[185050]: metathetic

Zipf Distribution
sample[11891]: anaphasic
sample[314]: abegge
sample[92]: abandoner
sample[3]: aah
sample[36131]: blepharosynechia
sample[218]: abbozzo
sample[8]: aalii
sample[5382]: affing
sample[6394]: agoraphobia
sample[4360]: adossed

You are using it just fine from a code perspective :) The problem is the assumption that the source material is ordered by Zipf rank, when it is clearly alphabetical. The whole point of using ZipfDistribution is that words[0] must be the most common word (hint: it's 'the'), with roughly twice the frequency of words[1], and so on.
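
To make that concrete, here is a minimal sketch in plain Java, with no Commons Math dependency, of sampling from a list that is already sorted by descending frequency. The class ZipfWordSampler, its method names, and the five-word ranking are hypothetical and only for illustration; the sampler is a hand-rolled inverse-CDF lookup, not the algorithm Commons Math uses internally. Rank k gets probability proportional to 1/k^s, so with s = 1 the first word should come up about twice as often as the second.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Illustrative Zipf sampler over a frequency-ranked word list (hypothetical helper). */
final class ZipfWordSampler {
    private final List<String> rankedWords; // index 0 must hold the MOST frequent word
    private final double[] cumulative;      // cumulative[k - 1] = P(rank <= k)
    private final Random random;

    ZipfWordSampler(List<String> rankedWords, double exponent, long seed) {
        if (rankedWords.isEmpty()) {
            throw new IllegalArgumentException("word list must not be empty");
        }
        this.rankedWords = new ArrayList<>(rankedWords);
        this.random = new Random(seed);
        this.cumulative = new double[rankedWords.size()];
        double total = 0.0;
        for (int rank = 1; rank <= cumulative.length; rank++) {
            total += 1.0 / Math.pow(rank, exponent); // unnormalised weight of this rank
            cumulative[rank - 1] = total;
        }
        for (int i = 0; i < cumulative.length; i++) {
            cumulative[i] /= total; // normalise so the last entry is 1.0
        }
    }

    /** Returns a 1-based rank in [1, N]; rank 1 is the most likely. */
    int sampleRank() {
        double u = random.nextDouble();
        int pos = Arrays.binarySearch(cumulative, u);
        int index = pos >= 0 ? pos : -pos - 1; // first entry that is >= u
        return Math.min(index, cumulative.length - 1) + 1;
    }

    /** Maps the 1-based rank back to a 0-based list index. */
    String sampleWord() {
        return rankedWords.get(sampleRank() - 1);
    }

    public static void main(String[] args) {
        // Assumed ranking for illustration; a real list should come from a frequency corpus.
        List<String> ranked = Arrays.asList("the", "of", "and", "to", "a");
        ZipfWordSampler sampler = new ZipfWordSampler(ranked, 1.0, 42L);
        int[] counts = new int[ranked.size()];
        for (int i = 0; i < 100_000; i++) {
            counts[sampler.sampleRank() - 1]++;
        }
        for (int i = 0; i < counts.length; i++) {
            System.out.println(ranked.get(i) + ": " + counts[i]);
        }
    }
}
```

With Commons Math itself the same mapping applies: ZipfDistribution.sample() returns a rank between 1 and numberOfElements, so the matching array index is sample() - 1, and the array has to be sorted by descending frequency before sampling.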

https://en.wikipedia.org/wiki/Word_lists_by_frequency
https://en.wikipedia.org/wiki/Most_common_words_in_English
