
scikit-learn to learn and generate list of numbers

I have a large dataset of lists, each containing a few hundred triplets of numbers, mostly integers.

[(50,100,0.5),(20,35,1.0),.....]
[(70,80,0.3),(30,45,2.0),......]
....

I'm looking at sklearn to write a simple generative model that learns the patterns in these data and generates a likely list of triplets, but my background is rather weak, which makes the documentation difficult to follow.

Is there example sklearn code that does a similar job that I can take a look at?

I agree that this question is probably more appropriate for the data science or statistics sites, but I'll take a stab at it.

First, I'll assume that your data is in a pandas dataframe; this is convenient for scikit-learn as well as other Python packages.
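As a minimal sketch of that setup, here is how a list of triplets like the one in the question could be loaded into a dataframe (the column names are placeholders I've chosen; adjust them to whatever your three values actually mean):

```python
import pandas as pd

# A few triplets in the same shape as the question's data
rows = [(50, 100, 0.5), (20, 35, 1.0), (70, 80, 0.3), (30, 45, 2.0)]

# One column per position in the triplet; names are hypothetical
df = pd.DataFrame(rows, columns=["a", "b", "c"])
print(df.shape)  # (4, 3)
```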

I would first visualize the data. Since you only have three dimensions, a three-dimensional scatter plot might be useful. For instance, see here.

Another useful way to plot the data is to use pair plots. The seaborn package makes this very easy. See here. Pair plots are useful because they show distributions of each of the variables/features, as well as correlations between pairs of features.

At this point, creating a generative model depends on what the plots tell you. If, for instance, all of the variables are independent of one another, then you simply need to estimate the pdf for each variable independently (for instance, using kernel density estimation, which is also implemented in seaborn), and then generate new samples by drawing values from each of the three distributions separately and combining these values in a single tuple.
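The independent-variables case can be sketched with scikit-learn's `KernelDensity`, fitting one 1-D estimator per column and sampling each marginal separately (the data below are made-up stand-ins, and this is only valid if the columns really are independent):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 200 triplets with independent columns
data = np.column_stack([
    rng.normal(50, 10, 200),    # first element of each triplet
    rng.normal(80, 15, 200),    # second element
    rng.normal(1.0, 0.3, 200),  # third element
])

# Fit one 1-D KDE per column (appropriate only under independence)
kdes = [KernelDensity(bandwidth=1.0).fit(col.reshape(-1, 1)) for col in data.T]

# Draw 5 new triplets by sampling each marginal and recombining
samples = np.column_stack(
    [kde.sample(5, random_state=0).ravel() for kde in kdes]
)
print(samples.shape)  # (5, 3)
```

The bandwidth here is a guess; in practice you would tune it, e.g. with cross-validation.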

If the variables are not independent, then the task becomes more complicated, and probably warrants a separate post on the statistics site. For instance, your samples could be generated from different clusters, possibly overlapping, in which case something like a mixture model might be useful.

Here is a small code example along those lines (note that this is a discriminative model, not a generative one):

import numpy as np
from sklearn.linear_model import LinearRegression

#generate random numpy array of the size 10,3
X_train = np.random.random((10,3))
y_train = np.random.random((10,3))
X_test = np.random.random((10,3))

#define the regression
clf = LinearRegression()

#fit & predict (predict returns numpy array of the same dimensions)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Otherwise here are more examples:

http://scikit-learn.org/stable/auto_examples/index.html

The generative model would be sklearn.mixture.GaussianMixture (available since version 0.18).
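A minimal sketch of that generative approach, fitting a GaussianMixture on made-up cluster data and then sampling new triplets from the learned distribution:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical data: two overlapping clusters of triplets
cluster_a = rng.normal([50, 100, 0.5], [5, 10, 0.1], size=(100, 3))
cluster_b = rng.normal([20, 35, 1.0], [3, 5, 0.2], size=(100, 3))
X = np.vstack([cluster_a, cluster_b])

# Fit a 2-component mixture; n_components would need tuning on real data
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# sample() returns the generated points and their component labels
new_samples, labels = gmm.sample(10)
print(new_samples.shape)  # (10, 3)
```

Unlike the per-column KDE approach, the mixture model captures correlations between the three values, which is what you need if the pair plots show the variables are not independent.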
