简体   繁体   中英

Transform Pandas DataFrame to LIBFM format txt file

I want to transform a Pandas Data frame in python to a sparse matrix txt file in the LIBFM format.

Here the format needs to look like this:

4   0:1.5   3:-7.9
2   1:1e-5  3:2
-1  6:1

This file contains three cases. The first column states the target of each of the three case: ie 4 for the first case, 2 for the second and -1 for the third. After the target, each line contains the non-zero elements of x, where an entry like 0:1.5 reads x0 = 1.5 and 3:-7.9 means x3 = −7.9, etc. That means the left side of INDEX:VALUE states the index within x whereas the right side states the value of x.

In total the data from the example describes the following design matrix X and target vector y:

   1.5  0.0   0.0  −7.9  0.0  0.0  0.0
X: 0.0  10−5  0.0  2.0   0.0  0.0  0.0
   0.0  0.0   0.0  0.0   0.0  0.0  1.0

   4
Y: 2
  −1

This is also explained in the Manual file under chapter 2.

Now here is my problem: I have a pandas dataframe that looks like this:

  overall reviewerID        asin       brand         Positive Negative  \
0  5.0   A2XVJBSRI3SWDI  0000031887  Boutique Cutie     3.0       -1
1  4.0   A2G0LNLN79Q6HR  0000031887  Boutique Cutie     5.0       -2
2  2.0   A2R3K1KX09QBYP  0000031887  Boutique Cutie     3.0       -2
3  1.0   A19PBP93OF896   0000031887  Boutique Cutie     2.0       -3
4  4.0   A1P0IHU93EF9ZK  0000031887  Boutique Cutie     2.0       -2

  LDA_0     LDA_1      ...    LDA_98      LDA_99
0  0.000833  0.000833  ...    0.000833    0.000833
1  0.000769  0.000769  ...    0.000769    0.000769
2  0.000417  0.000417  ...    0.000417    0.000417
3  0.000137  0.014101  ...    0.013836    0.000137
4  0.000625  0.000625  ...    0.063125    0.000625

Where "overall" is the target column and all other 105 columns are features.

The 'ReviewerId', 'Asin' and 'Brand' columns needs to be changed to dummy variables. So each unique 'ReviewerID', 'Asin' and brand gets his own column. This means if 'ReviewerID' has 100 unique values you get 100 columns where the value is 1 if that row represents the specific Reviewer and else zero.

All other columns don't need to get reformatted. So the index for those columns can just be the column number.

So the first 3 rows in the above pandas data frame need to be transformed to the following output:

5 0:1 5:1 6:1 7:3 8:-1 9:0.000833 10:0.000833 ... 107:0.000833 108:0.00833
4 1:1 5:1 6:1 7:5 8:-2 9:0.000769 10:0.000769 ... 107:0.000769 108:0.00769
2 2:1 5:1 6:1 7:3 8:-2 9:0.000417 10:0.000417 ... 107:0.000417 108:0.000417

In the LIBFM] package there is a program that can transform the User - Item - Rating into the LIBFM output format. However this program can't get along with this many columns.

Is there an easy way to do this? I have 1 million rows in total.

LibFM executable expects the input in libSVM format that you have explained here. If the file converter in the LibFM package do not work for your data, try the scikit learn sklearn.datasets.dump_svmlight_file method.

Ref: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.dump_svmlight_file.html

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM