
Transform Pandas DataFrame to LIBFM format txt file

I want to transform a pandas DataFrame in Python into a sparse-matrix txt file in the LIBFM format.

The format needs to look like this:

4   0:1.5   3:-7.9
2   1:1e-5  3:2
-1  6:1

This file contains three cases. The first column states the target of each case: 4 for the first case, 2 for the second and -1 for the third. After the target, each line lists the non-zero elements of x, where an entry like 0:1.5 reads x0 = 1.5 and 3:-7.9 means x3 = -7.9, etc. In each INDEX:VALUE pair, the left side states the index within x and the right side states the value of x at that index.

In total, the data from the example describes the following design matrix X and target vector y:

   1.5  0.0   0.0  -7.9  0.0  0.0  0.0
X: 0.0  1e-5  0.0   2.0  0.0  0.0  0.0
   0.0  0.0   0.0   0.0  0.0  0.0  1.0

    4
Y:  2
   -1

This is also explained in chapter 2 of the Manual file.

Now here is my problem: I have a pandas DataFrame that looks like this:
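To make the mapping between the text lines and the matrix concrete, here is a minimal sketch that reads the three example lines back into a dense X and y. The function name `parse_libfm` is hypothetical (it is not part of LIBFM), and the matrix width is simply assumed to be the largest index seen plus one.

```python
import numpy as np

def parse_libfm(lines):
    """Parse 'target idx:val idx:val ...' lines into a dense X and a y vector."""
    targets, rows = [], []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        targets.append(float(parts[0]))
        row = {}
        for token in parts[1:]:
            idx, val = token.split(":")
            row[int(idx)] = float(val)
        rows.append(row)
    # Assumption: the number of features is the largest index seen plus one.
    n_features = max((max(r) for r in rows if r), default=-1) + 1
    X = np.zeros((len(rows), n_features))
    for i, row in enumerate(rows):
        for j, v in row.items():
            X[i, j] = v
    return X, np.array(targets)

example = ["4   0:1.5   3:-7.9", "2   1:1e-5  3:2", "-1  6:1"]
X, y = parse_libfm(example)
```

Here X comes out as a 3 x 7 array, with every entry that is not listed in the file left at zero.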

  overall reviewerID        asin       brand         Positive Negative  \
0  5.0   A2XVJBSRI3SWDI  0000031887  Boutique Cutie     3.0       -1
1  4.0   A2G0LNLN79Q6HR  0000031887  Boutique Cutie     5.0       -2
2  2.0   A2R3K1KX09QBYP  0000031887  Boutique Cutie     3.0       -2
3  1.0   A19PBP93OF896   0000031887  Boutique Cutie     2.0       -3
4  4.0   A1P0IHU93EF9ZK  0000031887  Boutique Cutie     2.0       -2

  LDA_0     LDA_1      ...    LDA_98      LDA_99
0  0.000833  0.000833  ...    0.000833    0.000833
1  0.000769  0.000769  ...    0.000769    0.000769
2  0.000417  0.000417  ...    0.000417    0.000417
3  0.000137  0.014101  ...    0.013836    0.000137
4  0.000625  0.000625  ...    0.063125    0.000625

Here "overall" is the target column and all other 105 columns are features.

The 'reviewerID', 'asin' and 'brand' columns need to be changed to dummy variables, so each unique 'reviewerID', 'asin' and 'brand' value gets its own column. This means that if 'reviewerID' has 100 unique values, you get 100 columns, where the value is 1 if the row belongs to that specific reviewer and 0 otherwise.

All other columns don't need to be reformatted, so the index for those columns can just be the column number.

So the first 3 rows of the above pandas DataFrame need to be transformed into the following output:

5 0:1 5:1 6:1 7:3 8:-1 9:0.000833 10:0.000833 ... 107:0.000833 108:0.000833
4 1:1 5:1 6:1 7:5 8:-2 9:0.000769 10:0.000769 ... 107:0.000769 108:0.000769
2 2:1 5:1 6:1 7:3 8:-2 9:0.000417 10:0.000417 ... 107:0.000417 108:0.000417
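One way this index layout could be produced by hand is sketched below. It assumes each dummy-encoded column gets a contiguous block of indices in order of first appearance, followed by one index per remaining column. The helper name `to_libfm_lines` is hypothetical, and the sample frame keeps only one of the LDA columns for brevity.

```python
import pandas as pd

def to_libfm_lines(df, target, categorical):
    """Build 'target idx:val ...' lines; categorical columns become 0/1 dummies."""
    codes, offset = {}, 0
    for col in categorical:
        # factorize numbers the unique values 0..k-1 in order of first appearance
        col_codes, uniques = pd.factorize(df[col])
        codes[col] = col_codes + offset
        offset += len(uniques)
    numeric = [c for c in df.columns if c != target and c not in categorical]
    num_index = {col: offset + k for k, col in enumerate(numeric)}

    lines = []
    for i in range(len(df)):  # plain Python loop: simple, but slow on large frames
        parts = [f"{float(df[target].iat[i]):.10g}"]
        parts += [f"{codes[col][i]}:1" for col in categorical]
        for col in numeric:
            v = float(df[col].iat[i])
            if v != 0:  # the format only lists non-zero entries
                parts.append(f"{num_index[col]}:{v:.10g}")
        lines.append(" ".join(parts))
    return lines

df = pd.DataFrame({
    "overall": [5.0, 4.0, 2.0, 1.0, 4.0],
    "reviewerID": ["A2XVJBSRI3SWDI", "A2G0LNLN79Q6HR", "A2R3K1KX09QBYP",
                   "A19PBP93OF896", "A1P0IHU93EF9ZK"],
    "asin": ["0000031887"] * 5,
    "brand": ["Boutique Cutie"] * 5,
    "Positive": [3.0, 5.0, 3.0, 2.0, 2.0],
    "Negative": [-1, -2, -2, -3, -2],
    "LDA_0": [0.000833, 0.000769, 0.000417, 0.000137, 0.000625],
})
lines = to_libfm_lines(df, "overall", ["reviewerID", "asin", "brand"])
```

The `.10g` format is an arbitrary choice that keeps the output compact. The row-by-row loop is only meant to illustrate the layout; for a million rows a sparse-matrix approach is likely to be far faster.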

In the LIBFM package there is a program that can transform User - Item - Rating data into the LIBFM format. However, this program cannot handle this many columns.

Is there an easy way to do this? I have 1 million rows in total.

The LibFM executable expects its input in the libSVM format that you have explained here. If the file converter in the LibFM package does not work for your data, try the scikit-learn function sklearn.datasets.dump_svmlight_file.

Ref: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.dump_svmlight_file.html
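As a rough sketch of how that suggestion might look for the DataFrame in the question: one-hot encode the three ID columns into a sparse matrix, append the numeric columns, and let dump_svmlight_file write the result. Using OneHotEncoder and scipy.sparse.hstack here is an assumption of this sketch, not something the answer prescribes, and the sample frame again keeps only one LDA column. The output is written to an in-memory buffer; a file path could be passed instead.

```python
import io
import pandas as pd
from scipy import sparse
from sklearn.datasets import dump_svmlight_file
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "overall": [5.0, 4.0, 2.0, 1.0, 4.0],
    "reviewerID": ["A2XVJBSRI3SWDI", "A2G0LNLN79Q6HR", "A2R3K1KX09QBYP",
                   "A19PBP93OF896", "A1P0IHU93EF9ZK"],
    "asin": ["0000031887"] * 5,
    "brand": ["Boutique Cutie"] * 5,
    "Positive": [3.0, 5.0, 3.0, 2.0, 2.0],
    "Negative": [-1, -2, -2, -3, -2],
    "LDA_0": [0.000833, 0.000769, 0.000417, 0.000137, 0.000625],
})
categorical = ["reviewerID", "asin", "brand"]
numeric = ["Positive", "Negative", "LDA_0"]

# One-hot encode the ID columns; the encoder returns a scipy sparse matrix
# and orders the categories of each column alphabetically.
X_cat = OneHotEncoder().fit_transform(df[categorical])
X_num = sparse.csr_matrix(df[numeric].to_numpy(dtype=float))
X = sparse.hstack([X_cat, X_num], format="csr")
y = df["overall"].to_numpy()

# Write in svmlight/libSVM format with zero-based feature indices.
buffer = io.BytesIO()
dump_svmlight_file(X, y, buffer, zero_based=True)
lines = buffer.getvalue().decode().splitlines()
```

Because the encoder sorts categories, the dummy indices will not match the first-appearance numbering used in the question, but each unique value still gets its own column. Keeping everything sparse means the one-hot block never has to be materialised densely, which matters with many unique reviewer IDs.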
