Apache Spark: How to join multiple columns (features) from csv with Tokenizer in Java?
I have a CSV file with three columns: Id, Main_user and Users. Id is the label and the other two columns are the features. I want to load the two features (Main_user and Users) from the CSV, vectorize them and assemble them into one vector. After using HashingTF as described in the documentation, how do I add a second feature, "Main_user", in addition to the feature "Users"?
DataFrame df = (new CsvParser()).withUseHeader(true).csvFile(sqlContext, csvFile);
Tokenizer tokenizer = new Tokenizer().setInputCol("Users").setOutputCol("words");
DataFrame wordsData = tokenizer.transform(df);
int numFeatures = 20;
HashingTF hashingTF = new HashingTF().setInputCol("words")
.setOutputCol("rawFeatures").setNumFeatures(numFeatures);
OK, I found a solution: process the columns one after another, tokenizing each and applying HashingTF to it, and at the end assemble the resulting vectors. I would appreciate any improvements to this.
DataFrame df = (new CsvParser()).withUseHeader(true).csvFile(sqlContext, csvFile);
Tokenizer tokenizer = new Tokenizer();
HashingTF hashingTF = new HashingTF();
int numFeatures = 35;
tokenizer.setInputCol("Users")
.setOutputCol("Users_words");
DataFrame df1 = tokenizer.transform(df);
hashingTF.setInputCol("Users_words")
.setOutputCol("rawUsers").setNumFeatures(numFeatures);
DataFrame featurizedData1 = hashingTF.transform(df1);
tokenizer.setInputCol("Main_user")
.setOutputCol("Main_user_words");
DataFrame df2 = tokenizer.transform(featurizedData1);
hashingTF.setInputCol("Main_user_words")
.setOutputCol("rawMain_user").setNumFeatures(numFeatures);
DataFrame featurizedData2 = hashingTF.transform(df2);
// Now assemble the two hashed feature columns into a single vector column
VectorAssembler assembler = new VectorAssembler()
.setInputCols(new String[]{"rawUsers", "rawMain_user"})
.setOutputCol("assembledVector");
DataFrame assembledFeatures = assembler.transform(featurizedData2);
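To make it clearer what the assembled column contains, here is a minimal plain-Java sketch of the same three steps (tokenize, hash into term frequencies, concatenate) without Spark. It is only an illustration under stated assumptions: the class name HashedFeatureSketch, its helper methods and the sample strings are hypothetical, and String.hashCode() is used as a stand-in for whatever hash function Spark's HashingTF applies internally, so the bucket indices will not match Spark's output.

```java
import java.util.Arrays;

public class HashedFeatureSketch {

    // Lowercase and split on whitespace, similar in spirit to Spark's Tokenizer.
    public static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Map each token to one of numFeatures buckets and count occurrences.
    // Assumption: String.hashCode() replaces Spark's own hash function here.
    public static double[] hashingTF(String[] tokens, int numFeatures) {
        double[] vector = new double[numFeatures];
        for (String token : tokens) {
            int bucket = Math.floorMod(token.hashCode(), numFeatures);
            vector[bucket] += 1.0;
        }
        return vector;
    }

    // Concatenate the vectors in the given order, which is what assembling
    // several vector columns into one amounts to.
    public static double[] assemble(double[]... parts) {
        int total = 0;
        for (double[] part : parts) {
            total += part.length;
        }
        double[] out = new double[total];
        int offset = 0;
        for (double[] part : parts) {
            System.arraycopy(part, 0, out, offset, part.length);
            offset += part.length;
        }
        return out;
    }

    public static void main(String[] args) {
        int numFeatures = 35; // same value as in the answer above

        // Hypothetical sample values for the "Users" and "Main_user" columns.
        double[] users = hashingTF(tokenize("alice bob alice"), numFeatures);
        double[] mainUser = hashingTF(tokenize("carol"), numFeatures);

        double[] assembled = assemble(users, mainUser);
        System.out.println("Assembled length: " + assembled.length);
        System.out.println(Arrays.toString(assembled));
    }
}
```

With numFeatures set to 35 for both columns, the assembled vector has 70 entries: the first 35 come from "Users" and the last 35 from "Main_user". Because each column is hashed into its own block, the same name appearing in both columns does not collide across features.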