
Apache Spark: How to join multiple columns (features) from csv with Tokenizer in Java?

I have a csv file with three columns: Id, Main_user and Users. Id is the label, and the other two columns are features. I want to load the two feature columns (Main_user and Users) from the csv, vectorize them, and assemble them into one vector. After using HashingTF as described in the documentation, how do I add the second feature, "Main_user", in addition to "Users"?

// Read the csv (with header) into a DataFrame using spark-csv's CsvParser
DataFrame df = (new CsvParser()).withUseHeader(true).csvFile(sqlContext, csvFile);

// Tokenize the "Users" column into an array-of-words column
Tokenizer tokenizer = new Tokenizer().setInputCol("Users").setOutputCol("words");
DataFrame wordsData = tokenizer.transform(df);

// Hash the tokens into a fixed-size term-frequency vector
int numFeatures = 20;
HashingTF hashingTF = new HashingTF().setInputCol("words")
        .setOutputCol("rawFeatures").setNumFeatures(numFeatures);

OK, I found a solution: load the columns one after another, tokenize and hash each with HashingTF, and at the end assemble them with a VectorAssembler. I would appreciate any improvements to this.

// Read the csv (with header) into a DataFrame
DataFrame df = (new CsvParser()).withUseHeader(true).csvFile(sqlContext, csvFile);

Tokenizer tokenizer = new Tokenizer();
HashingTF hashingTF = new HashingTF();
int numFeatures = 35;

// Tokenize and hash the "Users" column
tokenizer.setInputCol("Users")
        .setOutputCol("Users_words");
DataFrame df1 = tokenizer.transform(df);
hashingTF.setInputCol("Users_words")
        .setOutputCol("rawUsers").setNumFeatures(numFeatures);
DataFrame featurizedData1 = hashingTF.transform(df1);

// Reuse the same Tokenizer/HashingTF instances for the "Main_user" column
tokenizer.setInputCol("Main_user")
        .setOutputCol("Main_user_words");
DataFrame df2 = tokenizer.transform(featurizedData1);
hashingTF.setInputCol("Main_user_words")
        .setOutputCol("rawMain_user").setNumFeatures(numFeatures);
DataFrame featurizedData2 = hashingTF.transform(df2);

// Now assemble the two raw feature vectors into a single vector column
VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"rawUsers", "rawMain_user"})
        .setOutputCol("assembledVector");

DataFrame assembledFeatures = assembler.transform(featurizedData2);
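Since the answer above explicitly asks for improvements: one common refactoring is to give each column its own Tokenizer/HashingTF stage and wire all five stages into a single ml `Pipeline`, instead of mutating shared instances and threading intermediate DataFrames by hand. A sketch, assuming the same column names, `df`, and Spark 1.x DataFrame API as above (not the poster's verified code):

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.DataFrame;

// One independent stage per column, so no setInputCol/setOutputCol mutation is needed.
Tokenizer usersTokenizer = new Tokenizer()
        .setInputCol("Users").setOutputCol("Users_words");
Tokenizer mainUserTokenizer = new Tokenizer()
        .setInputCol("Main_user").setOutputCol("Main_user_words");

int numFeatures = 35;
HashingTF usersTF = new HashingTF()
        .setInputCol("Users_words").setOutputCol("rawUsers").setNumFeatures(numFeatures);
HashingTF mainUserTF = new HashingTF()
        .setInputCol("Main_user_words").setOutputCol("rawMain_user").setNumFeatures(numFeatures);

VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"rawUsers", "rawMain_user"})
        .setOutputCol("assembledVector");

// The Pipeline runs the stages in order and handles the intermediate DataFrames itself.
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{
        usersTokenizer, mainUserTokenizer, usersTF, mainUserTF, assembler});
PipelineModel model = pipeline.fit(df);
DataFrame assembledFeatures = model.transform(df);
```

This keeps every stage's configuration in one place, and the fitted `PipelineModel` can later be applied unchanged to new data with the same schema.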

