Append a column to DataFrame in Apache Spark 1.4 in Java
I am trying to add a column to my DataFrame that serves as a unique ROW_ID. So, it would be something like this:

1, user1
2, user2
3, user3
...

I could have done this easily using a HashMap with an incrementing integer, but I can't do this in Spark using the map function on the DataFrame, since I can't have an integer increasing inside the map function. Is there any way I can do this by appending a column to my existing DataFrame, or in any other way?

PS: I know there is a very similar post, but that's for Scala and not Java.
Thanks in advance
I did it by adding a new column containing UUIDs to the DataFrame.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.UUID;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Build the new schema: all existing fields plus one nullable string
// column whose name is held in leftCol.
StructType objStructType = inputDataFrame.schema();
List<StructField> newFields =
        new ArrayList<StructField>(Arrays.asList(objStructType.fields()));
newFields.add(DataTypes.createStructField(leftCol, DataTypes.StringType, true));

// Copy each row's values and append a random UUID as the last value.
final int size = objStructType.size();
JavaRDD<Row> rowRDD = inputDataFrame.javaRDD().map(new Function<Row, Row>() {
    private static final long serialVersionUID = 3280804931696581264L;

    public Row call(Row tblRow) throws Exception {
        Object[] newRow = new Object[size + 1];
        for (int itr = 0; itr < tblRow.length(); itr++) {
            if (tblRow.apply(itr) != null) {
                newRow[itr] = tblRow.apply(itr);
            }
        }
        newRow[size] = UUID.randomUUID().toString();
        return RowFactory.create(newRow);
    }
});

// Rebuild the DataFrame with the extended schema.
inputDataFrame = objsqlContext.createDataFrame(rowRDD, DataTypes.createStructType(newFields));
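Note that this gives each row a unique string ID rather than an increasing integer: the UUIDs are generated independently inside the map, so no shuffle or cross-partition coordination is needed, but the IDs carry no ordering.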
OK, I found the solution to this problem and I'm posting it in case someone else has the same problem:
The way to do this is zipWithIndex from JavaRDD():
df.javaRDD().zipWithIndex().map(new Function<Tuple2<Row, Long>, Row>() {
    @Override
    public Row call(Tuple2<Row, Long> v1) throws Exception {
        return RowFactory.create(v1._1().getString(0), v1._2());
    }
})
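The snippet above yields a JavaRDD<Row>, not a DataFrame, so a schema still has to be supplied to get a DataFrame back. A minimal sketch of that last step, assuming the mapped RDD is stored in a variable indexedRDD and a SQLContext named sqlContext is in scope (the column names name and row_id are my own, illustrative choices):

import java.util.Arrays;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Schema for the two-column result produced by the map above:
// the first column of the original row plus the generated index.
StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("name", DataTypes.StringType, true),
        DataTypes.createStructField("row_id", DataTypes.LongType, false)));

DataFrame withRowId = sqlContext.createDataFrame(indexedRDD, schema);

Also note that zipWithIndex launches a Spark job to compute partition sizes when the RDD has more than one partition, since a row's index depends on how many rows precede it in earlier partitions.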