
How to convert array type of dataset into string type in Apache Spark Java

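I have an array-type column in my dataset that needs to be converted to a string type. I tried the conventional way, but I feel it can be done better. Could you please guide me?

Input dataset 1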

    +---------------------------+-----------+-------------------------------------------------------------------------------------------------+
    |ManufacturerSource         |upcSource  |productDescriptionSource                                                                         |
    +---------------------------+-----------+-------------------------------------------------------------------------------------------------+
    |3M                         |51115665883|[c, gdg, whl, t27, 5, x, 1, 4, x, 7, 8, grindig, flap, wheels, 36, grit, 12, 250, rpm]           |
    |3M                         |51115665937|[c, gdg, whl, t27, q, c, 6, x, 1, 4, x, 5, 8, 11, grinding, flap, wheels, 36, grit, 10, 200, rpm]|
    |3M                         |0          |[3mite, rb, cloth, 3, x, 2, wd]                                                                  |
    |3M                         |0          |[trizact, disc, cloth, 237aaa16x5, hole]                                                         |
    +---------------------------+-----------+-------------------------------------------------------------------------------------------------+

Expected output dataset

    +---------------------------+-----------+--------------------------------------------------------------------------+
    |ManufacturerSource         |upcSource  |productDescriptionSource                                                  |
    +---------------------------+-----------+--------------------------------------------------------------------------+
    |3M                         |51115665883|c gdg whl t27 5 x 1 4 x 7 8 grinding flap wheels 36 grit 12 250 rpm       |
    |3M                         |51115665937|c gdg whl t27 q c 6 x 1 4 x 5 8 11 grinding flap wheels 36 grit 10 200 rpm|
    |3M                         |0          |3mite rb cloth 3 x 2 wd                                                   |
    |3M                         |0          |trizact disc cloth 237aaa16x5 hole                                        |
    +---------------------------+-----------+--------------------------------------------------------------------------+

Conventional approach 1

    // Keep only the array column.
    Dataset<Row> afterstopwordsRemoved =
            stopwordsRemoved.select("productDescriptionSource");
    stopwordsRemoved.show();

    // Pulls every row to the driver, then joins each array locally.
    List<Row> individualRows = afterstopwordsRemoved.collectAsList();

    System.out.println("After flatmap\n");
    for (Row individualRow : individualRows) {
        List<String> temp = individualRow.getList(0);
        System.out.println(String.join(" ", temp));
    }

Approach 2 (did not produce a result)

Exception: Failed to execute user defined function($anonfun$27: (array<string>) => string)

    UDF1 untoken = new UDF1<String,String[]>() {
        public String call(String[] token) throws Exception {
            //return types.replaceAll("[^a-zA-Z0-9\\s+]", "");
            return Arrays.toString(token);
        }

        @Override
        public String[] call(String t1) throws Exception {
            // TODO Auto-generated method stub
            return null;
        }
    };

    sqlContext.udf().register("unTokenize", untoken, DataTypes.StringType);

    source.createOrReplaceTempView("DataSetOfTokenize");
    Dataset<Row> newDF = sqlContext.sql("select *,unTokenize(productDescriptionSource) FROM DataSetOfTokenize");
    newDF.show(4000,false);

I would use concat_ws:

    sqlContext.sql("select *, concat_ws(' ', productDescriptionSource) FROM DataSetOfTokenize");

or:

    import static org.apache.spark.sql.functions.*;

    df.withColumn("foo", concat_ws(" ", col("productDescriptionSource")));
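To make the per-row effect concrete, here is a minimal plain-Java sketch (no Spark needed) of what joining an array of tokens with a single-space separator produces, next to what `Arrays.toString` from approach 2 would give. The class `ConcatWsSketch` and the helper `joinTokens` are hypothetical names made up for this illustration; they are not part of Spark. The null-skipping mirrors how `concat_ws` is documented to behave, as far as I know.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class ConcatWsSketch {

    // Hypothetical helper, not a Spark API: joins the non-null tokens of one
    // row with the given separator, which is the per-row result one would
    // expect from concat_ws on an array<string> column.
    public static String joinTokens(String separator, List<String> tokens) {
        return tokens.stream()
                .filter(Objects::nonNull)
                .collect(Collectors.joining(separator));
    }

    public static void main(String[] args) {
        // One row of productDescriptionSource from the question.
        List<String> row = Arrays.asList("trizact", "disc", "cloth", "237aaa16x5", "hole");

        // Prints: trizact disc cloth 237aaa16x5 hole
        System.out.println(joinTokens(" ", row));

        // For comparison, Arrays.toString keeps the brackets and commas.
        // Prints: [trizact, disc, cloth, 237aaa16x5, hole]
        System.out.println(Arrays.toString(row.toArray(new String[0])));
    }
}
```

So even if the UDF in approach 2 were given a parameter type that Spark accepts (my understanding is that Spark passes an array column to a Java UDF as a Scala sequence rather than a `String[]`, which would likely explain the exception), `Arrays.toString` would still not give the expected output. The built-in `concat_ws` avoids both issues and keeps the work on the executors instead of collecting rows to the driver.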
