
How to add a new column to a Spark dataframe that depends on multiple existing columns?

I am going to add a new column to a dataframe. For example, I have a dataframe df:

    id|c_1      |c_2          |c_3  |c_4    |.......|c_200    |c_tot
    1 |[1,2,3,5]|[t,4,bv,55,2]|[]   |[1,22] |       |[k,v,c,x]|[1,2,3,4,5,t,bv,55,22,k,v,c,x]
    2 |[1,2,4]  |[4,3,8]      |[6,7]|[10,12]|       |[11]     |[1,2,3,4,6,7,8,10,11,12]
    .
    .

I want to compute some statistics from my dataframe. For example, I want a new column that contains the entropy for each id, so we must calculate pi for each c_i and then calculate the entropy:

    pi = (size(c_i) + 1) / (size(c_tot) + 1)
    Entropy = -sum(pi * ln(pi))   // i in [1, 200]

For example, the first value of the new entropy column must be:

    entropy=-((5/14*ln(5/14))+(6/14*ln(6/14))+(1/14*ln(1/14)).... +(5/14)*ln(5/14))
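To sanity-check the arithmetic, here is a minimal Python sketch of the same formula. The per-column sizes are hypothetical (only three columns instead of 200), taken from the first row of the example, where size(c_tot) = 13:

```python
import math

# Hypothetical per-column sizes for one row; the real question has
# 200 columns (c_1 .. c_200), here we use the first three of row 1.
col_sizes = [4, 5, 0]   # size(c_1), size(c_2), size(c_3)
tot_size = 13           # size(c_tot)

def entropy(sizes, tot):
    # pi = (size(c_i) + 1) / (size(c_tot) + 1)
    pis = [(s + 1) / (tot + 1) for s in sizes]
    # Entropy = -sum(pi * ln(pi))
    return -sum(p * math.log(p) for p in pis)

print(entropy(col_sizes, tot_size))
```

The `+1` smoothing in both numerator and denominator keeps ln well-defined even for empty arrays such as c_3 above.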

I know that I can work with an expression, but I cannot come up with the expression because I have multiple columns.

Your expression can be slightly simplified by pulling the common factor 1/(size(c_tot)+1) out of the sum:

    Entropy = -( sum_{i=1..200} (size(c_i)+1) * ln( (size(c_i)+1) / (size(c_tot)+1) ) ) / (size(c_tot)+1)

To generate that in Scala:

    // Builds "-( term_1 + term_2 + ... + term_200 ) / (size(c_tot) + 1)"
    val entropy = (1 to 200)
      .map(c => s" ( size(c_$c) + 1 ) * ln( (size(c_$c) + 1) / (size(c_tot) + 1) ) ")
      .mkString("-(", "+", ") / (size(c_tot) + 1) ")

And then use it with expr:

    import org.apache.spark.sql.functions.expr

    df.withColumn("entropy", expr(entropy))
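Because the expression is just a string handed to Spark's SQL parser, its construction can be inspected outside Spark. Here is a pure-Python sketch that mirrors the same map/join, with a hypothetical 3 columns instead of 200 to keep the output short, and the divisor written as size(c_tot) + 1 to match the pi definition:

```python
n_cols = 3  # hypothetical; the question uses 200

# One term per column: (size(c_i) + 1) * ln((size(c_i) + 1) / (size(c_tot) + 1))
terms = [
    f" ( size(c_{c}) + 1 ) * ln( (size(c_{c}) + 1) / (size(c_tot) + 1) ) "
    for c in range(1, n_cols + 1)
]

# Equivalent of Scala's mkString(prefix, separator, suffix)
entropy_expr = "-(" + "+".join(terms) + ") / (size(c_tot) + 1) "

print(entropy_expr)
```

Printing the string before passing it to `expr` is a quick way to catch malformed SQL (unbalanced parentheses, wrong column names) without waiting for a Spark job to fail at analysis time.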



 