
How to pad output fields in a file with Scala Spark?

I have a text file. I want to pad the fields of each record in the output, as shown in Exp1 and Exp2 below. What should I do? This is my input:

a
a a
a a a
a a a a
a a a a a

Exp1. When a record in the file has fewer than n=4 fields, fill the remaining fields with the _ character.

a _ _ _
a a _ _
a a a _
a a a a
a a a a a

Exp2. Same as above, but in addition, when the number of fields in a record exceeds n=4, delete the fields after the fourth.

a _ _ _
a a _ _
a a a _
a a a a
a a a a

My code:

val df = spark.read.text("data.txt")
val result = df.columns.foldLeft(df) { (newdf, colname) =>
  newdf.withColumnRenamed(colname, colname.replace("a", "_"))
}
result.show()

This resembles a homework-style problem, so I will guide you based on the code you provided and try to point you in the right direction.

Your current code only changes the names of the columns. In this case, the column name "value" is changed to "v_lue". What you want is to change the actual records themselves.
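To see why only the name changes, here is a minimal sketch in plain Scala (no Spark needed). It assumes the DataFrame has the single column "value", which is what spark.read.text produces; RenameDemo is a hypothetical name used only for illustration.

```scala
object RenameDemo {
  // Assumption: these are the column names of a DataFrame read with spark.read.text
  val columns: Array[String] = Array("value")

  // The same string replacement the question applies to each column name
  val renamed: Array[String] = columns.map(_.replace("a", "_"))
}
```

The replacement acts on the column name string, so the rows of the DataFrame are never touched.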

First, you want to read this data into an RDD. It can be done with a DataFrame, but mapping over plain line strings instead of Row objects might make this easier to understand conceptually. I'll get you started.

val data = sc.textFile("data.txt")

Here data is an RDD of strings, where each element is one line of the data file.

We want to map this data to some new data, transforming each row.

data.map(row => {
   // transform each row here
})

Inside this map we make some change to row, which is a string. The code inside is applied to every string in the RDD.

You will probably want to split the row to get an array of strings, so that you can count how many fields ('a's) it contains. Depending on the size of the array, you will want to build a new string and return it from this map. If there are fewer fields than n, you will probably want to append enough '_'s to reach n. If there are too many, you will probably want to return a string with only the first n fields.
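The steps above could be sketched as follows. This is one possible approach rather than the only one, written as plain Scala functions so it can be tried without a cluster. The names FieldPadding, padFields and padAndTruncate are hypothetical, and the sketch assumes fields are separated by whitespace and that '_' is the filler.

```scala
object FieldPadding {
  // Split a line into its whitespace-separated fields, dropping empty pieces
  private def fieldsOf(row: String): Array[String] =
    row.trim.split("\\s+").filter(_.nonEmpty)

  // Exp1: pad with the filler up to n fields; longer records are left unchanged
  def padFields(row: String, n: Int, filler: String = "_"): String =
    fieldsOf(row).padTo(n, filler).mkString(" ")

  // Exp2: pad as above, then keep only the first n fields
  def padAndTruncate(row: String, n: Int, filler: String = "_"): String =
    fieldsOf(row).padTo(n, filler).take(n).mkString(" ")
}
```

With the RDD from earlier, this could be applied as data.map(row => FieldPadding.padFields(row, 4)) for Exp1, or data.map(row => FieldPadding.padAndTruncate(row, 4)) for Exp2.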

Hope this helps.
