
Spark: create a csv file (must use Scala and DataFrame)

I've been learning Scala and DataFrames recently and I ran into a problem. It is about DataFrame operations. It must be solved using Scala and DataFrame, but NOT SparkSQL.

Problem:

  1. Create a csv file with 4 columns (person, class, subject, score) for a school and put some random data into the csv. Each person must have "Maths", "English", "Art" plus some other subjects, and there must be at least 3 classes.

  2. Write a Spark program to:

    • read a csv file

    • show the full data table

    • show how many persons per class

    • show the person and his score with the highest score in "Maths"

I have tried to solve it and googled it, but what I came up with uses SQL to solve it, and SQL is also the first solution Google gives.

I really don't know how to do it with Spark and DataFrame but NOT SparkSQL, even though the tutorial said it was a very easy question :(

Could anyone help me with it, for example by writing an example for me? Thank you so much, I would really appreciate it.

Sample csv file:

+-------+-------+---------+-------+   
| name  | class | subject | marks |
+-------+-------+---------+-------+
| ab    | 12    | Maths   | 72    |
+-------+-------+---------+-------+
| abc   | 12    | Maths   | 88    |
+-------+-------+---------+-------+
| abcd  | 11    | Arts    | 92    |
+-------+-------+---------+-------+
| abcde | 12    | English | 88    |
+-------+-------+---------+-------+
| bc    | 11    | Maths   | 99    |
+-------+-------+---------+-------+
| bcd   | 12    | English | 55    |
+-------+-------+---------+-------+
| bcde  | 11    | English | 77    |
+-------+-------+---------+-------+
| axax  | 10    | Maths   | 83    |
+-------+-------+---------+-------+
| amam  | 10    | English | 65    |
+-------+-------+---------+-------+
| arar  | 10    | Arts    | 66    |
+-------+-------+---------+-------+

  1. Read csv file:

    val df = spark.read.option("inferSchema","true").option("header","true").csv(filePath)

  2. Show the dataframe:

    df.show()

  3. Show how many persons per class:

    df.groupBy("class").count.show
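
    Note that groupBy("class").count counts rows. Since the problem says each person takes several subjects (so one row per subject), counting distinct names per class may be closer to "how many persons". A hedged alternative, assuming names identify persons in this csv:

    import org.apache.spark.sql.functions.countDistinct

    // count distinct person names per class instead of raw rows
    df.groupBy("class").agg(countDistinct("name").as("persons")).show()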

  4. Show the person and his score with the highest score in "Maths":

    df.filter(col("subject")==="Maths").orderBy(desc("marks")).limit(1).show

    Moreover, for the last question we can also filter on the class.
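
    If the highest Maths score is wanted per class rather than across the whole school, one possible sketch (not the only way) uses a window partitioned by class:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, desc, row_number}

    // rank Maths rows within each class by marks, then keep the top row of each class
    val byClass = Window.partitionBy("class").orderBy(desc("marks"))
    df.filter(col("subject") === "Maths")
      .withColumn("rn", row_number().over(byClass))
      .filter(col("rn") === 1)
      .drop("rn")
      .show()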
