

How to sum a CSV file column with Spark (Python)

I'm new to Spark and I have some data to work with. I want to compute the sum of a column in a CSV file whose header is ([column1],[column2],[column3]). What I'm trying to calculate is the sum of column3 grouped by column1 (column1 is the date, column2 is the category, and column3 is the number of occurrences of that category on that date, so I want the total over all categories for each date). I have tried this code:

    from pyspark import SparkContext, SparkConf
    if __name__ == "__main__":
        conf = SparkConf().setAppName("sum").setMaster("local[3]")
        sc = SparkContext(conf = conf)
        line.split(",")).map(lambda line: (line[0:1]+line[3:4]))
        text_file = sc.textFile("in/fileinput.CSV")
        counts = text_file.flatMap(lambda line: line.split(",")) \
             .map(lambda line: (line[0:1],line[2:3])) \
             .reduceByKey(lambda a, b: a + b)
        counts.saveAsTextFile("out/fileoutput.txt")

Thank you in advance (excuse my English).
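
For reference, a minimal sketch of how the RDD attempt above could be corrected, assuming the file really is comma-separated with a header line (column1,column2,column3) and that the third column holds integer counts (paths and app name are taken from the question):

    from pyspark import SparkContext, SparkConf

    if __name__ == "__main__":
        conf = SparkConf().setAppName("sum").setMaster("local[3]")
        sc = SparkContext(conf=conf)

        text_file = sc.textFile("in/fileinput.CSV")
        header = text_file.first()  # skip the header row
        counts = (text_file.filter(lambda line: line != header)
                  .map(lambda line: line.split(","))                 # one record per line (map, not flatMap)
                  .map(lambda fields: (fields[0], int(fields[2])))   # (date, occurrences)
                  .reduceByKey(lambda a, b: a + b))                  # sum occurrences per date
        counts.saveAsTextFile("out/fileoutput.txt")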

Please try the steps below to achieve the desired result.

  1. Read the CSV file as a DataFrame.

    df = spark.read.csv("path_to_csv_file", header=True, inferSchema=True)

  2. Group the data by column 1.

    group_df = df.groupBy("Column_1")

  3. Take the sum of the 3rd column on the grouped data (note that sum here is pyspark.sql.functions.sum, not the Python builtin, so it needs to be imported).

    result_df = group_df.agg(sum("column_3").alias("SUM"))

  4. Display the data (see the consolidated sketch after this list).

    result_df.show()
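
Putting the four steps together, a minimal end-to-end sketch (assuming the CSV header names are column1, column2, column3 as in the question, and that the third column holds integer counts):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # 1. Start a session and read the CSV with its header, letting Spark infer the types
    spark = SparkSession.builder.appName("sum").getOrCreate()
    df = spark.read.csv("in/fileinput.CSV", header=True, inferSchema=True)

    # 2. + 3. Group by the date column and sum the occurrence column per date
    result_df = df.groupBy("column1").agg(F.sum("column3").alias("SUM"))

    # 4. Display the result
    result_df.show()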

Hope it helps.

Note: for more information on the CSV reader, refer to https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv

Regards,

Neeraj
