Inventory profit calculation in Spark with Scala

Question

Consider a table of 5-tuples (PRODUCT_ID, TRANSACTION_TYPE, QUANTITY, PRICE, DATE). TRANSACTION_TYPE is either "Buy" or "Sell". QUANTITY is the number of units of the product bought or sold at the PRICE indicated, on the given DATE.

A product that is sold is offset against the inventory already in hand, starting with the earliest lot of that inventory.

Net profit is calculated by offsetting the sold inventory against the earliest bought inventory; if that does not fully cover the sale, the next bought inventory is used, and so on.

For instance, consider the following table values:

1, Buy, 10, 100.0, Jan 1

2, Buy, 20, 200.0, Jan 2

1, Buy, 15, 150.0, Jan 3

1, Sell, 5, 120.0, Jan 5

1, Sell, 10, 125.0, Jan 6

There are already hundreds of files with the schema shown above stored on HDFS.

The profit calculation should then work as follows:

When Product 1 is sold on Jan 5, those 5 units should be offset against the Jan 1 Buy transaction first, resulting in a profit of 5*(120.0-100.0) = 100.
When Product 1 is sold again on Jan 6, the units sold exceed what remains of the Jan 1 Buy lot, so the Jan 3 Buy lot covers the remainder.
That is, the profit from selling Product 1 on Jan 6 is 5*(125.0-100.0) + 5*(125.0-150.0).
So the profit for the Jan 6 transaction is 5*(25) + 5*(-25) = 125 - 125 = 0, and the net profit up to Jan 6 is 100 (from the Jan 5 transaction) + 0 (from the Jan 6 transaction) = 100.
Calculate the final profit as of the last date present in the data.
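
The matching rule described above can be sketched without Spark for a single product. This is only an illustration under stated assumptions: the names `Txn`, `Lot`, `FifoProfit`, `consume` and `profit` are hypothetical and do not come from the post, the transactions are assumed to be already sorted by date, and units sold beyond the available inventory are simply ignored.

```scala
import scala.annotation.tailrec

// Hypothetical types for illustration; they do not appear in the original post.
final case class Txn(productId: Int, kind: String, quantity: Long, price: Double)
final case class Lot(quantity: Long, price: Double)

object FifoProfit {
  // Match `qty` sold units against the oldest open lots first.
  // Returns the lots still open and the profit accumulated for this sale.
  @tailrec
  private def consume(lots: List[Lot], qty: Long, price: Double, acc: Double): (List[Lot], Double) =
    lots match {
      case Lot(lotQty, lotPrice) :: rest if qty > 0 =>
        val matched = math.min(qty, lotQty)
        val gain = acc + matched * (price - lotPrice)
        // Sale fully matched: keep the unused part of this lot at the front.
        if (matched < lotQty) (Lot(lotQty - matched, lotPrice) :: rest, gain)
        // Lot exhausted: continue with the next oldest lot.
        else consume(rest, qty - matched, price, gain)
      case _ => (lots, acc) // nothing left to sell, or no inventory left (assumption: ignore)
    }

  // Net profit for ONE product; `txns` must already be in date order.
  def profit(txns: Seq[Txn]): Double =
    txns.foldLeft((List.empty[Lot], 0.0)) {
      case ((lots, total), Txn(_, "Buy", q, p)) => (lots :+ Lot(q, p), total)
      case ((lots, total), Txn(_, "Sell", q, p)) =>
        val (left, gain) = consume(lots, q, p, 0.0)
        (left, total + gain)
      case (state, _) => state // unknown transaction types are skipped
    }._2

  def main(args: Array[String]): Unit = {
    // Product 1 rows from the sample table, in date order.
    val product1 = Seq(
      Txn(1, "Buy", 10, 100.0),
      Txn(1, "Buy", 15, 150.0),
      Txn(1, "Sell", 5, 120.0),
      Txn(1, "Sell", 10, 125.0)
    )
    println(profit(product1)) // 100.0
  }
}
```

Keeping the open lots in a list ordered oldest-first makes the "earliest lot first" rule explicit: a sale only ever touches the head of the list.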

Below is my code snippet, but it does not work: it throws a NullPointerException. Any better suggestions?

import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.rdd._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row


case class Inventory(PRODUCT_ID: Int, TRANSACTION_TYPE: String, QUANTITY: Long, PRICE: Double, DATE: String)

object MatchingInventory{
    def main(args:Array[String])= {

        val conf = new SparkConf().setAppName("XYZ")
        val sc = new SparkContext(conf)


        val sqlcontext = new SQLContext(sc)
        // Create a schema RDD of Inventory objects from the data, which may span any number of text files.
        import sqlcontext.implicits._
        val dfInvent= sc.textFile("Invent.txt")
        .map(_.split(","))
        .map(p => Inventory(p(0).trim.toInt, p(1).trim, p(2).trim.toLong, p(3).trim.toDouble, p(4).trim))
        .toDF().cache()
        dfInvent.show()

        val idDF =  dfInvent.map{row => row.getInt(0)}.distinct 
        //idDF.show()
        val netProfit = sc.accumulator(0.0)
        idDF.foreach{id =>
        val sellDF = dfInvent.filter((dfInvent("PRODUCT_ID").contains(id)) && (dfInvent("TRANSACTION_TYPE").contains("Sell")))
        val buyDF = dfInvent.filter((dfInvent("PRODUCT_ID").contains(id)) && (dfInvent("TRANSACTION_TYPE").contains("Buy")))    
         var soldQ:Long = sellDF.map{row => row.getLong(2)}.reduce(_+_) 
         var sellPrice:Double = sellDF.map{row => row.getLong(2)*row.getDouble(3)}.reduce(_+_) //reduce sends the result back to driver
         var profit:Double = 0.0
         // profit for each bought item
         buyDF.foreach{row => 
                           if((soldQ > 0) && (soldQ < row.getLong(2))){profit += sellPrice -(soldQ*row.getDouble(3));soldQ = 0}
                           else if((soldQ > 0) && (soldQ > row.getLong(2))){profit += sellPrice - (row.getLong(2)*row.getDouble(3));soldQ = soldQ - row.getLong(2)}
                                else{}} 
        netProfit += profit}
        println("Inventory net Profit" + netProfit)
    }

}

Answer 1

I tried something like this. The code works; the only issue is that I use collect at a later stage to match buys against sells, which will lead to memory problems on large data.

from pyspark.sql import  SQLContext
from pyspark import SparkConf
from pyspark import SparkContext
import sys
from pyspark.sql.functions import *

if __name__ == "__main__":

    sc = SparkContext()

    sqlContext = SQLContext(sc)
    df = sqlContext.read.format('com.databricks.spark.csv').options(header='false', inferschema='true').load('test.csv')

    df = df.withColumn("C1", ltrim(df.C1))

    df.registerTempTable("tempTable")
    df = sqlContext.sql("select * from tempTable order by C0")

    dt = df.map(lambda s: (str(s[0])+'-'+ s[1], str(s[2]) + ',' +str(s[3])))
    dt = dt.reduceByKey(lambda a, b : a + '-' + b)

    ds = dt.collect()

    dicTran = {}
    for x in ds:
        key = (x[0].split('-'))[0]
        tratype = (x[0].split('-'))[1]


        val = {}
        if key in dicTran:
            val = dicTran[key]

        val[tratype] = x[1]
        dicTran[key] = val

    profit = 0

    for key, value in dicTran.iteritems():
        if 'Sell' in value:
            buy = value['Buy']
            sell = value['Sell']

            ls = sell.split('-')
            sellAmount = 0
            sellquant = 0
            for x in ls:
                y = x.split(',')
                sellAmount= sellAmount + float(y[0]) * float(y[1])
                sellquant = sellquant + float(y[0])

            lb = buy.split('-')
            for x in lb:
                y = x.split(',')

                if float(y[0]) >= sellquant:
                    profit += sellAmount - sellquant * float(y[1])
                    # the remaining sold units are covered by this lot; stop so later lots are not counted again
                    break
                else:
                    sellAmount -= float(y[0]) * float(y[1])
                    sellquant -= float(y[0])

    print 'profit', profit

Here is the logic I had in mind:

1) For all rows with the same ID and transaction type, I concatenate the quantity and price using a separator.
2) Then I collect the results and split them to calculate the profit.

I know this will crash on large data sets because collect is used, but I could not think of anything better. I will try out your solution as well.

Answer 2

So here is the solution I came up with:

import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.rdd._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import java.text.SimpleDateFormat
import java.sql.Date
import scala.math.Ordering


//Defining Schema
case class Inventory(PRODUCT_ID: Int, TRANSACTION_TYPE: String, QUANTITY: Long, PRICE: Double, pDate:java.sql.Date)


object MatchingInventory{
    def main(args:Array[String])= {

        val conf = new SparkConf().setAppName("XYZ")
        val sc = new SparkContext(conf)


        val sqlcontext = new SQLContext(sc)

        import sqlcontext.implicits._

        val format = new SimpleDateFormat("MMM d")
        //Read data from a directory which has multiple files
        val dfInvent= sc.textFile("data/*.txt")
        .map(_.split(","))
        .map(p => Inventory(p(0).trim.toInt, p(1).trim, p(2).trim.toLong, p(3).trim.toDouble, new Date(format.parse(p(4)).getTime)))
        .cache()

        //Profit for a single product; data must already be sorted by date
        def calculateProfit(data:Iterable[Inventory]):Double  = {
            var soldQ:Long = 0
            var sellPrice:Double = 0
            var profit:Double = 0
            val v = data

            for(i <- v ){
                if(i.TRANSACTION_TYPE == "Sell")
                {
                  soldQ = soldQ + i.QUANTITY
                  profit = profit+ i.PRICE*i.QUANTITY

                }
            }

            for(i <- v){
                if(i.TRANSACTION_TYPE == "Buy")
                {
                    if((soldQ > 0) && (soldQ <= i.QUANTITY)){profit = profit -(soldQ*i.PRICE);soldQ = 0}
                    else if((soldQ > 0) && (soldQ > i.QUANTITY)){profit = profit - (i.QUANTITY*i.PRICE);soldQ = soldQ - i.QUANTITY}
                    else{}
                }
            }
           profit
        }

        val key: RDD[((Int), Iterable[Inventory])] = dfInvent.keyBy(r => (r.PRODUCT_ID)).groupByKey
        val values: RDD[((Int), List[Inventory])] = key.mapValues(v => v.toList.sortBy(_.pDate.getTime))


        val pro = values.map{ case(k,v) => (k, calculateProfit(v))}
        val netProfit = pro.map{ case(k,v) => v}.reduce(_+_)
        println("Inventory NetProfit" + netProfit)

    }

}
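
The aggregate form used in calculateProfit (total sale revenue minus the cost of the earliest bought units) gives the same total as matching each sale lot by lot, as long as every sale is covered by purchases made before it. A minimal Spark-free check on the Product 1 sample rows might look like the following; the object and value names are hypothetical and chosen only for illustration.

```scala
// Hypothetical, Spark-free check of the aggregate calculation on the sample rows.
object AggregateCheck {
  // (quantity, unit price); buys are listed in date order.
  val buys: List[(Long, Double)]  = List((10L, 100.0), (15L, 150.0))
  val sells: List[(Long, Double)] = List((5L, 120.0), (10L, 125.0))

  // Total revenue from all sales: 5*120 + 10*125.
  val revenue: Double = sells.map { case (q, p) => q * p }.sum
  // Total number of units sold.
  val soldQty: Long = sells.map(_._1).sum

  // Cost of the earliest `soldQty` bought units, walking the buys oldest first.
  val cost: Double = buys.foldLeft((0.0, soldQty)) { case ((c, left), (q, p)) =>
    val used = math.min(left, q)
    (c + used * p, left - used)
  }._1

  val profit: Double = revenue - cost

  def main(args: Array[String]): Unit =
    println(s"revenue=$revenue cost=$cost profit=$profit")
}
```

Here the revenue is 1850 and the cost of the first 15 units is 10*100 + 5*150 = 1750, so the profit is 100, matching the per-sale calculation in the question.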

2 answers

Answer 1
0 2016-09-19 04:35:39

Answer 2
0 2017-01-11 10:30:01
