Inventory Profit calculation in spark-scala
Each record is a 5-tuple (PRODUCT_ID, TRANSACTION_TYPE, QUANTITY, PRICE, DATE). TRANSACTION_TYPE is either "Buy" or "Sell". QUANTITY is the number of units of the product bought or sold, at the PRICE shown, on the given DATE.
Sold products are offset against the existing inventory, starting with the earliest instance of that inventory. Net profit is computed by offsetting the sold inventory against the earliest purchased inventory; if that purchase does not cover the full sold quantity, the next purchase is used, and so on (i.e. FIFO matching).
For example, consider the values in the table below:
1, Buy, 10, 100.0, Jan 1
2, Buy, 20, 200.0, Jan 2
1, Buy, 15, 150.0, Jan 3
1, Sell, 5, 120.0, Jan 5
1, Sell, 10, 125.0, Jan 6
Hundreds of files with the above schema are already stored on HDFS.
The profit calculation should then proceed as follows:
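For instance (a worked check of my own, not from the original post), applying the FIFO rule to product 1 above in plain Python:

```python
def fifo_profit(txns):
    """txns: (type, qty, price) tuples for ONE product, in date order."""
    revenue = sum(q * p for t, q, p in txns if t == "Sell")
    to_cover = sum(q for t, q, p in txns if t == "Sell")  # units to match
    cost = 0.0
    for t, q, p in txns:                 # earliest buys first (FIFO)
        if t == "Buy" and to_cover > 0:
            used = min(to_cover, q)
            cost += used * p
            to_cover -= used
    return revenue - cost

product1 = [
    ("Buy", 10, 100.0),   # Jan 1
    ("Buy", 15, 150.0),   # Jan 3
    ("Sell", 5, 120.0),   # Jan 5
    ("Sell", 10, 125.0),  # Jan 6
]
# revenue = 5*120 + 10*125 = 1850; FIFO cost = 10*100 + 5*150 = 1750
print(fifo_profit(product1))  # 100.0
```

Product 2 has no sells, so its profit is 0 and the net profit over both products is 100.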
Below is my code snippet, but it fails with a NullPointerException. Are there any better suggestions?
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.rdd._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

case class Inventory(PRODUCT_ID: Int, TRANSACTION_TYPE: String, QUANTITY: Long, PRICE: Double, DATE: String)

object MatchingInventory {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setAppName("XYZ")
    val sc = new SparkContext(conf)
    val sqlcontext = new SQLContext(sc)

    // Create a DataFrame of Inventory objects from the data, which spans any number of text files.
    import sqlcontext.implicits._
    val dfInvent = sc.textFile("Invent.txt")
      .map(_.split(","))
      .map(p => Inventory(p(0).trim.toInt, p(1).trim, p(2).trim.toLong, p(3).trim.toDouble, p(4).trim))
      .toDF().cache()
    dfInvent.show()

    val idDF = dfInvent.map { row => row.getInt(0) }.distinct
    //idDF.show()
    val netProfit = sc.accumulator(0.0)
    idDF.foreach { id =>
      val sellDF = dfInvent.filter((dfInvent("PRODUCT_ID").contains(id)) && (dfInvent("TRANSACTION_TYPE").contains("Sell")))
      val buyDF = dfInvent.filter((dfInvent("PRODUCT_ID").contains(id)) && (dfInvent("TRANSACTION_TYPE").contains("Buy")))
      var soldQ: Long = sellDF.map { row => row.getLong(2) }.reduce(_ + _)
      var sellPrice: Double = sellDF.map { row => row.getLong(2) * row.getDouble(3) }.reduce(_ + _) // reduce sends the result back to the driver
      var profit: Double = 0.0
      // profit for each bought item
      buyDF.foreach { row =>
        if ((soldQ > 0) && (soldQ < row.getLong(2))) { profit += sellPrice - (soldQ * row.getDouble(3)); soldQ = 0 }
        else if ((soldQ > 0) && (soldQ > row.getLong(2))) { profit += sellPrice - (row.getLong(2) * row.getDouble(3)); soldQ = soldQ - row.getLong(2) }
        else {}
      }
      netProfit += profit
    }
    println("Inventory net Profit" + netProfit)
  }
}
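The NullPointerException comes from referencing dfInvent inside idDF.foreach: that closure runs on the executors, where the driver-side DataFrame handle is null — Spark does not allow nesting one distributed operation inside another. The usual fix is to bring all rows of a product together with groupByKey and run ordinary local code per group. A minimal non-Spark sketch of that shape (illustrative data, my own code, not the original post's):

```python
# Plain-Python stand-in for keyBy(PRODUCT_ID).groupByKey().mapValues(sorted):
# group all rows for one product together, then sort each group by date,
# so the per-product profit can be computed with plain local code.
from collections import defaultdict

rows = [  # (PRODUCT_ID, TYPE, QTY, PRICE, day-of-month for ordering)
    (1, "Buy", 10, 100.0, 1),
    (2, "Buy", 20, 200.0, 2),
    (1, "Buy", 15, 150.0, 3),
    (1, "Sell", 5, 120.0, 5),
    (1, "Sell", 10, 125.0, 6),
]

groups = defaultdict(list)
for r in rows:
    groups[r[0]].append(r)
for pid in groups:
    groups[pid].sort(key=lambda r: r[4])  # earliest transactions first

print(sorted(groups.keys()))  # [1, 2]
print(len(groups[1]))         # 4
```

In Spark the same two steps are `rdd.keyBy(_.PRODUCT_ID).groupByKey` followed by `mapValues(_.toList.sortBy(...))`, which is exactly the structure the working solution further down uses.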
I tried something like this. It is working code; the only problem is that I use collect at a later stage to match buys against sells, which will cause memory problems on large data.
from pyspark.sql import SQLContext
from pyspark import SparkConf
from pyspark import SparkContext
import sys
from pyspark.sql.functions import *

if __name__ == "__main__":
    sc = SparkContext()
    sqlContext = SQLContext(sc)
    df = sqlContext.read.format('com.databricks.spark.csv').options(header='false', inferschema='true').load('test.csv')
    df = df.withColumn("C1", ltrim(df.C1))
    df.registerTempTable("tempTable")
    df = sqlContext.sql("select * from tempTable order by C0")
    dt = df.map(lambda s: (str(s[0]) + '-' + s[1], str(s[2]) + ',' + str(s[3])))
    dt = dt.reduceByKey(lambda a, b: a + '-' + b)
    ds = dt.collect()
    dicTran = {}
    for x in ds:
        key = (x[0].split('-'))[0]
        tratype = (x[0].split('-'))[1]
        val = {}
        if key in dicTran:
            val = dicTran[key]
        val[tratype] = x[1]
        dicTran[key] = val
    profit = 0
    for key, value in dicTran.iteritems():
        if 'Sell' in value:
            buy = value['Buy']
            sell = value['Sell']
            ls = sell.split('-')
            sellAmount = 0
            sellquant = 0
            for x in ls:
                y = x.split(',')
                sellAmount = sellAmount + float(y[0]) * float(y[1])
                sellquant = sellquant + float(y[0])
            lb = buy.split('-')
            for x in lb:
                y = x.split(',')
                if float(y[0]) >= sellquant:
                    profit += sellAmount - sellquant * float(y[1])
                else:
                    sellAmount -= float(y[0]) * float(y[1])
                    sellquant -= float(y[0])
    print 'profit', profit
Here is the logic as I see it:
1) For all rows with the same ID and transaction type, I concatenate quantity and price with a delimiter. 2) Then I collect and split them to compute the profit.
I know this will crash on large datasets because of the collect, but I could not come up with anything better. I will also try your solution.
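A list-based variant of step 1 (a sketch of mine with illustrative names, not the original code) avoids the string splitting entirely: instead of joining "qty,price" strings with "-", each value starts as a one-element list and reduceByKey concatenates lists:

```python
# Plain-Python stand-in for
#   rdd.map(lambda s: ((s[0], s[1]), [(qty, price)])).reduceByKey(lambda a, b: a + b)
# -- same grouping as the delimiter strings, but with no parsing step afterwards.
pairs = [
    ((1, "Buy"),  [(10, 100.0)]),
    ((2, "Buy"),  [(20, 200.0)]),
    ((1, "Buy"),  [(15, 150.0)]),
    ((1, "Sell"), [(5, 120.0)]),
    ((1, "Sell"), [(10, 125.0)]),
]

merged = {}
for key, vals in pairs:  # the associative merge that reduceByKey performs
    merged[key] = merged.get(key, []) + vals

sell_amount = sum(q * p for q, p in merged[(1, "Sell")])
sell_quant = sum(q for q, p in merged[(1, "Sell")])
print(sell_amount, sell_quant)  # 1850.0 15
```

The profit loop can then iterate over `merged[(pid, "Buy")]` tuples directly instead of re-splitting strings.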
So here is the solution I came up with:
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.rdd._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import java.text.SimpleDateFormat
import java.sql.Date
import scala.math.Ordering

// Defining the schema
case class Inventory(PRODUCT_ID: Int, TRANSACTION_TYPE: String, QUANTITY: Long, PRICE: Double, pDate: java.sql.Date)

object MatchingInventory {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setAppName("XYZ")
    val sc = new SparkContext(conf)
    val sqlcontext = new SQLContext(sc)
    import sqlcontext.implicits._

    val format = new SimpleDateFormat("MMM d")

    // Read data from a directory that contains multiple files
    val dfInvent = sc.textFile("data/*.txt")
      .map(_.split(","))
      .map(p => Inventory(p(0).trim.toInt, p(1).trim, p(2).trim.toLong, p(3).trim.toDouble, new Date(format.parse(p(4)).getTime)))
      .cache()

    // FIFO profit for one product's transactions (assumed sorted by date):
    // first accumulate sell revenue and sold quantity, then subtract the
    // cost of the earliest buys until the sold quantity is covered.
    def calculateProfit(data: Iterable[Inventory]): Double = {
      var soldQ: Long = 0
      var profit: Double = 0
      for (i <- data) {
        if (i.TRANSACTION_TYPE == "Sell") {
          soldQ = soldQ + i.QUANTITY
          profit = profit + i.PRICE * i.QUANTITY
        }
      }
      for (i <- data) {
        if (i.TRANSACTION_TYPE == "Buy") {
          if ((soldQ > 0) && (soldQ <= i.QUANTITY)) { profit = profit - (soldQ * i.PRICE); soldQ = 0 }
          else if ((soldQ > 0) && (soldQ > i.QUANTITY)) { profit = profit - (i.QUANTITY * i.PRICE); soldQ = soldQ - i.QUANTITY }
        }
      }
      profit
    }

    val key: RDD[(Int, Iterable[Inventory])] = dfInvent.keyBy(r => r.PRODUCT_ID).groupByKey
    val values: RDD[(Int, List[Inventory])] = key.mapValues(v => v.toList.sortBy(_.pDate.getTime))
    val pro = values.map { case (k, v) => (k, calculateProfit(v)) }
    val netProfit = pro.map { case (k, v) => v }.reduce(_ + _)
    println("Inventory NetProfit" + netProfit)
  }
}