
Invalid syntax error: building a decision tree with Python and Spark (churn prediction)

As the title suggests: I am working on a decision tree model for churn prediction. I am a total beginner in data science, Python and Spark.

Combining all the examples from the lectures and the online documentation, I managed to come up with a decision tree model. The only issue is that the error-calculation function is giving me a syntax error.

Basically, the data I use for the model looks like this:

[LabeledPoint(0.0, [1031.0,947.0,0.333333333333,10.9933333333,10.3,12.0,1.33333333333,10.0133333333,83.6666666667,5.86,55.69,0.596666666667,10.3333333333,0.666666666667,0.0,0.0,0.0,0.666666666667,23.3333333333,2.88333333333,25.0,0.666666666667,0.0,0.0,0.0,0.666666666667,135.333333333,4.44,0.06,0.333333333333,16.3333333333,0.98,0.333333333333,57.6666666667,3.46,0.333333333333,0.0,0.0,0.333333333333,14.0,0.0,0.0,0.0,0.0,0.0,0.0,1307.0,5.66666666667,22.0166666667,130.48,0.0,65.3333333333,0.0,287.333333333,34.0,113.666666667,0.0,0.0,0.0,1.0,1.0,0.0,1.0]),
LabeledPoint(0.0, [4231.0,951.0,1.33333333333,27.5466666667,6.45,22.0,1.0,12.0133333333,46.3333333333,6.45,47.15,1.32333333333,8.81,1.33333333333,0.0,0.0,0.0,1.33333333333,31.6666666667,6.4,42.6566666667,1.33333333333,0.0,0.0,0.0,1.33333333333,0.666666666667,0.0,0.0,57.0,0.0,0.0,57.0,0.0,0.0,57.0,0.0,0.0,57.0,10.6666666667,0.0,0.0,0.0,0.0,0.0,0.0,1307.0,4.0,32.0266666667,156.966666667,0.0,145.43,0.0,1.66666666667,0.0,0.333333333333,0.0,0.0,0.0,1.0,1.0,0.0,1.0]),
LabeledPoint(0.0, [5231.0,523.0,0.666666666667,14.62,1.1,1307.0,0.0,0.0,14.3333333333,1.1,7.57333333333,0.726666666667,4.84,0.666666666667,0.0,0.0,0.0,0.666666666667,8.33333333333,0.323333333333,2.15666666667,0.666666666667,0.0,0.0,0.0,0.666666666667,0.0,0.0,0.0,1307.0,0.0,0.0,1307.0,0.0,0.0,1307.0,0.0,0.0,1307.0,8.33333333333,0.0,0.0,0.0,0.0,0.0,0.0,1307.0,0.0,0.0,47.33,0.0,10.3566666667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0]),
LabeledPoint(0.0, [6031.0,741.0,7.0,5.38666666667,2.13,58.0,0.333333333333,4.0,21.3333333333,1.35333333333,11.2966666667,0.48,3.2,8.33333333333,0.666666666667,0.0,0.0,8.33333333333,11.3333333333,0.453333333333,3.03,8.33333333333,1.0,0.0133333333333,0.166666666667,8.33333333333,2.33333333333,0.776666666667,0.363333333333,23.0,1.33333333333,0.08,23.0,0.0,0.0,23.0,0.333333333333,0.03,23.0,9.33333333333,0.666666666667,1.33333333333,0.0,0.0,0.0,0.0,1307.0,1.33333333333,16.0,61.25,3.31666666667,10.94,3.65,11.3333333333,7.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0]),
LabeledPoint(0.0, [8831.0,840.0,5.33333333333,2.21,2.76,35.6666666667,0.666666666667,4.0,66.3333333333,2.76,17.7466666667,0.283333333333,1.20666666667,5.33333333333,0.0,0.0,0.0,5.33333333333,42.6666666667,2.43333333333,16.2166666667,5.33333333333,1.0,0.0,0.0,5.33333333333,1.0,0.0,0.0,23.0,0.0,0.0,23.0,0.666666666667,0.0,23.0,0.0,0.0,23.0,6.33333333333,0.0,1.33333333333,0.0,0.0,0.0,0.0,1307.0,1.66666666667,10.0,62.6333333333,0.0,56.7833333333,0.0,4.33333333333,0.666666666667,2.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0])]

And then I use the code provided in Spark documentation for decision trees:

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                 impurity='gini', maxDepth=5, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
print(model.toDebugString())

And

testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())

is giving me the error of:

  File "<ipython-input-70-e37b435ea51d>", line 1
testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())
                                             ^
SyntaxError: invalid syntax

In case someone would like to see the whole code, I put it here in my Dropbox

I have no idea why it gives me this error. This line seems to work fine for other people, so I am afraid it could be related to the steps before the model creation.

I would really appreciate your help!

lambda (v, p) is only valid Python syntax in Python 2.7 and below. You're probably using Python 3, where tuple parameter unpacking in function signatures is no longer allowed (it was removed by PEP 3113).

I believe the Python 3-compatible version would look like:

testErr = labelsAndPredictions.filter(lambda seq: seq[0] != seq[1]).count() / float(testData.count())
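To see why this works, here is the same error computation sketched in plain Python, with a list of (label, prediction) tuples standing in for the zipped RDD — the list and its values below are made up for illustration, not taken from the question's data:

```python
# Plain-Python sketch of the Python 3-compatible error computation.
# A list of (label, prediction) tuples stands in for the zipped RDD.
labels_and_predictions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# filter() passes each element as a single tuple, so index into it
# instead of unpacking it in the parameter list (removed in Python 3).
mismatches = list(filter(lambda seq: seq[0] != seq[1], labels_and_predictions))
test_err = len(mismatches) / float(len(labels_and_predictions))

print('Test Error = ' + str(test_err))  # 2 of the 4 pairs disagree -> 0.5
```

The same one-argument lambda works unchanged when passed to the Spark RDD's filter, since an RDD of zipped pairs also hands the function one tuple per element.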

Note that simply dropping the parentheses — lambda (v, p): becoming lambda v, p: — does not work here: filter calls the function with a single tuple element, so a two-argument lambda raises a TypeError at runtime. Unpack inside the body instead, e.g. lambda vp: vp[0] != vp[1]
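A quick pure-Python check (no Spark needed) of how a filter-style callable receives each element as one tuple; the names two_args and one_arg are just illustrative:

```python
# In Python 2, `lambda (v, p): v != p` unpacked the tuple automatically.
# In Python 3 that is a SyntaxError, and a two-argument lambda is not a
# drop-in substitute, because filter() passes one tuple per element.
two_args = lambda v, p: v != p

try:
    two_args((0.0, 1.0))   # called the way filter() would call it
    called_ok = True
except TypeError:          # missing required positional argument 'p'
    called_ok = False

# Unpacking inside the body works in both Python 2 and Python 3:
one_arg = lambda vp: vp[0] != vp[1]

print(called_ok, one_arg((0.0, 1.0)))
```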

lambda (x,y): is valid syntax only in Python 2. In Python 3 it is a SyntaxError.
