
Unit testcases on Pyspark dataframe operations

I have written some code in Python with an SQL context, i.e. PySpark, to perform operations on CSV files by converting them into PySpark dataframes (operations such as pre-processing, renaming columns, creating new columns and appending them to the same dataframe, and so on). I wish to write unit test cases for it, but I have no idea how to write unit test cases against dataframes. Can anyone help me write unit test cases on dataframes in PySpark, or point me to some resources on testing dataframes?

Dataframes are no different from anything else in pyspark land. You can start by looking at the Python section of spark-testing-base. There are several interesting projects that have dataframe tests, so you can start by peeking at how they do it: Sparkling Pandas is one, and here is another example. There is also find-spark, which will help locate your Spark installation. But the basic idea is to set up the path properly before you start your test:

def add_pyspark_path():
    """
    Add PySpark to the PYTHONPATH
    Thanks go to this project: https://github.com/holdenk/sparklingpandas
    """
    import sys
    import os
    try:
        sys.path.append(os.path.join(os.environ['SPARK_HOME'], "python"))
        sys.path.append(os.path.join(os.environ['SPARK_HOME'],
            "python","lib","py4j-0.9-src.zip"))
    except KeyError:
        print "SPARK_HOME not set"
        sys.exit(1)

add_pyspark_path() # Now we can import pyspark
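
As an alternative to appending to sys.path by hand, the find-spark package mentioned above can do the lookup for you. A minimal sketch, assuming findspark is installed (pip install findspark) and SPARK_HOME is set:

# Alternative to add_pyspark_path(): let findspark locate Spark.
# Assumes `pip install findspark` and that SPARK_HOME is set.
import findspark
findspark.init()   # prepends Spark's python/ and py4j libs to sys.path

import pyspark     # pyspark is now importable in the test process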

And normally you would have one base test case class:

import logging
import unittest

from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext, HiveContext

def quiet_py4j():
    """ turn down spark logging for the test context """
    logger = logging.getLogger('py4j')
    logger.setLevel(logging.WARN)

class SparkTestCase(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        quiet_py4j()

        # Set up a SparkContext shared by the tests in this class
        conf = SparkConf()
        conf.set("spark.executor.memory","1g")
        conf.set("spark.cores.max", "1")
        #conf.set("spark.master", "spark://192.168.1.2:7077")
        conf.set("spark.app.name", "nosetest")
        cls.sc = SparkContext(conf=conf)
        cls.sqlContext = HiveContext(cls.sc)

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()
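
With the base class in place, a concrete dataframe test can build a small input dataframe inline and assert on the result of the operation under test. The sketch below is illustrative only; the column names and the rename step are hypothetical stand-ins for your own pre-processing code:

class RenameColumnTest(SparkTestCase):
    def test_rename_column(self):
        # Build a tiny input dataframe inline instead of reading a CSV
        df = self.sqlContext.createDataFrame(
            [(1, "alice"), (2, "bob")], ["id", "raw_name"])

        # Operation under test: a simple rename standing in for your
        # own pre-processing / column-creation logic
        result = df.withColumnRenamed("raw_name", "name")

        # Assert on the schema and on the collected rows
        self.assertEqual(result.columns, ["id", "name"])
        self.assertEqual(result.count(), 2)
        self.assertEqual(sorted(r.name for r in result.collect()),
                         ["alice", "bob"])

if __name__ == '__main__':
    unittest.main()

Collecting rows back to the driver is fine for small test fixtures like this; for larger expected outputs you would typically compare schemas and row counts, or use subtract() to check for mismatched rows, rather than materializing everything.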
