
Loading a table into a PySpark DataFrame with limits

Is it possible in PySpark to load only a certain number of rows into a dataframe while reading it from the database? By a certain number, I mean whether a limit can be given to the sqlContext when reading from the database, so that the whole table doesn't have to be read through (it is very expensive to iterate through 750K rows).

Here's the code that I'm currently using to filter out the required data. I'm using Python 3.7 and a Cassandra database alongside PySpark:

from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import sum as _sum

def connect_cassandra():
    spark = SparkSession.builder \
      .appName('SparkCassandraApp') \
      .config('spark.cassandra.connection.host', 'localhost') \
      .config("spark.driver.memory", "15g") \
      .config("spark.executor.memory", "15g") \
      .config("spark.driver.cores", "4") \
      .config("spark.num.executors", "6") \
      .config("spark.executor.cores", "4") \
      .config('spark.cassandra.connection.port', '9042') \
      .config('spark.cassandra.output.consistency.level', 'ONE') \
      .master('local[*]') \
      .getOrCreate()

    sqlContext = SQLContext(spark)
    return sqlContext

def total_bandwidth(start_date, end_date):
    sqlContext = connect_cassandra()

    try:
        # Load the whole Cassandra table; this is the expensive part.
        df = sqlContext \
          .read \
          .format("org.apache.spark.sql.cassandra") \
          .options(table="user_info", keyspace="acrs") \
          .load()
    except Exception as e:
        print(e)
        return

    # Filter by date, sum the data column per (src_ip, dst_ip) pair
    # and pull every resulting row back to the driver.
    rows = df.where(df["created"] > str(start_date)) \
            .where(df["created"] < str(end_date)) \
            .groupBy(['src_ip', 'dst_ip']) \
            .agg(_sum('data').alias('total')) \
            .collect()

    data_dict = []
    for row in rows:
        src_ip = row['src_ip']
        dst_ip = row['dst_ip']
        data = row['total']
        data = {'src_ip' : src_ip, 'dst_ip' : dst_ip, 'data' : data}
        data_dict.append(data)

    print(data_dict)

As you can see, I'm trying to filter the data using start_date and end_date, but this takes too much time and results in slow operations. I'd like to know whether there are any DataFrameReader options available when loading the table into the dataframe, so that the time taken is reduced (exponentially preferred :p).

I read the DataFrameReader documentation and found option(String key, String value), but the options themselves are undocumented there, so it's not possible to find out which options exist for the Cassandra database and how they can be used.
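
To make it concrete, something along the lines of the sketch below is what I was hoping for. The load_user_info name, the limit_rows argument and the option shown are just placeholders and guesses on my part, not options I know to exist:

def load_user_info(sqlContext, limit_rows=10000):
    # Hypothetical sketch: I don't know whether the connector honours an option
    # like this, or whether limit() actually avoids scanning the whole table.
    return sqlContext \
        .read \
        .format("org.apache.spark.sql.cassandra") \
        .options(table="user_info", keyspace="acrs") \
        .option("spark.cassandra.input.fetch.size_in_rows", "1000") \
        .load() \
        .limit(limit_rows)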

Your main problem is that you are using the append method inside a plain Python loop over the collected rows. Since you have a large number of rows in your dataframe, that is really inefficient. I'd rather use dedicated PySpark methods to achieve the desired result.

I created a temporary dataframe (I assume that you have already created a SparkSession) with 1 million rows on my local machine:

>>> import pandas as pd

>>> n = 1000000
>>> df = spark.createDataFrame(
        pd.DataFrame({
            'src_ip': n * ['192.160.1.0'],
            'dst_ip': n * ['192.168.1.1'],
            'total': n * [1]
        })
    )
>>> df.count()
1000000

Let's select only the desired columns from your table.

>>> import pyspark.sql.functions as F
>>> df.select('src_ip', 'dst_ip', F.col('total').alias('data')).show(5)
+-----------+-----------+----+
|     src_ip|     dst_ip|data|
+-----------+-----------+----+
|192.160.1.0|192.168.1.1|   1|
|192.160.1.0|192.168.1.1|   1|
|192.160.1.0|192.168.1.1|   1|
|192.160.1.0|192.168.1.1|   1|
|192.160.1.0|192.168.1.1|   1|
+-----------+-----------+----+
only showing top 5 rows

At the end, let's create the desired list of data dictionaries. The easiest way to collect all the data is to use a list comprehension. Once we select the columns that we want to combine into a dictionary, we can use the asDict() method on each DataFrame Row.

Nitpick:

  • If you want to collect all values, use the collect() method on the DataFrame.
  • If you don't know the exact size of the DataFrame, you can use the take(n) method, which will return n elements from your DataFrame.

>>> dict_list = [i.asDict() for i in df.select('src_ip', 'dst_ip', F.col('total').alias('data')).take(5)]
>>> dict_list
[{'data': 1, 'dst_ip': '192.168.1.1', 'src_ip': '192.160.1.0'},
 {'data': 1, 'dst_ip': '192.168.1.1', 'src_ip': '192.160.1.0'},
 {'data': 1, 'dst_ip': '192.168.1.1', 'src_ip': '192.160.1.0'},
 {'data': 1, 'dst_ip': '192.168.1.1', 'src_ip': '192.160.1.0'},
 {'data': 1, 'dst_ip': '192.168.1.1', 'src_ip': '192.160.1.0'}]
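
Putting this together with your original query, the end of your total_bandwidth function could look roughly like the sketch below (it reuses the column names and the _sum aggregation from your question; adjust as needed):

    grouped = df.where(df["created"] > str(start_date)) \
                .where(df["created"] < str(end_date)) \
                .groupBy('src_ip', 'dst_ip') \
                .agg(_sum('data').alias('data'))

    # Build the list of dictionaries straight from the Row objects instead of
    # appending in a Python loop; swap collect() for take(n) if a sample is enough.
    data_dict = [row.asDict() for row in grouped.collect()]
    print(data_dict)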
