Create multiple Spark DataFrames from RDD based on some key value (pyspark)

I have some text files containing JSON objects (one object per line). Example:

{"a": 1, "b": 2, "table": "foo"}
{"c": 3, "d": 4, "table": "bar"}
{"a": 5, "b": 6, "table": "foo"}
...

I want to parse the contents of the text files into Spark DataFrames based on the table name. So in the example above, I would have a DataFrame for "foo" and another DataFrame for "bar". I have made it as far as grouping the lines of JSON into lists inside of an RDD with the following (pyspark) code:

import json
import os
text_rdd = sc.textFile(os.path.join("/path/to/data", "*"))
tables_rdd = text_rdd.groupBy(lambda x: json.loads(x)['table'])

This produces an RDD containing a list of tuples with the following structure:

RDD[("foo", ['{"a": 1, "b": 2, "table": "foo"}', ...],
    ("bar", ['{"c": 3, "d": 4, "table": "bar"}', ...]]

How do I break this RDD into a DataFrame for each table key?

edit: I tried to clarify above that there are multiple lines in a single file containing information for a table. I know that I can call .collectAsMap on the "groupBy" RDD that I have created, but I know that this will consume a sizeable amount of RAM on my driver. My question is: is there a way to break the "groupBy" RDD into multiple DataFrames without using .collectAsMap?

You can split it efficiently into parquet partitions. First, we'll convert it into a DataFrame:

text_rdd = sc.textFile(os.path.join("/path/to/data", "*"))
df = spark.read.json(text_rdd)
df.printSchema()
    root
     |-- a: long (nullable = true)
     |-- b: long (nullable = true)
     |-- c: long (nullable = true)
     |-- d: long (nullable = true)
     |-- table: string (nullable = true)

Now we can write it:

df.write.partitionBy('table').parquet([output directory name])

If you list the content of [output directory name], you'll see as many partitions as there are distinct values of table:

hadoop fs -ls [output directory name]

    _SUCCESS
    table=bar/
    table=foo/
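
To get one table's data back as its own DataFrame, you can read the partitioned output and filter on the partition column; Spark prunes the directories, so only the files under that table's partition are scanned. A minimal sketch, assuming a concrete output path in place of [output directory name]:

# Illustrative path standing in for [output directory name].
output_dir = "/path/to/output"

# The filter on the partition column triggers partition pruning, so only
# the files under table=foo/ are read.
foo_df = spark.read.parquet(output_dir).filter("table = 'foo'")
foo_df.show()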

If you want to keep only each table's columns, you can do this (assuming the full list of columns appears whenever the table appears in the file):

import ast
from pyspark.sql import Row

# Parse each line into a dict and keep, for every table, the sorted list of
# its keys; distinct() leaves one (table, keys) row per table.
table_cols = spark.createDataFrame(text_rdd.map(lambda l: ast.literal_eval(l)).map(lambda l: Row(
        table = l["table"],
        keys = sorted(l.keys())
    ))).distinct().toPandas()
# Turn the small pandas result into a plain {table: columns} dict.
table_cols = table_cols.set_index("table")
table_cols.to_dict()["keys"]

    {u'bar': [u'c', u'd', u'table'], u'foo': [u'a', u'b', u'table']}
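
One way to use this mapping, as a sketch assuming the df and table_cols objects built above are still in scope, is to filter the full DataFrame by table name and keep only that table's columns (dropping the all-null columns that belong to the other tables):

keys_by_table = table_cols.to_dict()["keys"]

# One DataFrame per table, restricted to that table's own columns.
per_table = {
    name: df.filter(df["table"] == name).select(*cols)
    for name, cols in keys_by_table.items()
}
per_table["foo"].show()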

Here are the steps:

  1. Map each text string to JSON.

     jsonRdd = sc.textFile(os.path.join("/path/to/data", "*")).map(json.loads)
  2. Get all distinct table names to the driver.

     tables = jsonRdd.map(lambda obj: obj['table']).distinct().collect()
  3. Iterate through the tables from step 2 and filter the main jsonRdd to create an RDD for each individual table.

     tablesRDD = []
     for table in tables:
         # Categorize each main rdd record based on table name: compare each
         # json object's "table" element with the current loop table and keep
         # the record on a match (table=table pins the name in the lazy lambda).
         tablesRDD.append(jsonRdd.filter(lambda jsonObj, table=table: jsonObj['table'] == table))

I am not a Python developer, so the exact code snippets might not work as-is.
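
Putting the three steps together, a minimal runnable sketch of this approach might look like the following. It assumes the same input layout as the question, with sc and spark already available, and converts each filtered RDD to a DataFrame via spark.read.json (the paths and table names are illustrative):

import json
import os

# Step 1: parse each line into a Python dict.
jsonRdd = sc.textFile(os.path.join("/path/to/data", "*")).map(json.loads)

# Step 2: bring the (small) list of distinct table names to the driver.
tables = jsonRdd.map(lambda obj: obj["table"]).distinct().collect()

# Step 3: filter the parsed RDD once per table; re-serializing to JSON
# strings lets spark.read.json infer each table's schema independently.
dataframes = {
    table: spark.read.json(
        jsonRdd.filter(lambda obj, table=table: obj["table"] == table)
               .map(json.dumps)
    )
    for table in tables
}

dataframes["foo"].show()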
