
Add new rows to pyspark Dataframe

I am very new to pyspark but familiar with pandas. I have a pyspark DataFrame:

# instantiate Spark
spark = SparkSession.builder.getOrCreate()

# make some test data
columns = ['id', 'dogs', 'cats']
vals = [
     (1, 2, 0),
     (2, 0, 1)
]

# create DataFrame
df = spark.createDataFrame(vals, columns)

I want to add a new row (4, 5, 7) so that it will output:

df.show()
+---+----+----+
| id|dogs|cats|
+---+----+----+
|  1|   2|   0|
|  2|   0|   1|
|  4|   5|   7|
+---+----+----+

As thebluephantom has already said, union is the way to go. I'm just answering your question to give you a pyspark example:

# if not already created automatically, instantiate a SparkSession
spark = SparkSession.builder.getOrCreate()

columns = ['id', 'dogs', 'cats']
vals = [(1, 2, 0), (2, 0, 1)]

df = spark.createDataFrame(vals, columns)

newRow = spark.createDataFrame([(4,5,7)], columns)
appended = df.union(newRow)
appended.show()

Please also have a look at the Databricks FAQ: https://kb.databricks.com/data/append-a-row-to-rdd-or-dataframe.html

To append a row to a dataframe, one can also use the collect method. collect() converts the dataframe to a list, so you can append data directly to the list and then convert the list back to a dataframe.

My spark dataframe, called df, looks like this:

+---+----+------+
| id|name|gender|
+---+----+------+
|  1|   A|     M|
|  2|   B|     F|
|  3|   C|     M|
+---+----+------+

Convert this dataframe to a list using collect:

collect_df = df.collect()
print(collect_df)

[Row(id=1, name='A', gender='M'),
 Row(id=2, name='B', gender='F'),
 Row(id=3, name='C', gender='M')]

Append the new row to this list:

collect_df.append({"id" : 5, "name" : "E", "gender" : "F"})
print(collect_df)

[Row(id=1, name='A', gender='M'),
 Row(id=2, name='B', gender='F'),
 Row(id=3, name='C', gender='M'),
 {'id': 5, 'name': 'E', 'gender': 'F'}]

Convert this list back to a dataframe:

added_row_df = spark.createDataFrame(collect_df)
added_row_df.show()

+---+----+------+
| id|name|gender|
+---+----+------+
|  1|   A|     M|
|  2|   B|     F|
|  3|   C|     M|
|  5|   E|     F|
+---+----+------+

From something I did, using union, here is a partial code block - you will of course need to adapt it to your own situation:

// accumulator DataFrame with a single "phrase" column, starting empty
val dummySchema = StructType(
  StructField("phrase", StringType, true) :: Nil)
var dfPostsNGrams2 = spark.createDataFrame(sc.emptyRDD[Row], dummySchema)

// union the exploded contents of each n-gram column into the accumulator
for (i <- i_grams_Cols) {
  val nameCol = col(i)
  dfPostsNGrams2 = dfPostsNGrams2.union(dfPostsNGrams.select(explode(nameCol).as("phrase")).toDF)
}
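
For reference, a rough pyspark sketch of the same pattern (an empty accumulator DataFrame that collects exploded columns via union in a loop). The names df_posts_ngrams and i_grams_cols below are placeholders mirroring the Scala dfPostsNGrams and i_grams_Cols, and are assumed to already exist:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# accumulator DataFrame with a single "phrase" column, starting empty
dummy_schema = StructType([StructField("phrase", StringType(), True)])
df_posts_ngrams2 = spark.createDataFrame([], dummy_schema)

# i_grams_cols: list of array-typed column names in df_posts_ngrams (placeholder)
for name in i_grams_cols:
    exploded = df_posts_ngrams.select(explode(col(name)).alias("phrase"))
    df_posts_ngrams2 = df_posts_ngrams2.union(exploded)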

A union of the DF with another DF holding the new rows is the way to go.
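
As a small variation on the union example above (a sketch, reusing the df and columns from the question), unionByName matches columns by name instead of position, which avoids silently mixing columns up if the new rows are built in a different column order:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

columns = ['id', 'dogs', 'cats']
df = spark.createDataFrame([(1, 2, 0), (2, 0, 1)], columns)

# one-row DataFrame built from a Row, then unioned by column name
new_row = spark.createDataFrame([Row(id=4, dogs=5, cats=7)])
appended = df.unionByName(new_row)
appended.show()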

Another alternative would be to use the partitioned parquet format, and add an extra parquet file for each dataframe you want to append. This way you can create (hundreds, thousands, millions of) parquet files, and spark will just read them all as a union when you read the directory later.

This example uses pyarrow.

Note that I also showed how to write a single, unpartitioned parquet file (example.parquet), in case you already know where you want to put that single parquet file.

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

headers=['A', 'B', 'C']

row1 = ['a1', 'b1', 'c1']
row2 = ['a2', 'b2', 'c2']

df1 = pd.DataFrame([row1], columns=headers)
df2 = pd.DataFrame([row2], columns=headers)

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df3 = pd.concat([df1, df2], ignore_index=True)


table = pa.Table.from_pandas(df3)

pq.write_table(table, 'example.parquet', flavor='spark')
pq.write_to_dataset(table, root_path="test_part_file", partition_cols=['B', 'C'], flavor='spark')

# Adding a new partition (B=b3/C=c3)


row3 = ['a3', 'b3', 'c3']
df4 = pd.DataFrame([row3], columns=headers)

table2 = pa.Table.from_pandas(df4)
pq.write_to_dataset(table2, root_path="test_part_file", partition_cols=['B', 'C'], flavor='spark')

# Add another parquet file to the B=b2/C=c2 partition
# Note this does not overwrite existing partitions, it just appends a new .parquet file.
# If files already exist, then you will get a union result of the two (or multiple) files when you read the partition
row5 = ['a5', 'b2', 'c2']
df5 = pd.DataFrame([row5], columns=headers)
table3 = pa.Table.from_pandas(df5)
pq.write_to_dataset(table3, root_path="test_part_file", partition_cols=['B', 'C'], flavor='spark')

Reading the output afterwards:

from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("testing parquet read")
         .getOrCreate())

df_spark = spark.read.parquet('test_part_file')
df_spark.show(25, False)

You should see something like this:

+---+---+---+
|A  |B  |C  |
+---+---+---+
|a5 |b2 |c2 |
|a2 |b2 |c2 |
|a1 |b1 |c1 |
|a3 |b3 |c3 |
+---+---+---+

If you run the same thing end to end again, you should see duplicates like this (since all of the previous parquet files are still there, spark unions them):

+---+---+---+
|A  |B  |C  |
+---+---+---+
|a2 |b2 |c2 |
|a5 |b2 |c2 |
|a5 |b2 |c2 |
|a2 |b2 |c2 |
|a1 |b1 |c1 |
|a1 |b1 |c1 |
|a3 |b3 |c3 |
|a3 |b3 |c3 |
+---+---+---+
