Selecting random columns from a very large dataframe in pyspark

Question

I have a dataframe in pyspark with around 150 columns, obtained by joining different tables. I need to write the dataframe to a file with the columns in a specific order: first columns 1 to 50, then columns 90 to 110, and then columns 70 and 72. That is, I want to select only specific columns and rearrange them.

I know that one way is to use df.select("give your column order"), but in my case there are so many columns that it is not practical to write every column name in 'select'.

Please tell me how I can achieve this in pyspark.

Note: I cannot provide any sample data because the number of columns is very large, and the number of columns is the main blocker in my case.

Answer 1

You can create the list of columns programmatically:

first_df.join(second_df, on='your_condition').select(first_df.columns + second_df.columns)

You can select a random subset of columns by using the random.sample(first_df.columns, number_of_columns) function.
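As a rough sketch of the random.sample approach, the snippet below works on a plain list of names standing in for first_df.columns. The names col_1 to col_150 and the fixed seed are assumptions made for illustration only; they are not part of the original answer.

```python
import random

# Hypothetical column names standing in for first_df.columns
columns = ["col_{}".format(i) for i in range(1, 151)]

# Seed only so that this sketch is reproducible; drop it for real use
random.seed(0)

number_of_columns = 5
# random.sample picks distinct items, so no column is chosen twice
random_subset = random.sample(columns, number_of_columns)

# With a real dataframe this list would be passed to select:
# first_df.select(random_subset)
```

Note that random.sample does not preserve the original column order, so sort the result first if the order matters.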

Hope this helps :)

Answer 2

It sounds like all you want to do is programmatically get the list of column names, pick out one or more slices from that list, and then select that subset of columns from your dataframe in some order. You can do this by manipulating the list df.columns. As an example:

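# Three rows of ten integers each; assumes an existing sqlContext
# (spark.createDataFrame works the same way with a SparkSession)
a = [list(range(10)), list(range(1, 11)), list(range(2, 12))]
df = sqlContext.createDataFrame(a, schema=['col_' + i for i in 'abcdefghij'])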

df is a dataframe with columns ['col_a', 'col_b', 'col_c', 'col_d', 'col_e', 'col_f', 'col_g', 'col_h', 'col_i', 'col_j']. You can get that list by calling df.columns, and you can slice and reorder it like any other Python list. How you do that depends on which columns you want to select from the df and in which order. For example:

mycolumnlist=df.columns[8:9]+df.columns[0:5]
df[mycolumnlist].show()

Returns:

+-----+-----+-----+-----+-----+-----+
|col_i|col_a|col_b|col_c|col_d|col_e|
+-----+-----+-----+-----+-----+-----+
|    8|    0|    1|    2|    3|    4|
|    9|    1|    2|    3|    4|    5|
|   10|    2|    3|    4|    5|    6|
+-----+-----+-----+-----+-----+-----+
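Applied to the ranges in the question (columns 1 to 50, then 90 to 110, then 70 and 72), the same slicing idea might look like the sketch below. It uses a plain list of placeholder names c1 to c150 in place of df.columns, which is an assumption for illustration. The question counts columns from 1, while Python lists are indexed from 0, so each start position is shifted down by one.

```python
# Hypothetical names standing in for df.columns on a 150-column dataframe
columns = ["c{}".format(i) for i in range(1, 151)]

# 1-based columns 1..50   -> slice [0:50]
# 1-based columns 90..110 -> slice [89:110]
# 1-based columns 70, 72  -> indices 69 and 71
mycolumnlist = columns[0:50] + columns[89:110] + [columns[69], columns[71]]

# With a real dataframe the reordered subset would then be selected:
# df.select(mycolumnlist)
```

The resulting list holds 73 names (50 + 21 + 2) in the requested order, and the selected dataframe can then be written to a file as usual.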
