Why are values not being assigned in PySpark?

Question

I am working with a DataFrame using PySpark in a Jupyter Notebook. I assigned new values to the columns I wanted, but when I call .show() the DataFrame still displays the original values. I'm not sure what I did wrong.

What I am trying to do is replicate scikit-learn's LabelEncoder(), which I used on a Pandas DataFrame. This is my solution using LabelEncoder() with Pandas:

le = LabelEncoder()
result['Gender'] = le.fit_transform(result['Gender'])
result['Country'] = le.fit_transform(result['Country'])
result['self_employed'] = le.fit_transform(result['self_employed'])
result['family_history'] = le.fit_transform(result['family_history'])
result['treatment'] = le.fit_transform(result['treatment'])
result['work_interfere'] = le.fit_transform(result['work_interfere'])
result['remote_work'] = le.fit_transform(result['remote_work'])
result['tech_company'] = le.fit_transform(result['tech_company'])
result['benefits'] = le.fit_transform(result['benefits'])
result['care_options'] = le.fit_transform(result['care_options'])
result['wellness_program'] = le.fit_transform(result['wellness_program'])
result['seek_help'] = le.fit_transform(result['seek_help'])
result['anonymity'] = le.fit_transform(result['anonymity'])
result['leave'] = le.fit_transform(result['leave'])
result['mental_health_consequence'] = le.fit_transform(result['mental_health_consequence'])
result['phys_health_consequence'] = le.fit_transform(result['phys_health_consequence'])
result['coworkers'] = le.fit_transform(result['coworkers'])
result['supervisor'] = le.fit_transform(result['supervisor'])
result['mental_vs_physical'] = le.fit_transform(result['mental_vs_physical'])
result['obs_consequence'] = le.fit_transform(result['obs_consequence'])
result['mental_issue_in_tech'] = le.fit_transform(result['mental_issue_in_tech'])

Now I would like to do the same thing in PySpark, but PySpark doesn't support LabelEncoder(), so I assigned the values to each column by hand instead. Here's the code I have tried:

new_result = result.withColumn('Gender',f.when(f.col('Gender')== 'Male',f.lit(0)).\
when(f.col('Gender')== 'Female',f.lit(1)).\
when(f.col('Gender')== 'Other',f.lit(2)).\
otherwise(f.col('Gender')))


new_result = result.withColumn('Country',f.when(f.col('Country')== 'Russia',f.lit(0)).\
when(f.col('Country')== 'Bahamas The',f.lit(1)).\
when(f.col('Country')== 'Sweden',f.lit(2)).\
when(f.col('Country')== 'Philippines',f.lit(3)).\
when(f.col('Country')== 'Singapore',f.lit(4)).\
when(f.col('Country')== 'Germany',f.lit(5)).\
when(f.col('Country')== 'France',f.lit(6)).\
when(f.col('Country')== 'Greece',f.lit(7)).\
when(f.col('Country')== 'Belgium',f.lit(8)).\
when(f.col('Country')== 'Finland',f.lit(9)).\
when(f.col('Country')== 'United States',f.lit(10)).\
when(f.col('Country')== 'India',f.lit(11)).\
when(f.col('Country')== 'China',f.lit(12)).\
when(f.col('Country')== 'Croatia',f.lit(13)).\
when(f.col('Country')== 'Nigeria',f.lit(14)).\
when(f.col('Country')== 'Italy',f.lit(15)).\
when(f.col('Country')== 'Norway',f.lit(16)).\
when(f.col('Country')== 'Spain',f.lit(17)).\
when(f.col('Country')== 'Denmark',f.lit(18)).\
when(f.col('Country')== 'Ireland',f.lit(19)).\
when(f.col('Country')== 'Thailand',f.lit(20)).\
when(f.col('Country')== 'Israel',f.lit(21)).\
when(f.col('Country')== 'Uruguay',f.lit(22)).\
when(f.col('Country')== 'Mexico',f.lit(23)).\
when(f.col('Country')== 'Georgia',f.lit(24)).\
when(f.col('Country')== 'Switzerland',f.lit(25)).\
when(f.col('Country')== 'Latvia',f.lit(26)).\
when(f.col('Country')== 'Canada',f.lit(27)).\
when(f.col('Country')== 'Czech Republic',f.lit(28)).\
when(f.col('Country')== 'Brazil',f.lit(29)).\
when(f.col('Country')== 'Slovenia',f.lit(30)).\
when(f.col('Country')== 'Japan',f.lit(31)).\
when(f.col('Country')== 'New Zealand',f.lit(32)).\
when(f.col('Country')== 'Bosnia and Herzegovina',f.lit(33)).\
when(f.col('Country')== 'Poland',f.lit(34)).\
when(f.col('Country')== 'Portugal',f.lit(35)).\
when(f.col('Country')== 'Australia',f.lit(36)).\
when(f.col('Country')== 'Romania',f.lit(37)).\
when(f.col('Country')== 'Bulgaria',f.lit(38)).\
when(f.col('Country')== 'Austria',f.lit(39)).\
when(f.col('Country')== 'Costa Rica',f.lit(40)).\
when(f.col('Country')== 'South Africa',f.lit(41)).\
when(f.col('Country')== 'Colombia',f.lit(42)).\
when(f.col('Country')== 'Hungary',f.lit(43)).\
when(f.col('Country')== 'United Kingdom',f.lit(44)).\
when(f.col('Country')== 'Moldova',f.lit(45)).\
when(f.col('Country')== 'Netherlands',f.lit(46)).\
otherwise(f.col('Country')))


new_result = result.withColumn('self_employed',f.when(f.col('self_employed')== 'NA',f.lit(0)).\
when(f.col('self_employed')== 'No',f.lit(1)).\
when(f.col('self_employed')== 'Yes',f.lit(2)).\
otherwise(f.col('self_employed')))


new_result = result.withColumn('family_history',f.when(f.col('family_history')== 'No',f.lit(0)).\
when(f.col('family_history')== 'Yes',f.lit(1)).\
otherwise(f.col('family_history')))


new_result = result.withColumn('treatment',f.when(f.col('treatment')== 'No',f.lit(0)).\
when(f.col('treatment')== 'Yes',f.lit(1)).\
otherwise(f.col('treatment')))


new_result = result.withColumn('work_interfere',f.when(f.col('work_interfere')== 'Sometimes',f.lit(2)).\
when(f.col('work_interfere')== 'Rarely',f.lit(1)).\
when(f.col('work_interfere')== 'Often',f.lit(3)).\
when(f.col('work_interfere')== 'Never',f.lit(0)).\
otherwise(f.col('work_interfere')))


new_result = result.withColumn('remote_work',f.when(f.col('remote_work')== 'No',f.lit(0)).\
when(f.col('remote_work')== 'Yes',f.lit(1)).\
otherwise(f.col('remote_work')))


new_result = result.withColumn('tech_company',f.when(f.col('tech_company')== 'No',f.lit(0)).\
when(f.col('tech_company')== 'Yes',f.lit(1)).\
otherwise(f.col('tech_company')))


new_result = result.withColumn('benefits',f.when(f.col('benefits')== 'No',f.lit(0)).\
when(f.col('benefits')== 'Yes',f.lit(1)).\
when(f.col('benefits')== "Don't know",f.lit(2)).\
otherwise(f.col('benefits')))


new_result = result.withColumn('care_options',f.when(f.col('care_options')== 'No',f.lit(0)).\
when(f.col('care_options')== 'Yes',f.lit(1)).\
when(f.col('care_options')== "Not sure",f.lit(2)).\
otherwise(f.col('care_options')))


new_result = result.withColumn('wellness_program',f.when(f.col('wellness_program')== 'No',f.lit(0)).\
when(f.col('wellness_program')== 'Yes',f.lit(1)).\
when(f.col('wellness_program')== "Don't know",f.lit(2)).\
otherwise(f.col('wellness_program')))


new_result = result.withColumn('seek_help',f.when(f.col('seek_help')== 'No',f.lit(0)).\
when(f.col('seek_help')== 'Yes',f.lit(1)).\
when(f.col('seek_help')== "Don't know",f.lit(2)).\
otherwise(f.col('seek_help')))


new_result = result.withColumn('anonymity',f.when(f.col('anonymity')== 'No',f.lit(0)).\
when(f.col('anonymity')== 'Yes',f.lit(1)).\
when(f.col('anonymity')== "Don't know",f.lit(2)).\
otherwise(f.col('anonymity')))


new_result = result.withColumn('leave',f.when(f.col('leave')== 'Somewhat difficult',f.lit(0)).\
when(f.col('leave')== 'Somewhat easy',f.lit(1)).\
when(f.col('leave')== "Don't know",f.lit(2)).\
when(f.col('leave')== "Very difficult",f.lit(3)).\
when(f.col('leave')== "Very easy",f.lit(4)).\
otherwise(f.col('leave')))


new_result = result.withColumn('mental_health_consequence',f.when(f.col('mental_health_consequence')== 'No',f.lit(0)).\
when(f.col('mental_health_consequence')== 'Yes',f.lit(1)).\
when(f.col('mental_health_consequence')== "Maybe",f.lit(2)).\
otherwise(f.col('mental_health_consequence')))


new_result = result.withColumn('phys_health_consequence',f.when(f.col('phys_health_consequence')== 'No',f.lit(0)).\
when(f.col('phys_health_consequence')== 'Yes',f.lit(1)).\
when(f.col('phys_health_consequence')== "Maybe",f.lit(2)).\
otherwise(f.col('phys_health_consequence')))


new_result = result.withColumn('coworkers',f.when(f.col('coworkers')== 'No',f.lit(0)).\
when(f.col('coworkers')== 'Yes',f.lit(1)).\
when(f.col('coworkers')== "Some of them",f.lit(2)).\
otherwise(f.col('coworkers')))


new_result = result.withColumn('supervisor',f.when(f.col('supervisor')== 'No',f.lit(0)).\
when(f.col('supervisor')== 'Yes',f.lit(1)).\
when(f.col('supervisor')== "Some of them",f.lit(2)).\
otherwise(f.col('supervisor')))


new_result = result.withColumn('mental_vs_physical',f.when(f.col('mental_vs_physical')== 'No',f.lit(0)).\
when(f.col('mental_vs_physical')== 'Yes',f.lit(1)).\
when(f.col('mental_vs_physical')== "Don't know",f.lit(2)).\
otherwise(f.col('mental_vs_physical')))


new_result = result.withColumn('obs_consequence',f.when(f.col('obs_consequence')== 'No',f.lit(0)).\
when(f.col('obs_consequence')== 'Yes',f.lit(1)).\
otherwise(f.col('obs_consequence')))


new_result = result.withColumn('mental_issue_in_tech',f.when(f.col('mental_issue_in_tech')== False, 0).otherwise(1))
new_result.show()

Answer 1

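You're overwriting new_result each time you encode a column. Every statement calls withColumn on the original result, so each assignment discards the previous one, and only the last change (mental_issue_in_tech) is left when you call .show():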

# new_result assigned
new_result = result.withColumn('Gender',f.when(f.col('Gender')== 'Male',f.lit(0)).\
when(f.col('Gender')== 'Female',f.lit(1)).\
when(f.col('Gender')== 'Other',f.lit(2)).\
otherwise(f.col('Gender')))

# Previously assigned new_result overwritten!!
new_result = result.withColumn('Country',f.when(f.col('Country')== 'Russia',f.lit(0)).\
when(f.col('Country')== 'Bahamas The',f.lit(1)).\
when(f.col('Country')== 'Sweden',f.lit(2)).\
when(f.col('Country')== 'Philippines',f.lit(3)).\
...
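
To see the effect without a Spark session, here is a minimal sketch in plain Python. The Frame class and its with_column method are a hypothetical toy stand-in for a DataFrame, not PySpark's API; the only thing it shares with withColumn is that it returns a new object and leaves the original untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Toy stand-in for an immutable DataFrame: a single row stored as a dict."""
    row: dict

    def with_column(self, name, value):
        # Like withColumn, this returns a new object and leaves self unchanged.
        return Frame({**self.row, name: value})


result = Frame({"Gender": "Male", "Country": "Russia"})

# Pattern from the question: every step starts again from `result`.
overwritten = result.with_column("Gender", 0)
overwritten = result.with_column("Country", 0)  # the Gender change is discarded

# Every step starts from the previous output instead.
chained = result.with_column("Gender", 0)
chained = chained.with_column("Country", 0)

print(overwritten.row)  # {'Gender': 'Male', 'Country': 0}
print(chained.row)      # {'Gender': 0, 'Country': 0}
```

In the first pattern only the last column is changed, which matches what .show() reports in the question.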

Do something like this instead:

# Make new_result point to result
new_result = result

# Now you can reassign to the same df each time
new_result = new_result.withColumn('Gender',f.when(f.col('Gender')== 'Male',f.lit(0)).\
when(f.col('Gender')== 'Female',f.lit(1)).\
when(f.col('Gender')== 'Other',f.lit(2)).\
otherwise(f.col('Gender')))

# Reassigning again...
new_result = new_result.withColumn('Country',f.when(f.col('Country')== 'Russia',f.lit(0)).\
when(f.col('Country')== 'Bahamas The',f.lit(1)).\
when(f.col('Country')== 'Sweden',f.lit(2)).\
when(f.col('Country')== 'Philippines',f.lit(3)).\
...

Answer 1 (accepted, 2020-05-21 13:01:36)
