
How to create a new column in Spark using Python, based on another column?

My database contains one column of strings. I want to create a new column from part of the string in that column. For example:

         "content"                             "other column"
The father has two dogs                            father
One cat stay at home of my mother                  mother
etc.                                               etc.

My idea was to create an array of the words I am interested in. For example: people=[mother,father,etc.]

Then I iterate over the "content" column and extract the matching word to insert into the new column:



import pandas as pd

def extract_people(df):
    column = []
    people = ["mother", "father"]  # etc.
    for row in df.select("content").collect():
        for word in people:
            if str(row).find(word):
                column.append(word)
                break
    return pd.Series(column)


f_pyspark = df_pyspark.withColumn('people', extract_people(df_pyspark))

This code doesn't work and gives me this error on the collect():

22/01/26 11:34:04 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 36)
java.lang.OutOfMemoryError: Java heap space

Maybe that is because my file is too large: it has 15 million rows. How can I create the new column in a different way?
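
A side note on the matching logic in the loop, independent of the memory error: `str.find` returns -1 when the word is absent, and -1 is truthy in Python, so the `if` test passes for the first word in the list even when that word is not in the text. The sketch below reproduces this in plain Python, with no Spark involved. The sample sentence and the two-word list are taken from the question, and the variable names `buggy` and `fixed` are only illustrative.

```python
# Plain-Python illustration of the matching logic (no Spark needed).
people = ["mother", "father"]  # sample keywords from the question
text = "The father has two dogs"

# str.find returns -1 (truthy) when the word is absent, so this test
# passes for "mother" even though it does not occur in the text.
buggy = next((w for w in people if text.find(w)), None)

# The `in` operator returns a real boolean, so only true matches pass.
fixed = next((w for w in people if w in text), None)

print(buggy)  # mother
print(fixed)  # father
```

Fixing the test this way would not remove the out-of-memory error, because collect() still brings all rows to the driver.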

Using the following dataframe as an example:

+---------------------------------+
|content                          |
+---------------------------------+
|Thefatherhas two dogs            |
|The fatherhas two dogs           |
|Thefather has two dogs           |
|Thefatherhastwodogs              |
|One cat stay at home of my mother|
|One cat stay at home of mymother |
|Onecatstayathomeofmymother       |
|etc.                             |
|my feet smell                    |
+---------------------------------+

You can do the following:

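from pyspark.sql import functions

arr = ["father", "mother", "etc."]

# Build a single SQL CASE expression; the first matching WHEN branch wins.
expression = (
    "CASE "
    + "".join("WHEN content LIKE '%{}%' THEN '{}' ".format(val, val) for val in arr)
    + "ELSE 'None' END"
)

# The expression is evaluated by Spark on the executors, so nothing is collected to the driver.
df = df.withColumn("other_column", functions.expr(expression))
df.show(truncate=False)  # truncate=False prints the full strings instead of cutting them at 20 characters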
+---------------------------------+------------+
|content                          |other_column|
+---------------------------------+------------+
|Thefatherhas two dogs            |father      |
|The fatherhas two dogs           |father      |
|Thefather has two dogs           |father      |
|Thefatherhastwodogs              |father      |
|One cat stay at home of my mother|mother      |
|One cat stay at home of mymother |mother      |
|Onecatstayathomeofmymother       |mother      |
|etc.                             |etc.        |
|my feet smell                    |None        |
+---------------------------------+------------+
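
If the keyword list changes often, the string building can be wrapped in a small helper. The following is a minimal sketch and not part of the original answer: the function name `build_case_expression` and its parameters are hypothetical. It assumes the keywords are plain words, and it rejects quotes, backslashes, and the LIKE wildcards `%` and `_` rather than attempting to escape them.

```python
def build_case_expression(column, keywords, default="None"):
    """Build a Spark SQL CASE expression returning the first keyword found in `column`.

    Hypothetical helper for illustration; it only produces a string.
    """
    for kw in keywords:
        # Reject characters that would change the meaning of the SQL string
        # literal or the LIKE pattern instead of trying to escape them.
        if any(ch in kw for ch in "'\\%_"):
            raise ValueError("unsupported character in keyword: {!r}".format(kw))
    branches = "".join(
        "WHEN {0} LIKE '%{1}%' THEN '{1}' ".format(column, kw) for kw in keywords
    )
    return "CASE {0}ELSE '{1}' END".format(branches, default)


expression = build_case_expression("content", ["father", "mother", "etc."])
print(expression)
```

The resulting string is the same one built inline above, so it can be passed to functions.expr(...) inside withColumn in the same way. Note that the default 'None' is the literal string "None", not a SQL NULL.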
