Extracting multiple columns from column in PySpark DataFrame using named regex

Suppose I have a DataFrame df in pySpark of the following form:

| id | type | description                                                   |
|  1 | "A"  | "Date: 2018/01/01\nDescr: This is a test des\ncription\n"    |
|  2 | "B"  | "Date: 2018/01/02\nDescr: Another test descr\niption\n"      |
|  3 | "A"  | "Date: 2018/01/03\nWarning: This is a warnin\ng, watch out\n" |

which is of course a dummy set, but will suffice for this example.
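For reference, the dummy set can be rebuilt with something along these lines (assuming an active SparkSession named spark; the raw strings store a literal backslash-n, which is how the show() output in the answer below renders the descriptions; drop the r prefix if the real data holds actual newlines):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Raw strings keep backslash-n as two literal characters, matching
# the rdd.first() output shown in the answer below.
df = spark.createDataFrame(
    [(1, 'A', r'Date: 2018/01/01\nDescr: This is a test des\ncription\n'),
     (2, 'B', r'Date: 2018/01/02\nDescr: Another test descr\niption\n'),
     (3, 'A', r'Date: 2018/01/03\nWarning: This is a warnin\ng, watch out\n')],
    ['id', 'type', 'description'])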

I have made a regex pattern with named groups that can be used to extract the relevant information from the description field, something along the lines of:

^(?:(?:Date: (?P<DATE>.+?)\n)|(?:Descr: (?P<DESCR>.+?)\n)|(?:Warning: (?P<WARNING>.+?)\n))+$

Again, this is a dummy regex; the actual pattern is somewhat more elaborate, but the purpose is to capture three possible groups:

| DATE       | DESCR                        | WARNING                        |
| 2018/01/01 | This is a test des\ncription | None                           |
| 2018/01/02 | Another test descr\niption   | None                           |
| 2018/01/03 | None                         | This is a warnin\ng, watch out |

Now I would want to add the columns that result from the regex match to the original DataFrame (i.e. combine the two dummy tables in this question into one).

I have tried several ways to accomplish this, but none has led to a full solution yet. One thing I've tried is:

import re

def extract_fields(string):
    patt = <ABOVE_PATTERN>
    match = re.match(patt, string, re.DOTALL)
    # Guard against the None problem when no match can be made;
    # I'm using pandas' .str.extract as a work-around for this now.
    return match.groupdict() if match else {}

df.rdd.map(lambda x: extract_fields(x.description))

This will yield the second table, but I see no way to combine it with the original columns from df. I have tried to construct a new Row(), but then I run into problems with the column ordering required by the Row() constructor (and the fact that I cannot hard-code the column names that the regex groups will add), resulting in a DataFrame whose columns are all jumbled up. How can I achieve what I want, i.e. one DataFrame with six columns: id, type, description, DATE, DESCR and WARNING?
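For what it's worth, the ordering problem can be sidestepped by appending the extracted values to the row tuple and letting toDF() assign the column names. A minimal sketch, assuming the descriptions contain real newlines as the pattern expects; PATT, fields and with_extracted are illustrative names, not from the original post:

import re

PATT = r'^(?:(?:Date: (?P<DATE>.+?)\n)|(?:Descr: (?P<DESCR>.+?)\n)|(?:Warning: (?P<WARNING>.+?)\n))+$'

# The compiled pattern knows its own group names, so the new column
# names never have to be hard-coded.
fields = list(re.compile(PATT, re.DOTALL).groupindex)

def with_extracted(row):
    m = re.match(PATT, row.description, re.DOTALL)
    groups = m.groupdict() if m else {}
    # Row objects are tuples, so this keeps the original columns
    # first and in their original order.
    return tuple(row) + tuple(groups.get(f) for f in fields)

result = df.rdd.map(with_extracted).toDF(df.columns + fields)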

Remark. Actually, the description field is not just one field, but several columns. Using concat_ws, I have concatenated these columns into a new column description, with the individual fields separated by \n, but maybe this can be incorporated in a nicer way.
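A minimal sketch of that concatenation step; the source column names date_col and descr_col are made up for illustration:

from pyspark.sql import functions as F

# concat_ws skips null columns, so a missing field simply drops out
# of the combined description instead of leaving an empty line.
df = df.withColumn('description', F.concat_ws('\n', 'date_col', 'descr_col'))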

I think you can use Pandas features for this case. First I convert df to an rdd in order to split the description field, then I pull the result into a Pandas df and create a Spark df from it. This works regardless of the number of columns inside the description field.

>>> import pandas as pd
>>> import re
>>> 
>>> df.show(truncate=False)
+---+----+-----------------------------------------------------------+
|id |type|description                                                |
+---+----+-----------------------------------------------------------+
|1  |A   |Date: 2018/01/01\nDescr: This is a test des\ncription\n    |
|2  |B   |Date: 2018/01/02\nDescr: Another test desc\niption\n       |
|3  |A   |Date: 2018/01/03\nWarning: This is a warnin\ng, watch out\n|
+---+----+-----------------------------------------------------------+

>>> #convert df to rdd
>>> rdd = df.rdd.map(list)
>>> rdd.first()
[1, 'A', 'Date: 2018/01/01\\nDescr: This is a test des\\ncription\\n']
>>> 
>>> #split description field
>>> rddSplit = rdd.map(lambda x: (x[0],x[1],re.split('\n(?=[A-Z])', x[2].encode().decode('unicode_escape'))))
>>> rddSplit.first()
(1, 'A', ['Date: 2018/01/01', 'Descr: This is a test des\ncription\n'])
>>> 
>>> #create empty Pandas df
>>> df1 = pd.DataFrame()
>>> 
>>> #insert rows
>>> for row in rddSplit.collect():
...     a = {i.split(':', 1)[0].strip(): i.split(':', 1)[1].strip('\n').replace('\n', '\\n').strip() for i in row[2]}
...     a['id'] = row[0]
...     a['type'] = row[1]
...     df2 = pd.DataFrame([a], columns=a.keys())
...     df1 = pd.concat([df1, df2])
... 
>>> df1
         Date                         Descr                         Warning  id type
0  2018/01/01  This is a test des\ncription                             NaN   1    A
0  2018/01/02     Another test desc\niption                             NaN   2    B
0  2018/01/03                           NaN  This is a warnin\ng, watch out   3    A
>>>
>>> #create spark df
>>> df3 = spark.createDataFrame(df1.fillna('')).replace('',None)
>>> df3.show(truncate=False)
+----------+----------------------------+------------------------------+---+----+
|Date      |Descr                       |Warning                       |id |type|
+----------+----------------------------+------------------------------+---+----+
|2018/01/01|This is a test des\ncription|null                          |1  |A   |
|2018/01/02|Another test desc\niption   |null                          |2  |B   |
|2018/01/03|null                        |This is a warnin\ng, watch out|3  |A   |
+----------+----------------------------+------------------------------+---+----+
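Note that collect() pulls every row to the driver, so for larger data a pure-DataFrame variant may scale better. A sketch using regexp_extract, reusing the capital-letter boundary heuristic from the split above; the patterns dict is made up for illustration, \\n matches the literal backslash-n stored in this data (use \n for real newlines), and empty non-matches are mapped back to null:

from pyspark.sql import functions as F

patterns = {
    'Date': r'Date: (.+?)\\n(?=[A-Z]|$)',
    'Descr': r'Descr: (.+?)\\n(?=[A-Z]|$)',
    'Warning': r'Warning: (.+?)\\n(?=[A-Z]|$)',
}

result = df
for name, patt in patterns.items():
    extracted = F.regexp_extract('description', patt, 1)
    # regexp_extract returns '' when there is no match; map that back to null.
    result = result.withColumn(name, F.when(extracted != '', extracted))

result.show(truncate=False)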
