Extract required data between characters from a string using regex in Python or PySpark
I want to extract the data between certain characters from the string values in the rows of a DataFrame column. For example, the column contains data like this:
+----------------------------------------------------+
| Azure|
+----------------------------------------------------+
|{ref=[As Tailwind Traders gets, started with Azure]}|
|{ref=first steps} |
|{ref=will be to create} |
|{ref=at least one Azure subscription} |
+----------------------------------------------------+
I want to transform it like this:
+----------------------------------------------------+
| Azure|
+----------------------------------------------------+
|As Tailwind Traders gets, started with Azure |
|first steps |
|will be to create |
|at least one Azure subscription |
+----------------------------------------------------+
So I need to extract the data between "[]", and also handle the rows that hold a single element, then put the result back into the same column or a new one using PySpark/Python regex. The things to remove are 'ref=' and the outer '{}'.
Note: I tried the regexp_replace function, but it also replaces any [] and {} inside the required data.
So how can I achieve this using regex in PySpark?
You can use the following pattern, with \1 as the substitution string (in PySpark's regexp_replace, which uses Java regex syntax, the group reference is written $1).
"{ref=\[?([,\w\s]+)\]?\}"gm
See https://regex101.com/r/OyFBkJ/1
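As a minimal sketch of how the pattern above could be applied, the snippet below runs it over the sample rows with Python's `re` module. The helper names `strip_ref` and `strip_ref_column` are made up for illustration, and the leading `{` is escaped here, which is an adjustment to the pattern as posted: Python accepts it either way, but Java-based engines such as Spark's may reject an unescaped `{`. The PySpark variant assumes `pyspark` is installed and that the DataFrame has a string column named `Azure`.

```python
import re

# Same pattern as in the answer, with the leading "{" escaped (an adjustment,
# so the pattern is also valid in Java regex syntax, which Spark uses).
PATTERN = r"\{ref=\[?([,\w\s]+)\]?\}"


def strip_ref(value: str) -> str:
    """Keep only the text inside {ref=...} or {ref=[...]}.

    Strings that do not match the pattern are returned unchanged.
    """
    return re.sub(PATTERN, r"\1", value)


rows = [
    "{ref=[As Tailwind Traders gets, started with Azure]}",
    "{ref=first steps}",
    "{ref=will be to create}",
    "{ref=at least one Azure subscription}",
]
cleaned = [strip_ref(row) for row in rows]
for line in cleaned:
    print(line)


def strip_ref_column(df, column="Azure"):
    """Hypothetical PySpark helper: apply the same replacement to a column.

    Assumes pyspark is installed; the import is local so the plain-Python
    part above runs without it. Spark expects "$1" rather than "\\1".
    """
    from pyspark.sql import functions as F

    return df.withColumn(column, F.regexp_replace(F.col(column), PATTERN, "$1"))
```

Because the capture group only allows word characters, whitespace, and commas, a value containing other punctuation would not match and would be left as-is; widening the character class is one way to handle such rows.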