Extract a specific string from column content using Apache Spark Java

Question

I have this column in my dataframe:

+--------------------------------------------------------------------+
|          _classname                                                |
+--------------------------------------------------------------------+
|de.abs.org.vertrag.util.test.Elementofthetable                      |
|de.abs.org.vertrag.util.test.classes.Apachesparkproject             |
+--------------------------------------------------------------------+

I want to get this dataframe as the result:

+--------------------------------------------------------------------+
|           _classname                                               |
+--------------------------------------------------------------------+
|de.abs.org.vertrag.util.test                                        |
|de.abs.org.vertrag.util.test.classes                                |
+--------------------------------------------------------------------+

I want to drop the class name from each value, i.e. everything after the last dot (.), so that only the package remains. Here is my code:

teamNamesDf = teamNamesDf.select(col("_classname"), substring_index(col("_classname"), ".\\[^A-Z]+", -1).as("testSuiteName"));

This didn't work for me. Does anyone know how to do this using Apache Spark Java?

Answer 1

You can try this regex, which extracts everything before the final dot (.):

Dataset<Row> result = df.select(regexp_extract(col("_classname"), "^(.*)\\.[^\\.]*$", 1).as("_classname"));
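
Since regexp_extract is backed by Java regular expressions, the pattern can be sanity-checked in plain Java without a Spark session. This is only a sketch: the class PackagePrefixRegex and the helper packageOf are hypothetical names used for illustration, not part of Spark, and the helper merely imitates regexp_extract with group index 1 (an empty string when nothing matches).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only (not a Spark API).
class PackagePrefixRegex {
    // Same pattern as in the answer: the greedy group 1 captures everything before the last dot.
    private static final Pattern BEFORE_LAST_DOT = Pattern.compile("^(.*)\\.[^\\.]*$");

    // Imitates regexp_extract(col, pattern, 1): group 1, or "" when the pattern does not match.
    static String packageOf(String className) {
        Matcher m = BEFORE_LAST_DOT.matcher(className);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // The two sample values from the question.
        System.out.println(packageOf("de.abs.org.vertrag.util.test.Elementofthetable"));
        System.out.println(packageOf("de.abs.org.vertrag.util.test.classes.Apachesparkproject"));
    }
}
```

Because the first group is greedy, it runs up to the last dot, and the trailing [^\\.]* consumes the class name. A value without any dot does not match at all, so it would come back as an empty string.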

Another way is to use substring_index with a count that depends on the number of dots in the string:

Dataset<Row> result = df.selectExpr("substring_index(_classname, '.', size(split(_classname, '\\\\.')) - 1)");
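
The reasoning behind the count can be sketched in plain Java. substring_index treats its delimiter as a literal string, not a regex, and a negative count such as -1 keeps the part after the last delimiter, which is probably why the attempt in the question did not return the package. Splitting on the dot gives n parts, so keeping the first n - 1 parts is the same as cutting at the last dot. The class PackagePrefixSplit and the helper packageOf below are hypothetical names; this is an imitation of the SQL expression under those assumptions, not Spark code.

```java
import java.util.Arrays;

// Hypothetical helper class, for illustration only (not a Spark API).
class PackagePrefixSplit {
    // Imitates substring_index(s, '.', size(split(s, '\\.')) - 1).
    static String packageOf(String className) {
        // Limit -1 keeps trailing empty parts, so every dot is counted.
        String[] parts = className.split("\\.", -1);
        int keep = parts.length - 1; // number of dot-separated parts to keep
        return String.join(".", Arrays.copyOfRange(parts, 0, keep));
    }

    public static void main(String[] args) {
        // The two sample values from the question.
        System.out.println(packageOf("de.abs.org.vertrag.util.test.Elementofthetable"));
        System.out.println(packageOf("de.abs.org.vertrag.util.test.classes.Apachesparkproject"));
    }
}
```

As with the regex variant, a value that contains no dot yields an empty string, because there are zero parts to keep.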
