Extracting a specific string from a column's content using Apache Spark Java
I have this column in my dataframe:
+--------------------------------------------------------------------+
| _classname |
+--------------------------------------------------------------------+
|de.abs.org.vertrag.util.test.Elementofthetable |
|de.abs.org.vertrag.util.test.classes.Apachesparkproject |
+--------------------------------------------------------------------+
This is the dataframe I expect to get:
+--------------------------------------------------------------------+
| _classname |
+--------------------------------------------------------------------+
|de.abs.org.vertrag.util.test |
|de.abs.org.vertrag.util.test.classes |
+--------------------------------------------------------------------+
I want to get rid of the class name at the end of every package, i.e. everything after the last dot (.). Here is my code:
teamNamesDf = teamNamesDf.select(col("_classname"), substring_index(col("_classname"), ".\\[^A-Z]+", -1).as("testSuiteName"));
This didn't work for me. Does anyone know how to do that using Apache Spark Java?
You can try this regex, which extracts everything before the final dot (.):
Dataset<Row> result = df.select(regexp_extract(col("_classname"), "^(.*)\\.[^\\.]*$", 1).as("_classname"));
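To see what the pattern captures without starting a Spark session, here is a minimal plain-Java sketch. It relies on the fact that regexp_extract uses Java regular expressions; the class name PackageNameRegexSketch and the method packageOf are hypothetical helpers made up for this illustration, not part of Spark.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only; not part of Spark.
class PackageNameRegexSketch {
    // Same pattern as in the regexp_extract call: a greedy group up to the last dot,
    // followed by a final segment that contains no dots.
    private static final Pattern PACKAGE = Pattern.compile("^(.*)\\.[^\\.]*$");

    // Returns capture group 1, or "" when the input does not match
    // (regexp_extract also returns an empty string for non-matching input).
    static String packageOf(String className) {
        Matcher m = PACKAGE.matcher(className);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // The two sample values from the question.
        System.out.println(packageOf("de.abs.org.vertrag.util.test.Elementofthetable"));
        System.out.println(packageOf("de.abs.org.vertrag.util.test.classes.Apachesparkproject"));
    }
}
```

Because the group is greedy, the match backtracks only as far as the last dot, so everything before it is kept. A value with no dot at all does not match and yields an empty string.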
Another way is to use substring_index with a count that depends on the number of dots (.) in the string:
Dataset<Row> result = df.selectExpr("substring_index(_classname, '.', size(split(_classname, '\\\\.')) - 1)");
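The counting logic behind this expression can be sketched in plain Java. This is a simplified model, not Spark's implementation: the class SubstringIndexSketch and its methods are hypothetical, and substringIndex handles only non-negative counts (the real substring_index also accepts negative counts, which count from the right).

```java
// Hypothetical helper class for illustration only; a simplified model of the SQL expression.
class SubstringIndexSketch {
    // Returns the part of s before the count-th occurrence of delim.
    // Assumption: only count >= 0 is modelled here.
    static String substringIndex(String s, String delim, int count) {
        if (count < 0) {
            throw new IllegalArgumentException("negative counts are not modelled in this sketch");
        }
        if (count == 0) {
            return "";
        }
        int idx = -1;
        for (int i = 0; i < count; i++) {
            idx = s.indexOf(delim, idx + 1);
            if (idx < 0) {
                return s; // fewer than count delimiters: keep the whole string
            }
        }
        return s.substring(0, idx);
    }

    // Mirrors substring_index(_classname, '.', size(split(_classname, '\\.')) - 1).
    static String packageOf(String className) {
        // Limit -1 keeps trailing empty segments when counting.
        int segments = className.split("\\.", -1).length;
        return substringIndex(className, ".", segments - 1);
    }

    public static void main(String[] args) {
        System.out.println(packageOf("de.abs.org.vertrag.util.test.Elementofthetable"));
        System.out.println(packageOf("de.abs.org.vertrag.util.test.classes.Apachesparkproject"));
    }
}
```

A string with n dot-separated segments contains n - 1 dots, so cutting at the (n - 1)-th dot drops exactly the last segment. Note that the delimiter passed to substring_index is a literal string, while the one passed to split is a regular expression, which is why only the latter is escaped.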