Remove “regex duplicates” from an ArrayList in Java
I want to "clean" an ArrayList in Java. Here is the explanation.
Assuming we have this list:
a = ["a_12_b", "a_13_b", "a_13bis_b", "a_14_b", "a_14_new_b"]
In this list, "a_13bis_b" and "a_14_new_b" are considered duplicates (of "a_13_b" and "a_14_b" respectively). Why? Because each entry is expected to match this pattern:
a_ "a string with length = 2" _b
The output should be:
a = ["a_12_b", "a_13_b", "a_14_b"]
I used this simple code, but it returns the wrong output:
for (int j = 0; j < list.size(); j++) {
    // basically, cleanEntry removes the a_ and _b
    String value1 = cleanEntry(list.get(j));
    for (int k = 0; k < list.size(); k++) {
        String value2 = cleanEntry(list.get(k));
        if (k != j && value1.equalsIgnoreCase(value2)) {
            duplicates.add(list.get(k));
            list.remove(k);
        }
    }
}
Any help?
You could use the stream map method with a regular expression to "normalize" the strings to a common format, and then collect the normalized strings into a set.
Something like this:
List<String> a = Arrays.asList("a_12_b", "a_13_b", "a_13bis_b", "a_14_b", "a_14_new_b");
Set<String> uniques = a.stream()
        .map(s -> s.replaceAll("^([a-z]_\\d{2})[^\\d].+(_[a-z])$", "$1$2"))
        .collect(Collectors.toSet());
System.out.println(uniques);
This prints:
[a_14_b, a_13_b, a_12_b]
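Note that `Collectors.toSet()` makes no guarantee about iteration order, so the printed order can differ from the input order. If the original order matters, the same idea can be collected into a `LinkedHashSet` instead. This is only a minimal sketch: the class name `OrderedUniques`, the constant `EXTRA`, and the helper `normalizeAll` are names made up for illustration; the regular expression is the one from the snippet above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class OrderedUniques {

    // Same regular expression as above: keep the prefix with its two digits
    // and the suffix, and drop whatever sits between them.
    static final String EXTRA = "^([a-z]_\\d{2})[^\\d].+(_[a-z])$";

    // Hypothetical helper: normalizes every entry and keeps first-seen order.
    static Set<String> normalizeAll(List<String> entries) {
        return entries.stream()
                .map(s -> s.replaceAll(EXTRA, "$1$2"))
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("a_12_b", "a_13_b", "a_13bis_b", "a_14_b", "a_14_new_b");
        System.out.println(normalizeAll(a)); // [a_12_b, a_13_b, a_14_b]
    }
}
```

Entries that already have the short form, such as "a_12_b", do not match the expression at all, so `replaceAll` returns them unchanged.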
Solution for Java 6 and 7:
List<String> a = Arrays.asList("a_12_b", "a_13_b", "a_13bis_b", "a_14_b", "a_14_new_b");
Set<String> set = new LinkedHashSet<String>(); // explicit type argument: the diamond operator <> needs Java 7
for (String s : a) {
    set.add(s.replaceAll("^([a-z]_\\d{2})[^\\d].+(_[a-z])$", "$1$2"));
}
System.out.println(set);
Result:
[a_12_b, a_13_b, a_14_b]
If you need more than 2 numeric characters, you can change the quantifier in the regular expression. Here is an example with its result:
List<String> a = Arrays.asList("a_12345678901234567890123456_b", "a_13345678901234567890123456_b",
        "a_13345678901234567890123456bis_b", "a_14345678901234567890123456_b", "a_14345678901234567890123456_new_b");
Set<String> set = new LinkedHashSet<String>(); // explicit type argument: the diamond operator <> needs Java 7
for (String s : a) {
    set.add(s.replaceAll("^([a-z]_\\d{26})[^\\d].+(_[a-z])$", "$1$2"));
}
System.out.println(set);
Result:
[a_12345678901234567890123456_b, a_13345678901234567890123456_b, a_14345678901234567890123456_b]
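Rather than editing the quantifier by hand each time, the digit count could be passed in as a parameter. The following is a rough sketch under that assumption: `DigitCountNormalizer`, `patternFor`, and `normalizeAll` are hypothetical names, and the three-digit sample data is invented purely for the demonstration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DigitCountNormalizer {

    // Hypothetical helper: builds the same pattern as above for a given digit count.
    static String patternFor(int digits) {
        return "^([a-z]_\\d{" + digits + "})[^\\d].+(_[a-z])$";
    }

    // Normalizes every entry and keeps insertion order, as in the loop above.
    static Set<String> normalizeAll(List<String> entries, int digits) {
        String regex = patternFor(digits);
        Set<String> set = new LinkedHashSet<String>();
        for (String s : entries) {
            set.add(s.replaceAll(regex, "$1$2"));
        }
        return set;
    }

    public static void main(String[] args) {
        // Invented sample data with three digits per entry.
        List<String> a = Arrays.asList("a_123_b", "a_123bis_b", "a_124_b", "a_124_new_b");
        System.out.println(normalizeAll(a, 3)); // [a_123_b, a_124_b]
    }
}
```

Calling `normalizeAll(a, 2)` on the list from the question should give the same result as the Java 6/7 snippet above.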
You can simply discard all characters after the second character of the cleaned entry before comparing. Try this:
for (int j = 0; j < list.size(); j++) {
    // cleanEntry removes the leading a_ and the trailing _b
    String value1 = cleanEntry(list.get(j));
    // start at j + 1 so the first occurrence is kept and only later ones are removed
    for (int k = j + 1; k < list.size(); k++) {
        String value2 = cleanEntry(list.get(k));
        if (value1.substring(0, 2).equalsIgnoreCase(value2.substring(0, 2))) {
            duplicates.add(list.get(k));
            list.remove(k);
            k--; // the next element has shifted into index k, so check it as well
        }
    }
}
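The nested loops compare every pair of entries. A single pass that remembers the two-character keys already seen is another option, sketched below. This is an assumption-laden illustration, not the asker's actual code: `cleanEntry` here is a stand-in that simply strips the first two and last two characters (so every entry must be at least six characters long), and `SinglePassCleanup` and `removeDuplicates` are hypothetical names. Lower-casing the key is used as an approximation of `equalsIgnoreCase`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class SinglePassCleanup {

    // Assumed stand-in for the question's cleanEntry: strips "a_" and "_b".
    static String cleanEntry(String entry) {
        return entry.substring(2, entry.length() - 2);
    }

    // Removes later entries whose first two cleaned characters were already
    // seen, and returns the removed entries.
    static List<String> removeDuplicates(List<String> list) {
        Set<String> seen = new HashSet<String>();
        List<String> duplicates = new ArrayList<String>();
        for (Iterator<String> it = list.iterator(); it.hasNext();) {
            String entry = it.next();
            String key = cleanEntry(entry).substring(0, 2).toLowerCase(Locale.ROOT);
            if (!seen.add(key)) { // add returns false when the key is already present
                duplicates.add(entry);
                it.remove();      // safe way to remove while iterating
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Arrays.asList alone is fixed-size, so wrap it in a mutable ArrayList.
        List<String> list = new ArrayList<String>(
                Arrays.asList("a_12_b", "a_13_b", "a_13bis_b", "a_14_b", "a_14_new_b"));
        List<String> duplicates = removeDuplicates(list);
        System.out.println(list);       // [a_12_b, a_13_b, a_14_b]
        System.out.println(duplicates); // [a_13bis_b, a_14_new_b]
    }
}
```

Using `Iterator.remove()` avoids the index bookkeeping that removing by position inside a `for` loop requires.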