What is the most efficient way of splitting a string and ensuring there are no duplicates in the resulting array?
I am splitting a JavaScript string into an array whose elements contain only sequences of Cyrillic characters.
var text = "где по его проекту был реализован первый в мире компьютер с хранимой в памяти программой — ACE."
text=text.toLowerCase();
var re = /[^йцукенгшщзхъёэждлорпавыфячсмитьбю]+/;
var words = text.split(re);
In the above snippet, words will contain the following:
["где", "по", "его", "проекту", "был", "реализован", "первый", "в", "мире", "компьютер", "с", "хранимой", "в", "памяти", "программой", ""]
I need to remove the duplicates from the array; namely, "в" should appear only once. I know I can go through the array after the split and do this, but I'm not sure what the best way is. Is it possible to do this with the split regex?
Jonathan
Not the most efficient, but it's clean and simple.
text.split(re).filter(function(str, idx, txtArray) {
    return txtArray.indexOf(str) === idx;
});
Basically, if the first index found doesn't match the current index in the iteration, it's a duplicate.
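As a worked illustration, here is a minimal sketch that applies the indexOf filter to the sample text from the question. The variable name uniqueWords and the extra check that drops the empty string left by the trailing " — ACE." are assumptions added for this example, not part of the answer above.

```javascript
// Sample text and splitter regex taken from the question.
var text = "где по его проекту был реализован первый в мире компьютер с хранимой в памяти программой — ACE.";
var re = /[^йцукенгшщзхъёэждлорпавыфячсмитьбю]+/;

// uniqueWords is a hypothetical name used only for this sketch.
var uniqueWords = text.toLowerCase().split(re).filter(function (str, idx, txtArray) {
    // Keep a string only if it is non-empty (assumption: empty pieces are
    // unwanted) and its first occurrence is at the current index.
    return str !== "" && txtArray.indexOf(str) === idx;
});
```

With this input, "в" is kept at its first position and the second occurrence is filtered out.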
You have to go through the array. You can keep track of whether you've already seen a string by using an object as a map, e.g.:
var a = /* ...get the array... */;
var unique = [];
var n, len;
var str;
var seen = {};
for (n = 0, len = a.length; n < len; ++n) {
    str = a[n];
    if (!seen[str]) {
        seen[str] = true;
        unique.push(str);
    }
}
If there's any chance one of the string values may be a name that already exists on objects (so, "toString", "valueOf", "hasOwnProperty", and such), you have to modify the if (!seen[str]) check to use hasOwnProperty instead:
if (!seen.hasOwnProperty(str)) {
...but if the strings are as you've shown, you don't need that. Another alternative is to use a prefix like "xx":
var keystr = "xx" + str;
if (!seen[keystr]) {
    seen[keystr] = true;
    // ...
}
This works because there are no properties on raw objects that start with "xx", and there almost certainly never will be.
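For reference, the own-property variant can be put together as a self-contained function. This is only a sketch: the name uniqueStrings is hypothetical, and calling hasOwnProperty through Object.prototype is a choice made here (not something the answer above prescribes) so that the check keeps working even if the string "hasOwnProperty" itself ends up stored as a key on seen.

```javascript
// uniqueStrings is a hypothetical helper name used only for this sketch.
function uniqueStrings(a) {
    var seen = {};
    var unique = [];
    for (var n = 0, len = a.length; n < len; ++n) {
        var str = a[n];
        // Own-property check, so inherited names such as "toString" are not
        // mistaken for strings that were already seen.
        if (!Object.prototype.hasOwnProperty.call(seen, str)) {
            seen[str] = true;
            unique.push(str);
        }
    }
    return unique;
}

var result = uniqueStrings(["toString", "в", "toString", "valueOf", "hasOwnProperty", "в", "hasOwnProperty"]);
// result is ["toString", "в", "valueOf", "hasOwnProperty"]
```

The first occurrence of each string is kept, so the original order is preserved.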
In a comment you've said:
I guess by efficient I mean the most elegant of idiomatic javascript way to do this.
Interesting, that's not a definition I'd've used. :-) Okay, here's another approach using ES5's filter, which is definitely more JavaScript-y:
var a = /* ...get the array... */;
var seen = {};
a = a.filter(function(str) {
    if (!seen[str]) {
        seen[str] = true;
        return true;
    }
    return false;
});
If you are willing to use a third-party library, I would recommend having a look at Underscore. This library provides a uniq method, which you would apply in the following way:
words = _.uniq(text.split(re));
You can get the "prettiness" of the .indexOf solution using some other built-in functions:
var uniq = Object.keys(text.split(re).reduce(function(words, word) {
    words[word] = null;
    return words;
}, {}));
This will only work in newer versions of JavaScript (that is, not in old versions of IE). Like Mr. Crowder's version, it has the advantage of not being an O(n^2) algorithm. On fairly large strings without many duplicates (say, a page full of text), those .indexOf() calls will start to warm up the client CPU.
Note that this will give you the unique words in no particular order.
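If first-seen order matters and an ES2015 runtime can be assumed (the answers here target ES5, so this is a swapped-in technique rather than anything proposed above), a Set gives a similarly short version. A Set iterates its values in insertion order, so the sketch below keeps each word at the position of its first occurrence. The name uniqInOrder is hypothetical.

```javascript
// Sample text and splitter regex taken from the question.
var text = "где по его проекту был реализован первый в мире компьютер с хранимой в памяти программой — ACE.";
var re = /[^йцукенгшщзхъёэждлорпавыфячсмитьбю]+/;

// Assumption: ES2015 Set and Array.from are available.
// uniqInOrder is a hypothetical name used only for this sketch.
var uniqInOrder = Array.from(new Set(text.toLowerCase().split(re)));
```

Like the split in the question, this still leaves the trailing empty string in the result; filter it out separately if it is not wanted.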
How about using a negative lookahead in the regex and using the .match method to return an array of matches?
([йцукенгшщзхъёэждлорпавыфячсмитьбю]+)(?!.*\1)
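A minimal sketch of how that pattern might be used with .match, assuming the global flag is added so that every match is returned. The name lastOccurrences is hypothetical. Two caveats apply: the lookahead keeps the last occurrence of each word rather than the first, and because \1 is not anchored to word boundaries, a word that also appears as a substring of a later word is dropped as well.

```javascript
// Sample text taken from the question, lowercased as in the original snippet.
var text = "где по его проекту был реализован первый в мире компьютер с хранимой в памяти программой — ACE.".toLowerCase();

// The g flag is an assumption; without it, .match returns only the first match.
// lastOccurrences is a hypothetical name used only for this sketch.
var lastOccurrences = text.match(/([йцукенгшщзхъёэждлорпавыфячсмитьбю]+)(?!.*\1)/g);
```

For this particular text the first "в" is skipped and the second one, after "хранимой", is kept.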
You could do this (splitter: " "):
var m = 'azerty rty aze rty aze'
    .replace(/(^| )([^ ]+)(?= |$)(?=.* \2( |$))/g, '') // removes duplicates
    .match(/[^ ]+/g);
m; // ["azerty", "rty", "aze"]
Surely not the most efficient way, though.