Remove duplicates from JavaScript array using Python

Suppose I have a JavaScript array of elements that looks something like this:

var oui = new Array({
    "pfx": "000000",
    "mask": 24,
    "desc": "00:00:00   Officially Xerox, but 0:0:0:0:0:0 is more common"
},{
    "pfx": "000001",
    "mask": 24,
    "desc": "Xerox  Xerox Corporation"
},{
    "pfx": "000002",
    "mask": 24,
    "desc": "Xerox  Xerox Corporation"
},{
    "pfx": "000003",
    "mask": 24,
    "desc": "Xerox  Xerox Corporation"
},{
    "pfx": "000004",
    "mask": 24,
    "desc": "Xerox  Xerox Corporation"
},{
    "pfx": "000004",
    "mask": 24,
    "desc": "Let's pretend this is a repeat"
   });

Imagine now that the file is very large, and some of the "pfx" values are repeated throughout the data set. Manual de-duping is out of the question, so I'm trying to figure out the best way to approach it programmatically. How can I write a Python script that reads the .js file containing this data set and removes any duplicates? In other words, I would like to read in the JS file, parse the array, and produce another JavaScript file with a similar array, but with only unique values for the pfx property.

I've gone through a couple of other Stack Overflow questions that are similar in nature, but nothing quite fits this case. In my Python testing, I can rarely get the pfx values by themselves to remove the duplicates, or Python struggles to read the data in as a proper JSON object (even without the "var" and "new Array" portion). I should also note that the reason I'm doing the de-duping in Python rather than in another JavaScript function within the JS file (which I tried, following examples like this) is that the latter just inflates the size of the JavaScript that has to be loaded onto the page.

In the future, the array is likely to continue to grow. To avoid loading unnecessary JavaScript and to keep page response times quick, I figured this was a step that could, and should, be performed offline, with the result added to the page.

For clarification, here is the website I'm trying to mock up: https://www.wireshark.org/tools/oui-lookup.html. It is very simple in nature.

Research:

Convert Javascript array to python list?

Remove duplicate values from JS array

Since the structure is not nested, you can match the array with a regular expression, parse it as JSON, remove duplicate objects by filtering in Python, and then replace the original array with the deduplicated JSON string.
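As an aside on the parsing trouble described in the question: one likely cause (an assumption, since the failing code was not shown) is that the text left after stripping "var" and "new Array" is a bare comma-separated run of objects, which is not a JSON document on its own. A minimal sketch with made-up sample data:

```python
import json

# The objects as they would appear between `new Array(` and `)`:
# comma-separated, with no enclosing brackets. Sample data only.
body = '{"pfx": "000001", "mask": 24},{"pfx": "000002", "mask": 24}'

try:
    json.loads(body)
    parsed_bare = True
except json.JSONDecodeError:
    # The parser reads the first object and then rejects the remaining text.
    parsed_bare = False

# Wrapping the same text in [ and ] makes it a valid JSON array.
records = json.loads("[" + body + "]")
```

This is why the approach here works on an array literal rather than on the contents of new Array(...).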

Use array literal syntax ([ and ]) rather than new Array to keep things cleaner (best never to use new Array):

import re
import json

js_source = '''
var oui = [{
    "pfx": "000000",
    "mask": 24,
    "desc": "00:00:00   Officially Xerox, but 0:0:0:0:0:0 is more common"
},{
    "pfx": "000001",
    "mask": 24,
    "desc": "Xerox  Xerox Corporation"
},{
    "pfx": "000002",
    "mask": 24,
    "desc": "Xerox  Xerox Corporation"
},{
    "pfx": "000003",
    "mask": 24,
    "desc": "Xerox  Xerox Corporation"
},{
    "pfx": "000004",
    "mask": 24,
    "desc": "Xerox  Xerox Corporation"
},{
    "pfx": "000004",
    "mask": 24,
    "desc": "Let's pretend this is a repeat"
}];
'''

def dedupe(match):
    # Parse the matched array literal as JSON
    records = json.loads(match.group())
    seen_pfxs = set()

    def not_dupe(obj):
        # Keep only the first object seen for each pfx value
        this_pfx = obj['pfx']
        if this_pfx in seen_pfxs:
            return False
        seen_pfxs.add(this_pfx)
        return True

    return json.dumps([obj for obj in records if not_dupe(obj)])

# Match from the opening [ to the closing ] that precedes the semicolon.
# This relies on no string in the data containing a ] character.
deduped_str = re.sub(r'\[[^\]]+\](?=;)', dedupe, js_source)
print(deduped_str)

Output:

var oui = [{"pfx": "000000", "mask": 24, "desc": "00:00:00   Officially Xerox, but 0:0:0:0:0:0 is more common"}, {"pfx": "000001", "mask": 24, "desc": "Xerox  Xerox Corporation"}, {"pfx": "000002", "mask": 24, "desc": "Xerox  Xerox Corporation"}, {"pfx": "000003", "mask": 24, "desc": "Xerox  Xerox Corporation"}, {"pfx": "000004", "mask": 24, "desc": "Xerox  Xerox Corporation"}];
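To cover the file-based workflow from the question (read a .js file, write a de-duplicated one), the same idea can be wrapped in functions. The following is only a sketch under some assumptions: the function names and file paths are hypothetical, the file contains exactly one array, and no string in the data contains a ] character. It also rewrites a new Array(...) wrapper into an array literal first, so the original file does not need to be edited by hand.

```python
import json
import re
from pathlib import Path


def dedupe_js_array(js_source, key="pfx"):
    """Return js_source with its array de-duplicated on `key` (first one wins)."""
    # Assumption: at most one `new Array(...);` in the source. Rewrite it as
    # an array literal so the body can be parsed as JSON.
    normalised = re.sub(r"new Array\((.*)\)\s*;", r"[\1];", js_source, flags=re.S)

    def replace(match):
        records = json.loads(match.group())
        unique = {}
        for record in records:
            # setdefault keeps the first record seen for each key value
            unique.setdefault(record[key], record)
        return json.dumps(list(unique.values()), indent=4)

    # Same pattern as above: an array literal followed by a semicolon.
    return re.sub(r"\[[^\]]+\](?=;)", replace, normalised, count=1)


def dedupe_file(src_path, dst_path, key="pfx"):
    """Read src_path, de-duplicate its array, and write the result to dst_path."""
    source = Path(src_path).read_text(encoding="utf-8")
    Path(dst_path).write_text(dedupe_js_array(source, key), encoding="utf-8")
```

A call such as dedupe_file("oui.js", "oui.dedup.js") (both names are placeholders) would then be run offline, before the page is deployed. A dict is used instead of a set plus filter because dicts preserve insertion order in Python 3.7+, so the output keeps the original ordering.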

If possible, you might consider storing the data in a separate tag rather than in inline JavaScript; it'll be more maintainable. E.g., in your HTML, instead of

var oui = [{
    "pfx": "000000",
    "mask": 24,
    "desc": "00:00:00   Officially Xerox, but 0:0:0:0:0:0 is more common"
},{

consider something like

 var oui = JSON.parse(document.querySelector('[data-oui]').textContent);
 console.log(oui);
 <script data-oui type="application/json">[{ "pfx": "000000", "mask": 24, "desc": "00:00:00 Officially Xerox, but 0:0:0:0:0:0 is more common" }]</script>

Then you don't have to dynamically change the JavaScript, only the contents of the <script data-oui type="application/json"> tag.
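With that layout, the offline step only has to rewrite the tag's contents. A rough sketch of how that might look, assuming the tag is written exactly as above and that no string in the data contains the text </script> (the function name is hypothetical):

```python
import json
import re

# Captures the opening tag, the JSON payload, and the closing tag separately.
TAG_RE = re.compile(
    r'(<script data-oui type="application/json">)(.*?)(</script>)',
    flags=re.S,
)


def dedupe_data_tag(html, key="pfx"):
    """Return html with the data-oui JSON payload de-duplicated on `key`."""

    def replace(match):
        records = json.loads(match.group(2))
        unique = {}
        for record in records:
            # Keep the first record seen for each key value
            unique.setdefault(record[key], record)
        payload = json.dumps(list(unique.values()))
        return match.group(1) + payload + match.group(3)

    return TAG_RE.sub(replace, html, count=1)
```

Because the payload here is plain JSON rather than JavaScript, there is no "var" or trailing semicolon to work around, and json.loads can read it directly.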
