
Regex processing of input data, subsequent visualization using Python and histogram

Currently, I have thousands of records of the following form:

0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000000   82557
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000001   128805
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000002   94990
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000003   121020
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000004   58111390
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000005   167079
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000006   130795
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000007   236926
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000008   24754217
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000009   75407
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000010   136461
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000011   136748
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000012   146258
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000013   381091
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000014   129815

Visualizing a few of these records in a simple spreadsheet program is easy, as shown here:

[image: spreadsheet chart of a few of the records]

I have been trying to adapt the following code to visualize the data, but so far without success:

# Call like this:
# 
# python opcode-farmer.py 'tst21' '6005600401'
# 
import re
import numpy as np
import matplotlib.pyplot as plt
import csv
import sys
import pprint
import itertools 
import subprocess
import collections

def my_test_func(filename, data):
    with open(filename, 'w') as fd:
        fd.write(data)
        fd.write('\n')
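    # universal_newlines=True makes check_output return str instead of bytes
    # on Python 3, so the later status.split('\n') works
    return subprocess.check_output(['evm', 'disasm', filename], universal_newlines=True)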

if '__main__' == __name__:

    file_name = sys.argv[1] 
    byte_code = sys.argv[2]
    status = my_test_func(file_name, byte_code)

    opcodes_list = list()

    for element in status.split('\n'):
        result = re.search(r"\b[A-Z].+", element)
        if result:
            # eliminate individual 0x05 specification 
            simple_opcode = re.sub(r'\s(.*)', '', result.group(0))
            opcodes_list.append(simple_opcode)

    # Count up the values
    cnt = collections.Counter(opcodes_list)
    print(cnt)

    # THRESHOLD: keep only opcodes that occur at least `threshold` times
    threshold = 30
    cnt = collections.Counter({op: n for op, n in cnt.items() if n >= threshold})


    # VISUALIZATION

    # Transpose the data to get the x and y values
    labels, values = zip(*cnt.items())


    # One x position per label: np.arange(n) gives [0 1 2 ... n-1]
    indexes = np.arange(len(labels))
    width = 1

    plt.xlabel("most common opcodes in tx")
    plt.ylabel("number of occurrences")

    # Center each bar on its index so the tick labels line up with the bars
    plt.bar(indexes, values, width, align='center')
    plt.xticks(indexes, labels)
    plt.show()

How can I iterate over the input records shown above, strip the prefix 0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_ from each one, and then plot them as a histogram in Python?

You can try this:

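import re

# Split each line on whitespace and keep only lines with at least two fields
with open('filename.txt') as fh:
    data = [b for b in [re.split(r"\s+", line.strip()) for line in fh] if len(b) > 1]
# Drop everything up to and including the underscore, then convert both fields to int
final_data = [[int(re.sub(r"\w+_", '', a)), int(b)] for a, b in data]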

I ran this code on the data you provided and got this output:

[[0, 82557], [1, 128805], [2, 94990], [3, 121020], [4, 58111390], [5, 167079], [6, 130795], [7, 236926], [8, 24754217], [9, 75407], [10, 136461], [11, 136748], [12, 146258], [13, 381091], [14, 129815]]
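
As a variation on the parsing step, here is a minimal sketch that uses one anchored pattern to strip the prefix and capture both numbers in a single pass, skipping lines that do not look like records. The helper name parse_records and the constant RECORD_RE are hypothetical, and the pattern assumes every record has the form "0x<hex>_<digits> <digits>" as in the sample above.

```python
import re

# Assumed record layout (taken from the sample in the question):
#   0x<hex address>_<zero-padded index><whitespace><integer value>
RECORD_RE = re.compile(r"^0x[0-9a-fA-F]+_(\d+)\s+(\d+)\s*$")

def parse_records(lines):
    """Return a list of (index, value) tuples, skipping lines that do not match."""
    records = []
    for line in lines:
        match = RECORD_RE.match(line.strip())
        if match:
            records.append((int(match.group(1)), int(match.group(2))))
    return records

sample = [
    "0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000000   82557",
    "0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000004   58111390",
    "",
    "not a record",
]
print(parse_records(sample))  # [(0, 82557), (4, 58111390)]
```

Because parse_records accepts any iterable of strings, an open file object can be passed to it directly, for example inside a with open('filename.txt') as fh: block.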

Putting it all together...

import re
import numpy as np
import matplotlib.pyplot as plt


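# Split each line on whitespace and keep only lines with at least two fields
with open('40000_output.txt') as fh:
    data = [b for b in [re.split(r"\s+", line.strip()) for line in fh] if len(b) > 1]
# Drop everything up to and including the underscore, then convert both fields to int
final_data = [[int(re.sub(r"\w+_", '', a)), int(b)] for a, b in data]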


# VISUALIZATION

# Transpose the data to get the x and y values
labels, values = zip(*final_data)


# One x position per record: np.arange(n) gives [0 1 2 ... n-1]
indexes = np.arange(len(labels))
width = 1

plt.xlabel("record number (suffix after the prefix)")
plt.ylabel("value")

# Center each bar on its index so the tick labels line up with the bars
plt.bar(indexes, values, width, align='center')
plt.xticks(indexes, labels)
plt.show()
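
Note that the code above draws one bar per record, which is a bar chart rather than a histogram in the strict sense. If what you want is the distribution of the values, one possible approach is to bin them first. The sketch below is only an illustration under assumptions: the power-of-ten binning is my own choice (the sample values range from roughly 7.5e4 to 5.8e7, so linear bins would put almost everything in the first bin), and the names decade_histogram and plot_decade_histogram are hypothetical.

```python
from collections import Counter

def decade_histogram(values):
    """Count positive integers per power-of-ten bin.

    Key k stands for the range [10**k, 10**(k+1)). The digit count is used
    instead of math.log10 so the result is exact for integers.
    """
    return dict(Counter(len(str(v)) - 1 for v in values if v > 0))

def plot_decade_histogram(values):
    """Draw the binned counts as a bar chart (needs matplotlib)."""
    # Imported here so that decade_histogram works without matplotlib installed
    import matplotlib.pyplot as plt

    counts = decade_histogram(values)
    exponents = sorted(counts)
    positions = list(range(len(exponents)))
    plt.bar(positions, [counts[k] for k in exponents], align='center')
    plt.xticks(positions, ["1e%d to 1e%d" % (k, k + 1) for k in exponents])
    plt.xlabel("value range")
    plt.ylabel("number of records")
    plt.show()

# Values from the sample records in the question
sample_values = [82557, 128805, 94990, 121020, 58111390, 167079, 130795,
                 236926, 24754217, 75407, 136461, 136748, 146258, 381091,
                 129815]
print(sorted(decade_histogram(sample_values).items()))  # [(4, 3), (5, 10), (7, 2)]
```

Calling plot_decade_histogram(values) with the second column of final_data would then show how many records fall into each order of magnitude. If you prefer to keep one bar per record, adding plt.yscale('log') before plt.show() in the code above is another way to keep the two very large values from flattening the rest.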
