Python Unknown pattern finding

Okay, basically what I want is to compress a file by reusing code and then, at runtime, replace the missing code. What I've come up with is really ugly and slow, but at least it works. The problem is that the file has no specific structure, for example 'aGVsbG8=\n'; as you can see, it's base64-encoded. My function is really slow because the file is 1700+ characters long and it checks for patterns one character at a time. Please help me with new, better code, or at least help me optimize what I've got :). Anything that helps is welcome. BTW, I have already tried compression libraries, but they didn't compress as well as my ugly function.

def c_long(inp, cap=False, b=5):
    import re,string
    if cap is False: cap = len(inp)
    es = re.escape; le=len; ref = re.findall; ran = range; fi = string.find
    c = b;inpc = inp;pattern = inpc[:b]; l=[]
    rep = string.replace; ins = list.insert
    while True:
        if c == le(inpc) and le(inpc) > b+1: c = b; inpc = inpc[1:]; pattern = inpc[:b]
        elif le(inpc) <= b: break
        if c == cap: c = b; inpc = inpc[1:]; pattern = inpc[:b]
        p = ref(es(pattern),inp)
        pattern += inpc[c]
        if le(p) > 1 and le(pattern) >= b+1:
            if l == []: l = [[pattern,le(p)+le(pattern)]]
            elif le(ref(es(inpc[:c+2]),inp))+le(inpc[:c+2]) < le(p)+le(pattern):
                x = [pattern,le(p)+le(inpc[:c+1])]
                for i in ran(le(l)):
                    if x[1] >= l[i][1] and x[0][:-1] not in l[i][0]: ins(l,i,x); break
                    elif x[1] >= l[i][1] and x[0][:-1] in l[i][0]: l[i] = x; break
                inpc = inpc[:fi(inpc,x[0])] + inpc[le(x[0]):]
                pattern = inpc[:b]
                c = b-1
        c += 1
    d = {}; c = 0
    s = ran(le(l))
    for x in l: inp = rep(inp,x[0],'{%d}' % s[c]); d[str(s[c])] = x[0]; c += 1
    return [inp,d]

def decompress(inp,l): return apply(inp.format, [l[str(x)] for x in sorted([int(x) for x in l.keys()])])

The easiest way to compress base64-encoded data is to first convert it to binary data. Base64 uses four characters for every three bytes, so this step alone saves 25 percent of the storage space:

>>> s = "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXo=\n"
>>> t = s.decode("base64")
>>> len(s)
37
>>> len(t)
26
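
The `str.decode("base64")` call in the session above exists only in Python 2. As a rough Python 3 equivalent, assuming the standard `base64` module and the same sample string, something like this should give the same lengths:

```python
import base64

# Same sample as in the session above. With default settings, b64decode
# discards characters outside the base64 alphabet, such as the trailing newline.
encoded = b"YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXo=\n"
decoded = base64.b64decode(encoded)

print(len(encoded))  # 37
print(len(decoded))  # 26
```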

In most cases, you can compress the string even further using some compression algorithm, like t.encode("bz2") or t.encode("zlib").
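
The `encode("bz2")` and `encode("zlib")` codecs are also Python 2 string methods. Below is a minimal Python 3 sketch of the same idea using the standard `zlib` and `bz2` modules. The helper name `shrink` and the repetitive sample payload are made up for illustration, so the sizes say nothing about the asker's actual file.

```python
import base64
import bz2
import zlib

def shrink(b64_text):
    """Decode base64 text, then report sizes after zlib and bz2 compression."""
    raw = base64.b64decode(b64_text)
    return {
        "base64": len(b64_text),
        "raw": len(raw),
        "zlib": len(zlib.compress(raw, 9)),
        "bz2": len(bz2.compress(raw)),
    }

# Hypothetical, highly repetitive payload; real data will compress less well.
payload = base64.b64encode(b"hello world, " * 200)
sizes = shrink(payload)
print(sizes)
```

Very short or already-compressed inputs can come out larger after this second step, because both formats add a fixed header, so it is worth comparing the sizes before committing to one.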

A few remarks on your code: There are lots of factors that make the code hard to read: inconsistent spacing, overly long lines, meaningless variable names, unidiomatic code, etc. An example: Your decompress() function could be equivalently written as

def decompress(compressed_string, substitutions):
    subst_list = [substitutions[k] for k in sorted(substitutions, key=int)]
    return compressed_string.format(*subst_list)

Now it's already much more obvious what it does. You could go one step further: Why is substitutions a dictionary with the string keys "0", "1" etc.? Not only is it strange to use strings instead of integers, you don't need the keys at all. A simple list will do, and decompress() will simplify to

def decompress(compressed_string, substitutions):
    return compressed_string.format(*substitutions)
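
To show how the list-based version behaves, here is a small usage sketch. The template and substitution table are invented sample values, not output from the asker's `c_long()`.

```python
def decompress(compressed_string, substitutions):
    # Each "{n}" placeholder is replaced by substitutions[n].
    return compressed_string.format(*substitutions)

# Hypothetical compressed form of "abcdefg-abcdefg-xyz"
template = "{0}-{0}-xyz"
table = ["abcdefg"]

restored = decompress(template, table)
print(restored)  # abcdefg-abcdefg-xyz
```

One caveat with any `str.format`-based scheme: literal `{` or `}` characters in the input would have to be doubled before placeholders are inserted. Base64 text contains no braces, so this should not affect the data in the question.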

You might think all this is secondary, but if you make the rest of your code equally readable, you will find the bugs in your code yourself. (There are bugs: it crashes for "abcdefgabcdefg" and many other strings.)

Typically one would pump the program through a compression algorithm optimized for text, then run the result through exec, e.g.

code="""..."""
exec(somelib.decompress(code), globals=???, locals=???)
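
As one possible concrete version of the outline above, here is a sketch in Python 3 that assumes `zlib` as the compression library and base64 as a way to keep the compressed bytes inside a string literal. The one-line `source` program and the `namespace` dictionary are placeholders chosen for illustration.

```python
import base64
import zlib

source = "result = sum(range(10))\n"  # hypothetical program text

# Build step: compress, then base64-encode so the blob fits in a string literal.
blob = base64.b64encode(zlib.compress(source.encode("utf-8"), 9)).decode("ascii")

# Run step: reverse both encodings and execute the code in its own namespace.
namespace = {}
exec(zlib.decompress(base64.b64decode(blob)).decode("utf-8"), namespace)

print(namespace["result"])  # 45
```

Passing an explicit dictionary keeps the executed code from touching the caller's globals. As with any use of `exec`, this is only appropriate for code you trust.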

It may be the case that .pyc / .pyo files are compressed already, and one could check by creating one with x="""aaaaaaaa""", then increasing the length to x="""aaaaaaaaaaaaaaaaaaaaaaa...aaaa""" and seeing if the size changes appreciably.
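
The check described above can be scripted. This is a sketch for Python 3 using the standard `py_compile` and `tempfile` modules. The helper name `pyc_size` and the two literal lengths are arbitrary choices, and the result may differ between interpreter versions.

```python
import os
import py_compile
import tempfile

def pyc_size(literal_length):
    """Compile a one-line module holding a string of 'a's; return the .pyc size."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sample.py")
        out = os.path.join(tmp, "sample.pyc")
        with open(src, "w") as f:
            f.write('x = """%s"""\n' % ("a" * literal_length))
        py_compile.compile(src, cfile=out, doraise=True)
        return os.path.getsize(out)

small = pyc_size(8)
large = pyc_size(10000)
print(small, large, large - small)
```

If the difference is close to the number of characters added, the bytecode file is storing the literal uncompressed, which would mean compressing the source yourself can still pay off.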
