Confusion regarding dictionary assignments and returns in Python

I'm trying to modify some basic code for downloading and parsing SEC filings, but there's something being done in the parsing of the headers that I find completely baffling. I don't understand what's going on in the dictionary creation and header assignment of the following code:

def download_filing(filing):
    data=None
    try:
        data=open(filing).read()
    except IOError:
        # Catch only file errors; the parenthesized print works in Python 2 and 3
        print('Failed to get data...')

    if data is None: return None

    headers={}

    docs=[]
    docdata={}
    intext=False  
    inheaders=False
    headerstack=['','','','','']

    for line in data.split('\n'):
        if line.strip()=='<DOCUMENT>':
            # Beginning of a new document
            docdata={'type':None,'sequence':-1,'filename':None,'description':None,'text':''}
        elif line.strip()=='</DOCUMENT>':
            # End of a document
            docs.append(docdata)
        elif line.strip()=='<TEXT>':
            # Text block
            intext=True
        elif line.strip()=='</TEXT>':
            # End of the text block
            intext=False
        elif line.strip().startswith('<SEC-HEADER>'):
            inheaders=True
        elif line.strip().startswith('</SEC-HEADER>'):
            inheaders=False
        elif inheaders and line.strip()!='':
            # Number of tabs before desc
            level=line.find(line.strip())
            sline=line.strip().replace(':','',1)

            # Find the dictionary level
            curdict=headers
            for i in range(level):
                curdict=curdict[headerstack[i]]

            # Determine if this is a field or another level of fields
            if sline.find('\t')!=-1:
                curdict[sline.split('\t')[0]]=sline.split('\t')[-1]
            else:
                headerstack[level]=sline
                curdict.setdefault(sline,{})

        elif intext:
            docdata['text']+=line+'\n'
        else:
            # See if this is document metadata
            for header in DOC_HEADERS:
                if line.startswith(header):
                    field=DOC_HEADERS[header]
                    docdata[field]=line[len(header):]

    return headers,docs

The goal is to parse an SEC filing like this: http://www.sec.gov/Archives/edgar/data/356213/0000898430-95-000806.txt

and return a tuple which contains a dictionary of dictionaries as "headers" and a list of dictionaries in "docs". Most of it appears pretty straightforward to me. Open the filing, read it line by line, and generate some control flow which tells the function whether it's in the header part of the document or the text part of the document. I also understand the list creation algorithm at the end which appends all of the "docdata" together.

However, the headers part is blowing my mind. I more or less understand how the header parser is trying to create nests of dictionaries based on the number of tabs before each block item, and then determining where to stick each key. What I don't understand is how it is filling this into the "headers" variable. It appears to be assigning headers to curdict, which seems completely backwards to me. The program defines headers as an empty dict at the top, then for each line assigns this empty dictionary to curdict and goes on from there. It then returns headers, which appears never to have been explicitly modified again.

I'm guessing that this is down to my complete lack of understanding of how object assignment works in Python. I'm sure it's really obvious, but I'm not advanced enough to have seen programs written this way.

headers is a nested tree of dictionaries. The loop that assigns to curdict goes down to the Nth level in this tree, using headerstack[i] as the key for each level. It starts by initializing curdict to the top-level headers, then on each iteration it resets it to the child dictionary based on the next item in headerstack.
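To see the descent in isolation, here is a minimal sketch that pulls the header branch of the question's loop into a standalone function. The name parse_header_lines and the sample lines are made up for illustration; the logic inside the loop is copied from the question's code.

```python
def parse_header_lines(lines):
    """Hypothetical helper: build a nested dict from tab-indented header lines."""
    headers = {}
    headerstack = ['', '', '', '', '']
    for line in lines:
        if line.strip() == '':
            continue
        level = line.find(line.strip())        # count of leading whitespace characters
        sline = line.strip().replace(':', '', 1)

        curdict = headers                      # start at the root on every line
        for i in range(level):
            curdict = curdict[headerstack[i]]  # step one level down the tree

        if '\t' in sline:
            # "KEY<tab>VALUE": store a field in the current dictionary
            curdict[sline.split('\t')[0]] = sline.split('\t')[-1]
        else:
            # A bare name opens a new nested level
            headerstack[level] = sline
            curdict.setdefault(sline, {})
    return headers


# Invented sample data, shaped like an indented SEC header block
sample = [
    'FILER:',
    '\tCOMPANY DATA:',
    '\t\tCOMPANY CONFORMED NAME:\tEXAMPLE CORP',
    '\t\tSTATE OF INCORPORATION:\tDE',
]
result = parse_header_lines(sample)
print(result)
```

Nothing is ever assigned to headers after its creation, yet it ends up holding the whole tree, because every write goes through curdict into a dictionary that headers can already reach.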

In Python, as in most OO languages, object assignment is by reference, not by copying. So once the final assignment to curdict is done, it contains a reference to one of the nested dictionaries. Then when it does:

curdict[sline.split('\t')[0]]=sline.split('\t')[-1]

it fills in that dictionary element, which is still part of the full tree that headers refers to.
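A small sketch of that reference behaviour, using throwaway keys and values chosen only for illustration. It contrasts mutating a dictionary through a second name with rebinding the name and with making an explicit copy.

```python
import copy

headers = {'a': {}}

curdict = headers['a']       # a second name for the same inner dict, not a copy
curdict['foo'] = 'bar'       # mutation is visible through headers as well
same_object = curdict is headers['a']

snapshot = copy.deepcopy(headers)   # an independent copy, for contrast
curdict['baz'] = 'qux'              # changes headers, but not snapshot

curdict = {}                 # rebinding the name leaves headers untouched

print(headers)
print(snapshot)
```

Only operations that modify the object (such as item assignment) show up in headers; assigning a new value to the name curdict merely points that name somewhere else.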

For example, if headerstack contains ['a', 'b', 'c', 'd'] and level = 3, then the loop will set curdict to a reference to headers['a']['b']['c']. If sline is 'foo\tbar' (foo and bar separated by a tab), the above assignment will then be equivalent to:

headers['a']['b']['c']['foo'] = 'bar'

I'll show how this happens, step-by-step. At the start of the loop, we have:

curdict == headers

During the first iteration of the loop (range(level) starts at 0, and headerstack[0] is 'a'):

i = 0
curdict = curdict[headerstack[i]]

is equivalent to:

curdict = headers['a']

On the next iteration:

i = 1
curdict = curdict[headerstack[i]]

is equivalent to:

curdict = curdict['b']

which is equivalent to:

curdict = headers['a']['b']

On the next (final) loop iteration:

i = 2
curdict = curdict[headerstack[i]]

is equivalent to:

curdict = curdict['c']

which is:

curdict = headers['a']['b']['c']

So at this point, curdict refers to the same dictionary that headers['a']['b']['c'] does. Anything you do to the dictionary in curdict also happens to the dictionary in headers. So when you do:

curdict['foo'] = 'bar'

it's equivalent to doing:

headers['a']['b']['c']['foo'] = 'bar'
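The walk-through can be checked directly. This sketch assumes the nested levels 'a', 'b' and 'c' already exist in headers, as they would after earlier header lines had been processed.

```python
# Assumed starting state: three nested levels created by earlier lines
headers = {'a': {'b': {'c': {}}}}
headerstack = ['a', 'b', 'c', 'd']
level = 3

curdict = headers
for i in range(level):                 # i takes the values 0, 1, 2
    curdict = curdict[headerstack[i]]

# curdict is now the very same object as the innermost dictionary
is_same = curdict is headers['a']['b']['c']

curdict['foo'] = 'bar'                 # writes into the tree rooted at headers
print(is_same)
print(headers)
```

The identity check with is confirms that no copy is made anywhere along the way, which is why returning headers at the end of download_filing returns the fully populated tree.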
