
Create new data structure by combining list and dict

I have two objects. The first, items , is a list of lists, where each inner list records the frequency of each term in one document:

[('lorem', 1), ('ipsum', 1), ('dolor', 1), ('sit', 1), ('amet', 1)]
[('consectetur', 1), ('adipiscing', 1), ('elit', 1), ('sed', 1), ('eiusmod', 1), ('tempor', 1), ('incididunt', 1), ('ut', 3), ('labore', 1), ('et', 1), ('dolore', 1), ('magna', 1), ('aliqua', 1), ('enim', 1), ('ad', 1), ('minim', 1), ('veniam', 1), ('quis', 1), ('nostrud', 1), ('exercitation', 1), ('ullamco', 1), ('laboris', 1), ('nisi', 1), ('aliquip', 1), ('ex', 1), ('ea', 1), ('commodo', 1), ('consequat', 1)]
[('duis', 1), ('aute', 1), ('irure', 1), ('dolor', 1), ('reprehenderit', 1), ('voluptate', 1), ('velit', 1), ('esse', 1), ('cillum', 1), ('dolore', 1), ('eu', 1), ('fugiat', 1), ('nulla', 1), ('pariatur', 1)]
[('excepteur', 1), ('sint', 1), ('occaecat', 1), ('cupidatat', 1), ('non', 1), ('proident', 1), ('sunt', 1), ('culpa', 1), ('qui', 1), ('officia', 1), ('deserunt', 1), ('mollit', 1), ('anim', 1), ('id', 1), ('est', 1), ('laborum', 1)]

And the second, document_frequency_dict : a dictionary giving the number of documents each term appears in:

{'sit': 1, 'amet': 1, 'dolor': 2, 'lorem': 1, 'ipsum': 1, 'nostrud': 1, 'tempor': 1, 'exercitation': 1, 'magna': 1, 'elit': 1, 'ut': 1, 'ex':
1, 'ad': 1, 'consequat': 1, 'incididunt': 1, 'sed': 1, 'laboris': 1, 'veniam': 1, 'et': 1, 'quis': 1, 'dolore': 2, 'labore': 1, 'minim': 1, 'ullamco': 1, 'eiusmod': 1, 'commodo': 1, 'adipiscing': 1, 'ea': 1, 'aliquip': 1, 'enim': 1, 'nisi': 1, 'consectetur': 1, 'aliqua': 1, 'voluptate': 1, 'reprehenderit': 1, 'eu': 1, 'aute': 1, 'cillum': 1, 'pariatur': 1, 'nulla': 1, 'duis': 1, 'velit': 1, 'fugiat': 1, 'irure': 1, 'esse': 1, 'proident': 1, 'sint': 1, 'officia': 1, 'sunt': 1, 'qui': 1, 'deserunt': 1, 'laborum': 1, 'excepteur': 1, 'anim': 1, 'cupidatat': 1, 'culpa': 1, 'id': 1, 'non': 1, 'mollit': 1, 'occaecat': 1, 'est': 1}

I need to combine these two objects into one dictionary with the following shape:

word: document_frequency, ((document_id, occurrences in that document), (document_id, occurrences in that document)), word: etc.

It should be noted that document_id derives from the <P ID=n> tags in the input file, which will always exist. I am assuming they will always be in order, only because I cannot conceive of a solution for when they are out of order.

Taking for example the word dolor , which appears once in paragraph 1 and once in paragraph 3:

'dolor': 2, (1, 1), (3, 1)

How can I accomplish the creation of this custom data structure?
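To be concrete, for the sample data I would expect something that begins like this (the exact container for the inner pairs doesn't matter to me; tuples are just one option):

```python
# Sketch of the target structure for the sample data (abridged)
combined = {
    'dolor': (2, ((1, 1), (3, 1))),  # in 2 documents: once in #1, once in #3
    'ut':    (1, ((2, 3),)),         # in 1 document: 3 times in #2
    # ... one entry per word
}
```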

The current code body is below:

import nltk
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
import csv
import operator
import re
import pandas
import collections
from collections import defaultdict, Counter
import sys

def remove_nums(arr):
    pattern = '[0-9]'
    arr = [re.sub(pattern, '', i) for i in arr]
    return arr

# Main Program
def main():
    myfile = get_input("path")
    stop_words = list(stopwords.words('english'))

    p = r'<P ID=\d+>(.*?)</P>'
    paras = RegexpTokenizer(p)
    num_paragraphs = len(paras.tokenize(myfile))

    currFrequency = collections.Counter()
    #currFrequencies = []
    id_num = 1

    words = RegexpTokenizer(r'\w+')
    document_frequency = collections.Counter()

    for para in paras.tokenize(myfile):
        lower = [word.lower() for word in words.tokenize(para)]
        no_integers = remove_nums(lower)
        dirty_tokens = [data for data in no_integers if data not in stop_words]
        tokens = [data for data in dirty_tokens if data.strip()]
        document_frequency.update(set(tokens))

    for para in paras.tokenize(myfile):
        lower = [word.lower() for word in words.tokenize(para)]
        no_integers = remove_nums(lower)
        dirty_tokens = [data for data in no_integers if data not in stop_words]
        tokens = [data for data in dirty_tokens if data.strip()]
        currFrequencies = collections.Counter(tokens)
        d = dict(currFrequencies)
        items = list(d.items())
        print(items)
        id_num += 1

    print()
    document_frequency_dict = dict(document_frequency)
    print(document_frequency_dict)

For reference, an example file is:

<P ID=1> Lorem ipsum dolor sit amet </P>
<P ID=2> consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </P>
<P ID=3> Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </P>
<P ID=4> Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. </P>
<P ID=5> 654654 </P>

the first, items, which is a list of lists where each list counts the frequency of a term in a document

This is not actually the case; your code builds one of the intended inner lists each time through the loop, but does not put them into a list of lists. As seen here:

    d = dict(currFrequencies)
    items = list(d.items())
    print(items) # the list is printed, but not stored. It's overwritten each time.
    id_num += 1 # Nothing in the code actually uses this value!

In fact, the existing currFrequencies is a much more appropriate data structure for the next step, because it lets us directly answer the question, "given the histogram for a specific document, and a word, how many times does the word appear?".
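For instance, a per-document Counter answers that lookup directly (the values here are abridged by hand from paragraph 2 of the sample file):

```python
from collections import Counter

# Abridged per-document histogram for paragraph 2 of the sample file
doc2 = Counter({'consectetur': 1, 'ut': 3, 'labore': 1})

print(doc2['ut'])     # 3 -- direct lookup of a word's frequency
print(doc2['lorem'])  # 0 -- absent words give 0, not a KeyError
```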

You should try to build a dict of these collections.Counter instances, mapping from the paragraph ID (which you can capture from the <P ID=n> tags in the original input) to its Counter. Once you have that, the next step is to get the pairs of (id, count) values, something like:

def counts_in_each_paragraph(per_paragraph_counts, word):
    return [
        # the id and the looked-up frequency
        (id, counter[word])
        # of each per-paragraph Counter
        for id, counter in per_paragraph_counts.items()
        # that contains a (non-zero) count for the word
        if word in counter
    ]

which you can then build into the final result, something like:

def full_histogram(per_paragraph_counts, overall_counts):
    return {
        # map the word to its overall count plus per-paragraph count pairs
        word: (count, counts_in_each_paragraph(per_paragraph_counts, word))
        # across all of the overall-count data
        for word, count in overall_counts.items()
    }
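Putting the pieces together on a cut-down version of the sample file: the helper below (per_paragraph_counters is a hypothetical name, and for brevity it skips the stop-word and digit filtering from the original code) builds the ID-to-Counter mapping, and the two functions shown above then produce the final structure. A rough sketch:

```python
import re
from collections import Counter

def per_paragraph_counters(text):
    """Map each paragraph ID (captured from its <P ID=n> tag) to a
    Counter of the tokens in that paragraph."""
    counters = {}
    for id_str, body in re.findall(r'<P ID=(\d+)>(.*?)</P>', text, re.DOTALL):
        counters[int(id_str)] = Counter(re.findall(r'\w+', body.lower()))
    return counters

def counts_in_each_paragraph(per_paragraph_counts, word):
    return [(id, counter[word])
            for id, counter in per_paragraph_counts.items()
            if word in counter]

def full_histogram(per_paragraph_counts, overall_counts):
    return {word: (count, counts_in_each_paragraph(per_paragraph_counts, word))
            for word, count in overall_counts.items()}

sample = """<P ID=1> Lorem ipsum dolor sit amet </P>
<P ID=3> Duis aute irure dolor </P>"""

per_paragraph = per_paragraph_counters(sample)

# Rebuild the document-frequency Counter from the per-paragraph data
overall = Counter()
for counter in per_paragraph.values():
    overall.update(set(counter))

result = full_histogram(per_paragraph, overall)
print(result['dolor'])  # (2, [(1, 1), (3, 1)])
```

Note that the IDs come straight out of the captured tag text, so this also works if the paragraphs appear out of order.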
