
Create a new data structure by combining a list and a dict

I have two objects. The first is items, a list of lists, where each inner list counts the frequency of each term in one document:

[('lorem', 1), ('ipsum', 1), ('dolor', 1), ('sit', 1), ('amet', 1)]
[('consectetur', 1), ('adipiscing', 1), ('elit', 1), ('sed', 1), ('eiusmod', 1), ('tempor', 1), ('incididunt', 1), ('ut', 3), ('labore', 1), ('et', 1), ('dolore', 1), ('magna', 1), ('aliqua', 1), ('enim', 1), ('ad', 1), ('minim', 1), ('veniam', 1), ('quis', 1), ('nostrud', 1), ('exercitation', 1), ('ullamco', 1), ('laboris', 1), ('nisi', 1), ('aliquip', 1), ('ex', 1), ('ea', 1), ('commodo', 1), ('consequat', 1)]
[('duis', 1), ('aute', 1), ('irure', 1), ('dolor', 1), ('reprehenderit', 1), ('voluptate', 1), ('velit', 1), ('esse', 1), ('cillum', 1), ('dolore', 1), ('eu', 1), ('fugiat', 1), ('nulla', 1), ('pariatur', 1)]
[('excepteur', 1), ('sint', 1), ('occaecat', 1), ('cupidatat', 1), ('non', 1), ('proident', 1), ('sunt', 1), ('culpa', 1), ('qui', 1), ('officia', 1), ('deserunt', 1), ('mollit', 1), ('anim', 1), ('id', 1), ('est', 1), ('laborum', 1)]

The second is document_frequency_dict: a dictionary showing the total number of documents in which each term appears:

{'sit': 1, 'amet': 1, 'dolor': 2, 'lorem': 1, 'ipsum': 1, 'nostrud': 1, 'tempor': 1, 'exercitation': 1, 'magna': 1, 'elit': 1, 'ut': 1, 'ex': 1, 'ad': 1, 'consequat': 1, 'incididunt': 1, 'sed': 1, 'laboris': 1, 'veniam': 1, 'et': 1, 'quis': 1, 'dolore': 2, 'labore': 1, 'minim': 1, 'ullamco': 1, 'eiusmod': 1, 'commodo': 1, 'adipiscing': 1, 'ea': 1, 'aliquip': 1, 'enim': 1, 'nisi': 1, 'consectetur': 1, 'aliqua': 1, 'voluptate': 1, 'reprehenderit': 1, 'eu': 1, 'aute': 1, 'cillum': 1, 'pariatur': 1, 'nulla': 1, 'duis': 1, 'velit': 1, 'fugiat': 1, 'irure': 1, 'esse': 1, 'proident': 1, 'sint': 1, 'officia': 1, 'sunt': 1, 'qui': 1, 'deserunt': 1, 'laborum': 1, 'excepteur': 1, 'anim': 1, 'cupidatat': 1, 'culpa': 1, 'id': 1, 'non': 1, 'mollit': 1, 'occaecat': 1, 'est': 1}

I need to merge these two into a single dictionary with the following shape: word: document_frequency, ((document_id, occurrences in that document), (document_id, occurrences in that document)), word: etc..

It should be noted that document_id comes from a tag in the input file, and that tag will always be present. Assume the IDs are always in order, simply because I cannot imagine a solution for when they are out of order.

Taking the word dolor as an example...

'dolor': 2, (1, 1), (2, 1)
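In Python terms, the target presumably maps each word to its document frequency paired with the per-document (id, count) pairs; a hypothetical literal echoing the dolor example above would look like:

```python
# Hypothetical target shape, echoing the dolor example above
result = {
    'dolor': (2, ((1, 1), (2, 1))),
    # ... one entry per word
}
```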

How can I finish building this custom data structure?

The current body of the code is as follows:

import nltk
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
import csv
import operator
import re
import pandas
import collections
from collections import defaultdict, Counter
import sys

def remove_nums(arr):
    pattern = '[0-9]'
    arr = [re.sub(pattern, '', i) for i in arr]
    return arr

# Main Program
def main():
    myfile = get_input("path")
    stop_words = list(stopwords.words('english'))

    p = r'<P ID=\d+>(.*?)</P>'
    paras = RegexpTokenizer(p)
    num_paragraphs = len(paras.tokenize(myfile))

    currFrequency = collections.Counter()
    #currFrequencies = []

    id_num = 1
    words = RegexpTokenizer(r'\w+')
    document_frequency = collections.Counter()

    for para in paras.tokenize(myfile):
        lower = [word.lower() for word in words.tokenize(para)]
        no_integers = remove_nums(lower)
        dirty_tokens = [data for data in no_integers if data not in stop_words]
        tokens = [data for data in dirty_tokens if data.strip()]
        document_frequency.update(set(tokens))

    for para in paras.tokenize(myfile):
        lower = [word.lower() for word in words.tokenize(para)]
        no_integers = remove_nums(lower)
        dirty_tokens = [data for data in no_integers if data not in stop_words]
        tokens = [data for data in dirty_tokens if data.strip()]
        currFrequencies = collections.Counter(tokens)
        d = dict(currFrequencies)
        items = list(d.items())
        print(items)
        id_num += 1

    print()
    document_frequency_dict = dict(document_frequency)
    print(document_frequency_dict)

For reference, the sample file is:

<P ID=1> Lorem ipsum dolor sit amet </P>
<P ID=2> consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </P>
<P ID=3> Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </P>
<P ID=4> Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. </P>
<P ID=5> 654654 </P>

"The first is items, a list of lists, where each inner list counts the frequency of each term in one document"

Actually, that is not quite the case. Each pass through the loop, your code builds one of the intended inner lists, but it never collects them into a list of lists. As seen here:

    d = dict(currFrequencies)
    items = list(d.items())
    print(items) # the list is printed, but not stored. It's overwritten each time.
    id_num += 1 # Nothing in the code actually uses this value!

In fact, the existing currFrequencies is the more appropriate data structure for the next step, because it lets us directly answer the question: "given a particular document's histogram and a word, how many times does the word appear?".
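For instance (an illustrative snippet, not from the original code), a Counter answers that lookup directly, and returns 0 for absent words instead of raising a KeyError:

```python
from collections import Counter

# Histogram of one small "document"
histogram = Counter("lorem ipsum dolor sit amet dolor".split())
print(histogram['dolor'])   # 2
print(histogram['absent'])  # 0 -- missing keys default to zero, no KeyError
```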

Instead, you should build a dict of these collections.Counter instances, mapping from an ID key (which you can also read from the original HTML) to the Counter. Once you have that, the next step is to get the (id, count) value pairs, like so:

def counts_in_each_paragraph(per_paragraph_counts, word):
    return [
        # the id and the looked-up frequency
        (id, counter[word])
        # of each per-paragraph Counter
        for id, counter in per_paragraph_counts.items()
        # that contains a (non-zero) count for the word
        if word in counter
    ]
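For example, with a hand-built dict of per-paragraph Counters (hypothetical data mirroring the sample file's first and third paragraphs; the function is repeated so the snippet runs on its own):

```python
from collections import Counter

def counts_in_each_paragraph(per_paragraph_counts, word):
    return [
        # the id and the looked-up frequency
        (id, counter[word])
        # of each per-paragraph Counter
        for id, counter in per_paragraph_counts.items()
        # that contains a (non-zero) count for the word
        if word in counter
    ]

per_paragraph_counts = {
    1: Counter({'lorem': 1, 'ipsum': 1, 'dolor': 1}),
    2: Counter({'consectetur': 1, 'adipiscing': 1}),
    3: Counter({'duis': 1, 'dolor': 1, 'dolore': 1}),
}
print(counts_in_each_paragraph(per_paragraph_counts, 'dolor'))  # [(1, 1), (3, 1)]
```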

Then you can build that into the final result, for example:

def full_histogram(per_paragraph_counts, overall_counts):
    return {
        # map the word to its overall count plus per-paragraph count pairs
        word: (count, counts_in_each_paragraph(per_paragraph_counts, word))
        # across all of the overall-count data
        for word, count in overall_counts.items()
    }
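Putting the pieces together on a shortened version of the sample file — a minimal end-to-end sketch that omits the question's stop-word and digit filtering for brevity, and pulls each paragraph's ID straight from the `<P ID=n>` tag:

```python
import re
from collections import Counter

SAMPLE = """<P ID=1> Lorem ipsum dolor sit amet </P>
<P ID=2> consectetur adipiscing elit sed dolore </P>
<P ID=3> Duis aute irure dolor esse cillum dolore </P>"""

def per_paragraph_counters(text):
    # Map each paragraph's ID (from the <P ID=n> tag) to a Counter of its words.
    return {
        int(id_): Counter(word.lower() for word in re.findall(r'\w+', body))
        for id_, body in re.findall(r'<P ID=(\d+)>(.*?)</P>', text)
    }

def counts_in_each_paragraph(per_paragraph_counts, word):
    return [(id_, counter[word])
            for id_, counter in per_paragraph_counts.items()
            if word in counter]

def full_histogram(per_paragraph_counts, overall_counts):
    return {word: (count, counts_in_each_paragraph(per_paragraph_counts, word))
            for word, count in overall_counts.items()}

paragraphs = per_paragraph_counters(SAMPLE)
# Document frequency: in how many paragraphs does each word appear at all?
document_frequency = Counter(w for c in paragraphs.values() for w in set(c))

result = full_histogram(paragraphs, document_frequency)
print(result['dolor'])   # (2, [(1, 1), (3, 1)])
```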
