
Python: How to create nested dictionaries out of lists with specific values

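I apologize in advance for the lengthy post, but I have made sure it is easy to follow and very clear.

My question is this:

How do I create nested dictionaries from lists, using specified keys that repeat in each one?

Here is an example of what I want to build, using data from a fictional news article: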

{'http://www.SomeNewsWebsite.com/Article12345': 
 {'Title': 'Trump Does Another Ridiculous Thing', 
  'Source': 'Some News Website', 
  'Thumbnail': 'SomeNewsWebsite.com/image12345'}} 

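Reading similar posts, I have seen people do comparable things, but I have struggled to port those ideas into my own work.

That is the whole of my question. Below I have posted my code and sample lists generated by that code, which is what I would use to build this nested dictionary. It is also on my GitHub.

So far, I can use the following code to fetch the data, cut out the important bits, and then create two lists: one for URLs and one for headlines. It then uses zip to combine them into a tidy dictionary.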

import re

import requests
from bs4 import BeautifulSoup

url = "http://www.reuters.com"

source = "Reuters"

thumbnail = "http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png"


def soup():
    """ Fetches HTML from site and turns it into a bs4 object. """
    get_html = requests.get(url)
    html = get_html.text
    make_soup = BeautifulSoup(html, 'html.parser')
    return make_soup


# Tell bs4 where to find the important information (headlines, URLs)
important_data = (soup().select(".story-content > .story-title > a"))


# Turn that important data into a string so it may be parsed using RegEx
stringed_data = ' || '.join(str(v) for v in important_data)


def get_headline():
    """ Uses Regular Expressions to find headlines. Returns a list. """
    headline = re.findall(r'(?<=">)(.*?)(?=</a>)', stringed_data)
    return headline


def get_link():
    """ Uses Regular Expressions to find links. Returns a list. """
    link = re.findall(r'(?<=<a href=")(.*?)(?=")', stringed_data)
    return link

def build_dict():
    """ Combine everything into a tidy dictionary. """
    full_urls = [i if i.startswith('http') else url + i for i in get_link()]
    reuters_dictionary = dict(zip(get_headline(), full_urls))
    return full_urls

get_link()
get_headline()
soup()
build_dict()

When run, this code creates two lists and then a dictionary. Sample data looks like this:

List of titles:(29 items long)
['Trump strikes defiant tone ahead of debate', 'Matthew swamps North Carolina, still dangerous as it heads out to sea', "Tesla's Musk says will not have to raise funds in fourth-quarter", 'Suspect arrested in fatal shooting of two California police officers', 'Russia says U.S. actions threaten its national security', 'Western-backed coalition under pressure over Yemen raid', "Fed's Fischer says job gains solid, expects growth to pick up", "Thai king's condition unstable after hemodialysis treatment: palace", 'Pope names new group of cardinals, adding to potential successors', 'Palestinian kills two people in Jerusalem, then shot dead: police', "Commentary: House of Lies — the uncanny allure of 'Girl on the Train'", 'Earnings season begins as White House race heats up', 'Russia expects OPEC to ask non members to consider joining output curb', 'Banks ponder the meaning of life as Deutsche agonizes', 'IMF says still engaged with Greece, no decision yet on bailout role', 'Pound slump exacerbates Brexit impact for German exporters: DIHK', 'Iranian, Iraqi oil ministers will not attend Istanbul talks: sources', 'Ukraine military postpones withdrawal from town, cites rebel shelling', 'German police make new raid in hunt for refugee planning bomb attack', "South African President Zuma's rape accuser dies: family", 'Xi says China must speed up plans for domestic network technology', 'UberEats to expand to Berlin in 2017: Tagesspiegel', 'Beijing, Shanghai propose curbs on who can drive for ride-hailing services', 'Pressure on Trump likely to be intense at second debate with Clinton', "Sanders supporters seethe over Clinton's leaked remarks to Wall St.", 'Evangelical leaders stick with Trump, focus on defeating Clinton', 'Citi sells its Argentinian consumer business to Banco Santander', "Itaú to pay $220 million for Citigroup's Brazil assets", 'LafargeHolcim agrees sale of Chilean business Cemento Polpaico']


List of URLs: (29 items long)
['/article/us-usa-election-idUSKCN1290JZ', '/article/us-storm-matthew-idUSKCN129063', '/article/us-tesla-equity-solarcity-idUSKCN1290QW', '/article/us-california-police-shooting-idUSKCN1280YH', '/article/us-russia-usa-idUSKCN1290DP', '/article/us-yemen-security-coalition-pressure-idUSKCN1290JM', '/article/us-usa-fed-fischer-idUSKCN1290JB', '/article/us-thailand-king-idUSKCN1290R8', '/article/us-pope-cardinals-idUSKCN1290C9', '/article/us-israel-palestinians-violence-idUSKCN129070', '/article/us-society-entertainment-film-idUSKCN127229', '/article/us-usa-stocks-weekahead-idUSKCN1272HS', '/article/us-oil-opec-russia-idUSKCN1290KD', '/article/us-imf-g20-banks-idUSKCN1290DX', '/article/us-imf-g20-greece-idUSKCN1290R6', '/article/us-britain-eu-germany-idUSKCN1290TZ', '/article/us-oil-opec-istanbul-idUSKCN1290N2', '/article/us-ukraine-crisis-withdrawal-idUSKCN1290UL', '/article/us-germany-bomb-idUSKCN1290D2', '/article/us-safrica-zuma-idUSKCN1290SX', '/article/us-china-internet-security-idUSKCN1290LA', '/article/us-uber-germany-eats-idUSKCN1290OB', '/article/us-china-regulations-ride-hailing-idUSKCN1280EL', '/article/us-usa-election-debate-idUSKCN1290AS', '/article/us-usa-election-clinton-idUSKCN1280Z9', '/article/us-usa-election-trump-evangelicals-idUSKCN1280WE', '/article/us-citi-argentina-m-a-banco-santander-ri-idUSKCN1290SD', '/article/us-citibank-brasil-m-a-itau-unibco-hldg-idUSKCN1280HM', '/article/us-lafargeholcim-divestment-chile-idUSKCN1280BU']

Dictionary of titles and URLs: (29 items long)
{'Banks ponder the meaning of life as Deutsche agonizes': 'http://www.reuters.com/article/us-imf-g20-banks-idUSKCN1290DX', 'German police make new raid in hunt for refugee planning bomb attack': 'http://www.reuters.com/article/us-germany-bomb-idUSKCN1290D2', 'Suspect arrested in fatal shooting of two California police officers': 'http://www.reuters.com/article/us-california-police-shooting-idUSKCN1280YH', 'Evangelical leaders stick with Trump, focus on defeating Clinton': 'http://www.reuters.com/article/us-usa-election-trump-evangelicals-idUSKCN1280WE', 'Xi says China must speed up plans for domestic network technology': 'http://www.reuters.com/article/us-china-internet-security-idUSKCN1290LA', "Australia's Rinehart and China's Shanghai CRED agree on deal for Kidman cattle empire": 'http://www.reuters.com/article/us-australia-china-landsale-dakang-p-f-idUSKCN12908O', 'LafargeHolcim agrees sale of Chilean business Cemento Polpaico': 'http://www.reuters.com/article/us-lafargeholcim-divestment-chile-idUSKCN1280BU', 'Citi sells Argentinian consumer unit a day after Brazil sale': 'http://www.reuters.com/article/us-citi-argentina-m-a-banco-santander-ri-idUSKCN1290SD', 'Beijing, Shanghai propose curbs on who can drive for ride-hailing services': 'http://www.reuters.com/article/us-china-regulations-ride-hailing-idUSKCN1280EL', 'Pope names new group of cardinals, adding to potential successors': 'http://www.reuters.com/article/us-pope-cardinals-idUSKCN1290C9', "Commentary: House of Lies — the uncanny allure of 'Girl on the Train'": 'http://www.reuters.com/article/us-society-entertainment-film-idUSKCN127229', 'Iranian, Iraqi oil ministers will not attend Istanbul talks: sources': 'http://www.reuters.com/article/us-oil-opec-istanbul-idUSKCN1290N2', "South African President Zuma's rape accuser dies: family": 'http://www.reuters.com/article/us-safrica-zuma-idUSKCN1290SX', 'Palestinian kills two people in Jerusalem, then shot dead: police': 
'http://www.reuters.com/article/us-israel-palestinians-violence-idUSKCN129070', 'Matthew swamps North Carolina, still dangerous as it heads out to sea': 'http://www.reuters.com/article/us-storm-matthew-idUSKCN129063', 'Western-backed coalition under pressure over Yemen raid': 'http://www.reuters.com/article/us-yemen-security-coalition-pressure-idUSKCN1290JM', 'Trump strikes defiant tone ahead of debate': 'http://www.reuters.com/article/us-usa-election-idUSKCN1290JZ', 'Russia says U.S. actions threaten its national security': 'http://www.reuters.com/article/us-russia-usa-idUSKCN1290DP', 'Pressure on Trump likely to be intense at second debate with Clinton': 'http://www.reuters.com/article/us-usa-election-debate-idUSKCN1290AS', "Sanders supporters seethe over Clinton's leaked remarks to Wall St.": 'http://www.reuters.com/article/us-usa-election-clinton-idUSKCN1280Z9', "Tesla's Musk says will not have to raise funds in fourth-quarter": 'http://www.reuters.com/article/us-tesla-equity-solarcity-idUSKCN1290QW', "Fed's Fischer says job gains solid, expects growth to pick up": 'http://www.reuters.com/article/us-usa-fed-fischer-idUSKCN1290JB', 'Ukraine military postpones withdrawal from town, cites rebel shelling': 'http://www.reuters.com/article/us-ukraine-crisis-withdrawal-idUSKCN1290UL', "Thai king's condition unstable after hemodialysis treatment: palace": 'http://www.reuters.com/article/us-thailand-king-idUSKCN1290R8', 'Earnings season begins as White House race heats up': 'http://www.reuters.com/article/us-usa-stocks-weekahead-idUSKCN1272HS', 'IMF says still engaged with Greece, no decision yet on bailout role': 'http://www.reuters.com/article/us-imf-g20-greece-idUSKCN1290R6', 'Pound slump exacerbates Brexit impact for German exporters: DIHK': 'http://www.reuters.com/article/us-britain-eu-germany-idUSKCN1290TZ', 'Russia expects OPEC to ask non members to consider joining output curb': 'http://www.reuters.com/article/us-oil-opec-russia-idUSKCN1290KD', 'UberEats to 
expand to Berlin in 2017: Tagesspiegel': 'http://www.reuters.com/article/us-uber-germany-eats-idUSKCN1290OB'}

To be clear, I would like to use this data to create a dictionary for each title/URL pair, like this:

{'http://www.reuters.com/article/us-imf-g20-banks-idUSKCN1290DX': 
 {'Title': 'Banks ponder the meaning of life as Deutsche agonizes',
  'Source': 'Reuters', 
  'Thumbnail': 'http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png'}}

Thank you very much for taking the time to read this, and thanks in advance for your help.

Consider a dictionary comprehension:

newsdict = {v: {'Title': k, 
                'Source': 'Reuters', 
                'Thumbnail': 'http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png'} 
           for k, v in reuters_dictionary.items()}
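
As a minimal, self-contained sketch of the comprehension above: it assumes `reuters_dictionary` maps each headline to its full URL, as in the question's sample output (only two of those entries are reproduced here), and it reuses the question's `source` and `thumbnail` values instead of repeating the literals.

```python
# Two entries copied from the question's headline -> URL sample output.
reuters_dictionary = {
    'Banks ponder the meaning of life as Deutsche agonizes':
        'http://www.reuters.com/article/us-imf-g20-banks-idUSKCN1290DX',
    'Trump strikes defiant tone ahead of debate':
        'http://www.reuters.com/article/us-usa-election-idUSKCN1290JZ',
}

source = 'Reuters'
thumbnail = 'http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png'

# Swap the roles: each URL becomes the outer key, and the headline moves
# into the inner dictionary next to the constant source and thumbnail.
newsdict = {
    link: {'Title': title, 'Source': source, 'Thumbnail': thumbnail}
    for title, link in reuters_dictionary.items()
}

key = 'http://www.reuters.com/article/us-imf-g20-banks-idUSKCN1290DX'
print(newsdict[key]['Title'])
```

Because the URLs are unique, each one gets its own inner dictionary and no entry is lost in the swap.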

This should give you the result you want:

def build_dict():
    """ Combine everything into a tidy dictionary. """
    full_urls = [i if i.startswith('http') else url + i for i in get_link()]
    reuters_dictionary = {}
    # The loop variable is named full_url so that it does not shadow the global url.
    for headline, full_url in zip(get_headline(), full_urls):
        reuters_dictionary[full_url] = {
            'Title': headline,
            'Source': source,
            'Thumbnail': thumbnail
        }
    return reuters_dictionary  # the question's version returned full_urls, which discards the dictionary

However, nothing here involves repeated keys. Why do you feel you need repeated keys?
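
To illustrate the point about repeated keys: a Python dictionary holds each key at most once, so assigning to an existing key overwrites the earlier value. In the nested structure only the inner key names ('Title', 'Source', 'Thumbnail') recur, and each set lives in its own inner dictionary, so nothing collides. A small sketch with made-up data (the example.com URLs and headlines are placeholders, not from the question):

```python
articles = {}

# Hypothetical (headline, URL) pairs; the first URL appears twice.
pairs = [
    ('First headline', 'http://example.com/a'),
    ('Second headline', 'http://example.com/b'),
    ('Updated headline', 'http://example.com/a'),
]

for title, link in pairs:
    # Assigning to an existing key replaces the old inner dictionary.
    articles[link] = {'Title': title, 'Source': 'Example'}

print(len(articles))                              # 2
print(articles['http://example.com/a']['Title'])  # Updated headline
```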

Also, you should probably refactor to remove those global variables.
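
One possible shape for that refactor, sketched under assumptions: the function and parameter names below (`build_news_dict`, `base_url`, and so on) are illustrative and do not come from the original post. Everything the function needs is passed in as an argument, so it can be exercised without touching the network.

```python
def build_news_dict(headlines, links, base_url, source, thumbnail):
    """Map each full article URL to a dict of its title, source and thumbnail."""
    news = {}
    for title, link in zip(headlines, links):
        # Relative links get the site prefix, as in the question's code.
        full_url = link if link.startswith('http') else base_url + link
        news[full_url] = {
            'Title': title,
            'Source': source,
            'Thumbnail': thumbnail,
        }
    return news


# Usage with two pairs taken from the question's sample lists.
result = build_news_dict(
    headlines=['Trump strikes defiant tone ahead of debate',
               'Russia says U.S. actions threaten its national security'],
    links=['/article/us-usa-election-idUSKCN1290JZ',
           '/article/us-russia-usa-idUSKCN1290DP'],
    base_url='http://www.reuters.com',
    source='Reuters',
    thumbnail='http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png',
)
print(result['http://www.reuters.com/article/us-russia-usa-idUSKCN1290DP']['Title'])
```

Passing the headline and link lists in explicitly also means the scraping step and the dictionary-building step can be changed independently.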

Finally, if you are already using BeautifulSoup, why fall back to regular expressions afterwards? I think using BeautifulSoup throughout would be more reliable.
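
A sketch of what that could look like, assuming the same CSS selector as the question and a small hand-written HTML snippet standing in for the live page (the real Reuters markup may differ). Each selected `<a>` tag exposes its text through `get_text()` and its attributes through item access, so no regular expression is needed:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the downloaded page; not real Reuters markup.
html = """
<div class="story-content">
  <h3 class="story-title">
    <a href="/article/us-usa-election-idUSKCN1290JZ">Trump strikes defiant tone ahead of debate</a>
  </h3>
</div>
<div class="story-content">
  <h3 class="story-title">
    <a href="http://www.reuters.com/article/us-russia-usa-idUSKCN1290DP">Russia says U.S. actions threaten its national security</a>
  </h3>
</div>
"""

base_url = 'http://www.reuters.com'
soup = BeautifulSoup(html, 'html.parser')

news = {}
for a in soup.select('.story-content > .story-title > a'):
    href = a['href']  # attribute access instead of a regex
    full_url = href if href.startswith('http') else base_url + href
    news[full_url] = {
        'Title': a.get_text(strip=True),  # tag text instead of a regex
        'Source': 'Reuters',
    }

for link, info in news.items():
    print(link, '->', info['Title'])
```

Working on the tag objects directly also removes the intermediate step of joining everything into one string before parsing it again.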

