Converting a list of lists into a dictionary

I have a data file that looks like this:

["Arts & Entertainment", "Arts & Entertainment / Animation & Comics", "Arts & Entertainment / Books & Literature", "Arts & Entertainment / Celebrity/Gossip", "Arts & Entertainment / Fine Art", "Arts & Entertainment / Humor", "Arts & Entertainment / Movies", "Arts & Entertainment / Movies / Action", "Arts & Entertainment / Movies / Comedy", "Arts & Entertainment / Movies / Documentary", "Arts & Entertainment / Movies / Drama", "Arts & Entertainment / Movies / Horror", "Arts & Entertainment / Music", "Arts & Entertainment / Music / Alternative Music", "Arts & Entertainment / Music / Blues", "Arts & Entertainment / Music / Christian Music", "Arts & Entertainment / Music / Classic Rock", "Arts & Entertainment / Music / Classical Music", "Arts & Entertainment / Music / Country Music", "Arts & Entertainment / Music / Electronic Dance Music", "Arts & Entertainment / Music / Heavy Metal", "Arts & Entertainment / Music / Pop Music", "Arts & Entertainment / Music / Rap", "Arts & Entertainment / Radio Stations", "Arts & Entertainment / Television", "Arts & Entertainment / Television / Game Show", "Arts & Entertainment / Television / Kids", "Arts & Entertainment / Television / News", "Arts & Entertainment / Television / Reality", "Arts & Entertainment / Television / Science", "Arts & Entertainment / Television / Sitcom", "Arts & Entertainment / Television / Soap Opera", "Arts & Entertainment / Television / Talk Show", "Autos", "Autos / 4-Wheel Drive/SUVs", "Autos / Buying/Selling Cars", "Autos / Certified Pre-Owned", "Autos / Convertible", "Autos / Coupe", "Autos / Crossover", "Autos / Diesel", "Autos / Electric Vehicles", "Autos / Hatchback", "Autos / Hybrid", "Autos / Luxury", "Autos / Maintenance", "Autos / Maintenance / Parts", "Autos / Maintenance / Repair", "Autos / MiniVan", "Autos / Motorcycles", "Autos / Off-Road Vehicles", "Autos / Road-Side Assistance", "Autos / Sedan", "Autos / Trucks", "Autos / Trucks / Pickup", "Autos / Vintage Cars", "Autos / Wagon", "Business & 
Industry", "Business & Industry / Advertising", "Business & Industry / Agriculture", "Business & Industry / Biotech/Biomedical", "Business & Industry / Business Software", "Business & Industry / Construction", "Business & Industry / Construction / Composites & Plastics", "Business & Industry / Forestry", "Business & Industry / Government", "Business & Industry / Green Solutions", "Business & Industry / Human Resources", "Business & Industry / Logistics", "Business & Industry / Marketing", "Business & Industry / Metals", "Business & Industry / Non-Profit Organizations", "Business & Industry / Power Industry", "Business & Industry / Public Services", "Business & Industry / Public Services / Emergency Services", "Business & Industry / Public Services / Waste Management", "Business & Industry / Purchasing", "Business & Industry / Retail Industry", "Business & Industry / Small Business", "Business & Industry / Telecom", "Career", "Career / Career Planning", "Career / Job Search", "Career / Job Search / Resume Writing/Advice", "Career / Telecommuting", "Career / U.S. 
Military", "Education", "Education / Business School", "Education / College Education", "Education / College Education / Admissions", "Education / College Education / College Life", "Education / Continuing Education", "Education / Distance Learning", "Education / Financial Aid", "Education / Financial Aid / Scholarships", "Education / Graduate School", "Education / Homeschooling", "Education / Language Learning", "Education / Language Learning / English as a 2nd Language", "Education / Primary Education", "Education / Secondary Education", "Education / Special Education", "Finance & Money", "Finance & Money / Credit/Debt & Loans", "Finance & Money / Day Trading", "Finance & Money / Exchange Traded Funds", "Finance & Money / Financial News", "Finance & Money / Financial Planning", "Finance & Money / Financial Planning / Retirement Planning", "Finance & Money / Financial Planning / Tax Planning", "Finance & Money / Foreign Exchange Trading", "Finance & Money / Hedge Fund", "Finance & Money / Insurance", "Finance & Money / Investing", "Finance & Money / Mutual Funds", "Finance & Money / Options", "Finance & Money / Stocks", "Food & Drink", "Food & Drink / Barbecues & Grilling", "Food & Drink / Beverages", "Food & Drink / Beverages / Cocktails/Beer", "Food & Drink / Beverages / Coffee/Tea", "Food & Drink / Beverages / Wine", "Food & Drink / Cuisine-Specific", "Food & Drink / Cuisine-Specific / American Cusine", "Food & Drink / Cuisine-Specific / Cajun/Creole", "Food & Drink / Cuisine-Specific / Chinese Cuisine", "Food & Drink / Cuisine-Specific / French Cuisine", "Food & Drink / Cuisine-Specific / Italian Food", "Food & Drink / Cuisine-Specific / Japanese Food", "Food & Drink / Cuisine-Specific / Mexican Cuisine", "Food & Drink / Desserts & Baking", "Food & Drink / Health/LowFat Cooking", "Food & Drink / Organic Food", "Food & Drink / Vegetarian", "Health & Fitness", "Health & Fitness / A.D.D.", "Health & Fitness / AIDS/HIV", "Health & Fitness / Allergies", "Health & 
Fitness / Alternative Medicine", "Health & Fitness / Alzheimer\\'s Disease", "Health & Fitness / Arthritis", "Health & Fitness / Asthma", "Health & Fitness / Autism/PDD", "Health & Fitness / Bipolar Disorder", "Health & Fitness / Brain Tumor", "Health & Fitness / Cancer", "Health & Fitness / Cancer / Breast Cancer", "Health & Fitness / Cancer / Lung Cancer", "Health & Fitness / Cancer / Prostate Cancer", "Health & Fitness / Cholesterol", "Health & Fitness / Chronic Fatigue Syndrome", "Health & Fitness / Chronic Obstructive Pulmonary Disease", "Health & Fitness / Chronic Pain", "Health & Fitness / Cold & Flu", "Health & Fitness / Deafness", "Health & Fitness / Dental Care", "Health & Fitness / Depression", "Health & Fitness / Dermatology", "Health & Fitness / Diabetes", "Health & Fitness / Epilepsy", "Health & Fitness / Exercise", "Health & Fitness / GERD/Acid Reflux", "Health & Fitness / Headaches/Migraines", "Health & Fitness / Heart Disease", "Health & Fitness / Heart Disease / Women\\'s Heart Disease", "Health & Fitness / Hepatitis", "Health & Fitness / Herbs for Health", "Health & Fitness / Holistic Healing", "Health & Fitness / Hypertension", "Health & Fitness / IBS/Crohn\\'s Disease", "Health & Fitness / Incest/Abuse Support", "Health & Fitness / Incontinence", "Health & Fitness / Infertility", "Health & Fitness / Men\\'s Health", "Health & Fitness / Nursing", "Health & Fitness / Nutrition", "Health & Fitness / Orthopedics", "Health & Fitness / Orthopedics / Sports Medicine", "Health & Fitness / Panic/Anxiety Disorders", "Health & Fitness / Pediatrics", "Health & Fitness / Pharmaceutical", "Health & Fitness / Physical Therapy", "Health & Fitness / Psychology/Psychiatry", "Health & Fitness / Senior Health", "Health & Fitness / Sexuality", "Health & Fitness / Sleep Disorders", "Health & Fitness / Smoking Cessation", "Health & Fitness / Substance Abuse", "Health & Fitness / Substance Abuse / Alcoholism", "Health & Fitness / Thyroid Disease", "Health & Fitness / 
Weight Loss", "Health & Fitness / Women\\'s Health", "Hobbies & Games", "Hobbies & Games / Arts & Crafts", "Hobbies & Games / Arts & Crafts / Beadwork", "Hobbies & Games / Arts & Crafts / Drawing/Sketching", "Hobbies & Games / Arts & Crafts / Needlework", "Hobbies & Games / Arts & Crafts / Painting", "Hobbies & Games / Arts & Crafts / Photography", "Hobbies & Games / Arts & Crafts / Woodworking", "Hobbies & Games / Astrology", "Hobbies & Games / Birdwatching", "Hobbies & Games / BoardGames/Puzzles", "Hobbies & Games / Candle & Soap Making", "Hobbies & Games / Card Games", "Hobbies & Games / Chess", "Hobbies & Games / Cigars", "Hobbies & Games / Collecting", "Hobbies & Games / Collecting / Antiques", "Hobbies & Games / Collecting / Book Collecting", "Hobbies & Games / Collecting / Miniatures", "Hobbies & Games / Collecting / Stamps & Coins", "Hobbies & Games / Creative Writing", "Hobbies & Games / Getting Published", "Hobbies & Games / Home Recording", "Hobbies & Games / Inventors & Patents", "Hobbies & Games / Learning a Musical Instrument", "Hobbies & Games / Learning a Musical Instrument / Guitar", "Hobbies & Games / Magic & Illusion", "Hobbies & Games / Paranormal Phenomena", "Hobbies & Games / Sci-Fi & Fantasy", "Hobbies & Games / Video Games", "Hobbies & Games / Video Games / Nintendo", "Hobbies & Games / Video Games / PSP", "Hobbies & Games / Video Games / Playstation", "Hobbies & Games / Video Games / RPG", "Hobbies & Games / Video Games / Racing", "Hobbies & Games / Video Games / X-Box", "Home & Garden", "Home & Garden / Appliances", "Home & Garden / Environmental Safety", "Home & Garden / Gardening/Landscaping", "Home & Garden / Home Repair", "Home & Garden / Interior Decorating", "News & Current Affairs", "News & Current Affairs / Law & Politics", "News & Current Affairs / Law & Politics / Immigration", "News & Current Affairs / Law & Politics / Legal Issues", "News & Current Affairs / Law & Politics / U.S. 
Government Resources", "Parenting & Family", "Parenting & Family / Adoption", "Parenting & Family / Babies & Toddlers", "Parenting & Family / Daycare/Pre-School", "Parenting & Family / Parenting Children", "Parenting & Family / Parenting Teens", "Parenting & Family / Pregnancy", "Parenting & Family / Special Needs Kids", "Pets", "Pets / Aquariums", "Pets / Cats", "Pets / Dogs", "Pets / Veterinary Medicine", "Real Estate", "Real Estate / Apartments", "Real Estate / Architecture", "Real Estate / Buying/Selling Homes", "Religion", "Religion / Alternative Religions", "Religion / Atheism/Agnosticism", "Religion / Buddhism", "Religion / Catholicism", "Religion / Christianity", "Religion / Hinduism", "Religion / Islam", "Religion / Judaism", "Religion / Latter-Day Saints", "Religion / Pagan/Wiccan", "Science", "Science / Astronomy", "Science / Biology", "Science / Chemistry", "Science / Geology", "Science / Physics", "Sensitive Content", "Sensitive Content / Gambling", "Sensitive Content / Gambling / Sports Gambling", "Society", "Society / Dating", "Society / Divorce", "Society / Gay Life", "Society / Marriage", "Society / Senior Living", "Society / Weddings", "Sports & Recreation", "Sports & Recreation / Auto Racing", "Sports & Recreation / Auto Racing / NASCAR Racing", "Sports & Recreation / Baseball", "Sports & Recreation / Basketball", "Sports & Recreation / Bicycling", "Sports & Recreation / Bicycling / Mountain Biking", "Sports & Recreation / Bodybuilding", "Sports & Recreation / Boxing", "Sports & Recreation / Canoeing/Kayaking", "Sports & Recreation / Cheerleading", "Sports & Recreation / Climbing", "Sports & Recreation / College Sports", "Sports & Recreation / Cricket", "Sports & Recreation / Figure Skating", "Sports & Recreation / Fishing", "Sports & Recreation / Fishing / Fly Fishing", "Sports & Recreation / Fishing / Freshwater Fishing", "Sports & Recreation / Fishing / Game & Fish", "Sports & Recreation / Fishing / Saltwater Fishing", "Sports & Recreation / 
Football", "Sports & Recreation / Golf", "Sports & Recreation / Horses", "Sports & Recreation / Horses / Horse Racing", "Sports & Recreation / Hunting/Shooting", "Sports & Recreation / Ice Hockey", "Sports & Recreation / Inline Skating", "Sports & Recreation / Martial Arts", "Sports & Recreation / Olympics", "Sports & Recreation / Paintball", "Sports & Recreation / Rodeo", "Sports & Recreation / Rugby", "Sports & Recreation / Running/Walking", "Sports & Recreation / Sailing", "Sports & Recreation / Scuba Diving", "Sports & Recreation / Skateboarding", "Sports & Recreation / Skiing", "Sports & Recreation / Snowboarding", "Sports & Recreation / Soccer", "Sports & Recreation / Surfing/Bodyboarding", "Sports & Recreation / Swimming", "Sports & Recreation / Table Tennis/Ping-Pong", "Sports & Recreation / Tennis", "Sports & Recreation / Volleyball", "Sports & Recreation / Waterski/Wakeboard", "Sports & Recreation / Yachting", "Style & Fashion", "Style & Fashion / Body Art", "Style & Fashion / Cosmetics", "Style & Fashion / Fashion", "Style & Fashion / Jewelry", "Technology & Computing", "Technology & Computing / Cameras & Camcorders", "Technology & Computing / Cell Phones", "Technology & Computing / Computer Certification", "Technology & Computing / Computer Networking", "Technology & Computing / Computer Peripherals", "Technology & Computing / Computer Security", "Technology & Computing / Computer Security / Antivirus Software", "Technology & Computing / Computer Security / Network Security", "Technology & Computing / Databases", "Technology & Computing / Graphics", "Technology & Computing / Graphics / 3-D Graphics", "Technology & Computing / Graphics / Animation", "Technology & Computing / Graphics / Desktop Publishing", "Technology & Computing / Graphics / Desktop Video", "Technology & Computing / Graphics / Web Design/HTML", "Technology & Computing / Home Theater Systems", "Technology & Computing / Operating Systems", "Technology & Computing / Operating Systems / 
Linux", "Technology & Computing / Operating Systems / Mac OS", "Technology & Computing / Operating Systems / Unix", "Technology & Computing / Operating Systems / Windows", "Technology & Computing / Portable Device", "Technology & Computing / Programming", "Technology & Computing / Programming / C/C++", "Technology & Computing / Programming / Java", "Technology & Computing / Programming / JavaScript", "Technology & Computing / Programming / Visual Basic", "Travel", "Travel / Adventure Travel", "Travel / Africa", "Travel / Air Travel", "Travel / Asia", "Travel / Asia / Japan", "Travel / Australia & New Zealand", "Travel / Bed & Breakfasts", "Travel / Budget Travel", "Travel / Business Travel", "Travel / Camping", "Travel / Canada", "Travel / Caribbean", "Travel / Cruises", "Travel / Europe", "Travel / Europe / Eastern Europe", "Travel / Europe / France", "Travel / Europe / Greece", "Travel / Europe / Italy", "Travel / Europe / United Kingdom", "Travel / Honeymoons/Getaways", "Travel / Hotels", "Travel / Mexico & Central America", "Travel / National Parks", "Travel / South America", "Travel / Spas", "Travel / Theme Parks", "Travel / United States", "Travel / United States / California", "Travel / United States / Florida", "Travel / United States / Hawaii", "Travel / United States / Las Vegas, Nevada", "Travel / United States / Manhattan, New York", "Travel / United States / New England", "Travel / United States / Texas", "Travel / Weather"]

I clean up the data file and split it, so that it looks something like this:

['Arts & Entertainment']
['Arts & Entertainment', 'Animation & Comics']
['Arts & Entertainment', 'Books & Literature']
['Arts & Entertainment', 'Celebrity Gossip']
['Arts & Entertainment', 'Fine Art']
['Arts & Entertainment', 'Humor']
['Arts & Entertainment', 'Movies']
['Arts & Entertainment', 'Movies', 'Action']
['Arts & Entertainment', 'Movies', 'Comedy']
['Arts & Entertainment', 'Movies', 'Documentary']
['Arts & Entertainment', 'Movies', 'Drama']
['Arts & Entertainment', 'Movies', 'Horror']
['Arts & Entertainment', 'Music']
['Arts & Entertainment', 'Music', 'Alternative Music']
['Arts & Entertainment', 'Music', 'Blues']
['Arts & Entertainment', 'Music', 'Christian Music']
['Arts & Entertainment', 'Music', 'Classic Rock']
['Arts & Entertainment', 'Music', 'Classical Music']
['Arts & Entertainment', 'Music', 'Country Music']
['Arts & Entertainment', 'Music', 'Electronic Dance Music']
['Arts & Entertainment', 'Music', 'Heavy Metal']
['Arts & Entertainment', 'Music', 'Pop Music']
['Arts & Entertainment', 'Music', 'Rap']
['Arts & Entertainment', 'Radio Stations']
['Arts & Entertainment', 'Television']
['Arts & Entertainment', 'Television', 'Game Show']
['Arts & Entertainment', 'Television', 'Kids']
['Arts & Entertainment', 'Television', 'News']
['Arts & Entertainment', 'Television', 'Reality']
['Arts & Entertainment', 'Television', 'Science']
['Arts & Entertainment', 'Television', 'Sitcom']
['Arts & Entertainment', 'Television', 'Soap Opera']
['Arts & Entertainment', 'Television', 'Talk Show']...

Now I'm trying to convert the list objects into a dictionary that looks like this:

{
    "Arts & Entertainment": {
        "Animation & Comics": {}, 
        "Books & Literature": {}, 
        "Celebrity Gossip": {}, 
        "Fine Art": {}, 
        "Humor": {}, 
        "Movies": {
            "Horror": {},
            "Action": {},
            "Comedy": {}, ...
        }, ...
}

The problem is that I can't figure out how to avoid overwriting my subcategories. In the example above, the Movies key has three subcategories under it, but when I run my code (shown below), it contains only the key "Horror", because Horror is the last element of the last list in that category.

Example of what I'm getting:

{
    "Arts & Entertainment": {
        "Animation & Comics": {}, 
        "Books & Literature": {}, 
        "Celebrity Gossip": {}, 
        "Fine Art": {}, 
        "Humor": {}, 
        "Movies": {
            "Horror": {} # notice there are no other categories in the movies section
        }, ...
}

Code I've tried:

def cleanup_contextweb():
  contextweb_file_path = directory_path + raw_file_names[1]
  tree = {}
  with open(contextweb_file_path, 'r') as contextweb_file:
    cats = contextweb_file.read().replace('Manhattan, New York', 'Manhattan New York').replace('Las Vegas, Nevada', 'Las Vegas Nevada').replace('Celebrity/Gossip', 'Celebrity Gossip').replace('Atheism/Agnosticism', 'Atheism Agnosticism').replace('Pagan/Wiccan', 'Pagan Wiccan').split(',')
    #cats = re.sub(r'"|\[|\]', '', cats)
    cats = [map(str.strip, re.sub(r'"|\[|\]', '', cat).split('/')) for cat in cats]
    cats = sorted(cats)
    for cat in cats:
      if len(cat) == 1:
        tree[cat[0]] = {}
      elif len(cat) == 2:
        tree[cat[0]][cat[1]] = {}
      elif len(cat) == 3:
        tree[cat[0]][cat[1]] = {}
        tree[cat[0]][cat[1]][cat[2]] = {}
      elif len(cat) == 4:
        tree[cat[0]][cat[1]] = {}
        tree[cat[0]][cat[1]][cat[2]] = {}
        tree[cat[0]][cat[1]][cat[2]][cat[3]] = {}
  with open(directory_path + 'cleaned_' + raw_file_names[1], 'w') as contextweb_file_out:
    json.dump(tree, contextweb_file_out, sort_keys=True, indent=4)

  return json.dumps(tree, sort_keys=True, indent=4)

As you can see, I build the dictionary based on the length of each list passed in, which tells me how deep I am (how many keys I need). I also tried other things that I have since erased, including sorting the list of lists (cats) by sub-list length and reversing it, so that all lists with 4 elements would be iterated over first. I thought I could build the keys up that way, because the keys would already exist for the lower levels. That didn't really help.

Actually, a for-loop can produce quite a nice solution too:

>>> data
[['a', 'b', 'c', 'd'], ['a', 'b', 'c'], ['a', 's', 'd'], ['a', 'b', 'c', 'd', 'e']]
>>> tree = {}
>>> for cats in data:
...      curtree = tree
...      for c in cats:
...          curtree = curtree.setdefault(c, {})
... 
>>> tree
{'a': {'s': {'d': {}}, 'b': {'c': {'d': {'e': {}}}}}}

The .setdefault() method ensures that a sub-dictionary is added if and only if the key (category) does not already exist; if the key is already present, the existing sub-dictionary is returned unchanged.

curtree starts at the base dictionary tree and traverses or builds the tree one level at a time using the categories.
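To connect this to the question's data: assignments such as tree[cat[0]][cat[1]] = {} replace whatever sub-dictionary is already stored under that key, which is why only the last subcategory survives. Below is a minimal sketch of the same setdefault loop applied to category strings. The function name build_tree and the short raw list are made up for illustration, and splitting on " / " (with surrounding spaces) is an assumption about the file format that would leave names such as "Celebrity/Gossip" intact.

```python
import json


def build_tree(paths):
    """Build a nested dict from an iterable of category paths (lists of str)."""
    tree = {}
    for path in paths:
        node = tree
        for name in path:
            # Reuse the existing sub-dictionary if the key is present,
            # otherwise insert an empty one; never overwrite.
            node = node.setdefault(name, {})
    return tree


# Hypothetical sample in the same shape as the question's data file.
raw = [
    "Arts & Entertainment",
    "Arts & Entertainment / Movies",
    "Arts & Entertainment / Movies / Action",
    "Arts & Entertainment / Movies / Comedy",
    "Arts & Entertainment / Movies / Horror",
]

# Assumption: levels are separated by " / " (slash with spaces on both sides).
paths = [entry.split(" / ") for entry in raw]
tree = build_tree(paths)
print(json.dumps(tree, sort_keys=True, indent=4))
```

Because each path is walked from the root, the order of the input lists does not matter and no sorting by length is needed.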

Here's what it looks like with recursion:

data = [
    ['Arts & Entertainment'],
    ['Arts & Entertainment', 'Animation & Comics'],
    ...,      # full data list elided for readability
    ['Arts & Entertainment', 'Television', 'Talk Show']
]

def classify(in_list):
    sub_dict = {}

    label_set = set([category[0] for category in in_list])
    for label in label_set:
        # print label
        sub_category = [sub[1:] for sub in in_list if sub[0] == label and len(sub) > 1]
        # print sub_category
        sub_dict[label] = classify(sub_category)

    return sub_dict


print(classify(data))

Output (which I did not format for readability):

{'Arts & Entertainment': {'Celebrity Gossip': {}, 'Humor': {}, 'Television': {'Game Show': {}, 'Kids': {}, 'Science': {}, 'Talk Show': {}, 'Sitcom': {}, 'Reality': {}, 'Soap Opera': {}, 'News': {}}, 'Animation & Comics': {}, 'Movies': {'Action': {}, 'Drama': {}, 'Horror': {}, 'Comedy': {}, 'Documentary': {}}, 'Radio Stations': {}, 'Music': {'Alternative Music': {}, 'Christian Music': {}, 'Electronic Dance Music': {}, 'Pop Music': {}, 'Country Music': {}, 'Classical Music': {}, 'Rap': {}, 'Heavy Metal': {}, 'Blues': {}, 'Classic Rock': {}}, 'Fine Art': {}, 'Books & Literature': {}}}
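If a readable version of output like this is wanted, json.dumps with sort_keys and indent (the same arguments the question's code passes) is one option. A small sketch follows; the dictionary nested is a shortened, made-up stand-in for the real result.

```python
import json

# Shortened stand-in for a nested category dictionary.
nested = {
    'Arts & Entertainment': {
        'Movies': {'Horror': {}, 'Action': {}},
        'Humor': {},
    }
}

# sort_keys orders the keys alphabetically at every level;
# indent=4 puts each key on its own indented line.
pretty = json.dumps(nested, sort_keys=True, indent=4)
print(pretty)
```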
