简体   繁体   English

使用NLTK查找事物列表(例如河流列表)

[英]Find a list of things (e.g. a list of rivers) using NLTK

I would like to get lists of things, for example a list of names of rivers or a list of types of animals. 我想获取事物列表,例如河流名称列表或动物类型列表。

NLTK looks like it might be the thing for this, but I'm not sure how to do what I want. NLTK可能看起来就是这样,但是我不确定该怎么做。 I'd like to have a function like: 我想要一个类似的功能:

get_list_of("river")

that would return something like 那会返回类似

["amazon", "mississippi", "thames", ...]

I would suggest looking at NLTK wordnet API, see http://www.nltk.org/howto/wordnet.html . 我建议您查看NLTK wordnet API,请参阅http://www.nltk.org/howto/wordnet.html

But after doing some digging seems like Proper Nouns (ie names of river are not easy to track down in wordnet) 但是,在进行一些挖掘之后,似乎像是专有名词(即在单词网中不容易找到河流的名称)

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('river')
[Synset('river.n.01')]
>>> wn.synset('river.n.01')
Synset('river.n.01')
>>> wn.synset('river.n.01').lemma_names
['river']
>>> wn.synsets('amazon')
[Synset('amazon.n.01'), Synset('amazon.n.02'), Synset('amazon.n.03'), Synset('amazon.n.04')]
>>> wn.synset('amazon.n.01').definition
'a large strong and aggressive woman'
>>> wn.synset('amazon.n.02').definition
'(Greek mythology) one of a nation of women warriors of Scythia (who burned off the right breast in order to use a bow and arrow more effectively)'
>>> wn.synset('amazon.n.03').definition
"a major South American river; arises in the Andes and flows eastward into the South Atlantic; the world's 2nd longest river (4000 miles)"
>>> wn.synset('amazon.n.04').definition
'mainly green tropical American parrots'

As a brute force way, look for "river" in a synsets' definitions, as such: 作为强力方式,请在同义词集的定义中查找“ river”,例如:

from itertools import chain
list(chain(*[i.lemma_names for i in wn.all_synsets() if "river" in i.definition]))

[out]: [OUT]:

['anaclinal', 'cataclinal', 'Acheronian', 'Acherontic', 'Stygian', 'hit-and-run', 'fluvial', 'riparian', 'Lao', 'debouch', 'rejuvenate', 'drive', 'ford', 'ascend', 'plant', 'drive', 'ford', 'fording', 'drive', 'driving', 'flood_control', 'conservancy', 'road_rage', 'Aegospotami', 'Aegospotamos', 'Yalu_River', 'three-spined_stickleback', 'Gasterosteus_aculeatus', 'ten-spined_stickleback', 'Gasterosteus_pungitius', 'placoderm', 'hellbender', 'mud_puppy', 'Cryptobranchus_alleganiensis', 'plains_spadefoot', 'Scaphiopus_bombifrons', 'mud_turtle', 'cooter', 'river_cooter', 'Pseudemys_concinna', 'spiny_softshell', 'Trionyx_spiniferus', 'smooth_softshell', 'Trionyx_muticus', 'teal', 'pintail', 'pin-tailed_duck', 'Anas_acuta', 'Ancylus', 'genus_Ancylus', 'freshwater_mussel', 'freshwater_clam', 'long-clawed_prawn', 'river_prawn', 'Palaemon_australis', 'Platanistidae', 'family_Platanistidae', 'hippopotamus', 'hippo', 'river_horse', 'Hippopotamus_amphibius', 'waterbuck', 'Australian_lungfish', 'Queensland_lungfish', 'Neoceratodus_forsteri', 'alewife', 'Alosa_pseudoharengus', 'Pomolobus_pseudoharengus', 'sockeye', 'sockeye_salmon', 'red_salmon', 'blueback_salmon', 'Oncorhynchus_nerka', 'brown_trout', 'salmon_trout', 'Salmo_trutta', 'Australian_arowana', 'Dawson_River_salmon', 'saratoga', 'spotted_barramundi', 'spotted_bonytongue', 'Scleropages_leichardti', 'Australian_bonytongue', 'northern_barramundi', 'Scleropages_jardinii', 'crappie', 'striped_bass', 'striper', 'Roccus_saxatilis', 'rockfish', 'bolti', 'Tilapia_nilotica', 'Chinese_paddlefish', 'Psephurus_gladis', 'air_bag', 'Augean_stables', 'barouche', 'bend', 'curve', 'boathouse', 'box', 'box_seat', 'brassie', 'bridge', 'span', 'bridle', 'brougham', 'buggy_whip', 'cab', 'car_mirror', 'coach', 'four-in-hand', 'coach-and-four', 'cockpit', 'death_seat', 'dredge', 'dredging_bucket', 'elbow', 'flat_tip_screwdriver', 'hansom', 'hansom_cab', 'keelboat', 'Lake_Volta', 'levee', 'levee', 'L-plate', 'machine_screw', 'outfall', 'Phillips_screwdriver', 'pull-in', 'pull-up', 'river_boat', 'showboat', 'skidpan', 'spiral_ratchet_screwdriver', 'ratchet_screwdriver', 'towpath', 'towing_path', 'truck_stop', 'willowware', 'willow-pattern', 'woodscrew', 'Copehan', 'Volgaic', 'horn', 'rip', 'riptide', 'tide_rip', 'crosscurrent', 'countercurrent', 'crappie', 'red_salmon', 'sockeye', 'sockeye_salmon', 'logjam', 'Teamsters_Union', 'car_pool', 'conservancy', 'headwater', 'river_basin', 'basin', 'watershed', 'drainage_basin', 'catchment_area', 'catchment_basin', 'drainage_area', 'confluence', 'meeting', 'Mammoth_Cave_National_Park', 'Zion_National_Park', 'watershed', 'water_parting', 'divide', 'Yangon', 'Rangoon', "N'Djamena", 'Ndjamena', 'Fort-Lamy', 'capital_of_Chad', 'Kinshasa', 'Leopoldville', 'Saxony', 'Sachsen', 'Saxe', 'Cologne', 'Koln', 'Mannheim', 'Rhineland', 'Rheinland', 'Ruhr', 'Ruhr_Valley', 'West_Bank', 'Pennines', 'Pennine_Chain', 'Ottawa', 'Canadian_capital', 'capital_of_Canada', 'Antwerpen', 'Antwerp', 'Anvers', 'Orleans', 'Rhone-Alpes', 'Friesland', 'Timbuktu', 'Bydgoszcz', 'Bromberg', 'Novosibirsk', 'Tbilisi', 'Tiflis', 'capital_of_Georgia', 'Toledo', 'Selma', 'Denver', 'Mile-High_City', 'capital_of_Colorado', 'Hartford', 'capital_of_Connecticut', 'Savannah', 'Topeka', 'capital_of_Kansas', 'Louisville', 'New_Orleans', 'Detroit', 'Motor_City', 'Motown', 'Minneapolis', 'Saint_Paul', 'St._Paul', 'capital_of_Minnesota', 'Jefferson_City', 'capital_of_Missouri', 'Saint_Louis', 'St._Louis', 'Gateway_to_the_West', 'Billings', 'Great_Falls', 'Omaha', 'Concord', 'capital_of_New_Hampshire', 'Manchester', 'Trenton', 'capital_of_New_Jersey', 'Albuquerque', 'New_Netherland', 'Albany', 'capital_of_New_York', 'Erie_Canal', 'New_York', 'New_York_City', 'Greater_New_York', 'West_Point', 'Niagara_Falls', 'Schenectady', 'Bismarck', 'capital_of_North_Dakota', 'Fargo', 'Cincinnati', 'Tulsa', 'Chester', 'Philadelphia', 'City_of_Brotherly_Love', 'Pierre', 'capital_of_South_Dakota', 'Mount_Vernon', 'Charleston', 'capital_of_West_Virginia', 'Huntington', 'Morgantown', 'Parkersburg', 'Wheeling', 'Casper', 'Ciudad_Bolivar', 'Aare', 'Aar', 'Aare_River', 'Acheron', 'River_Acheron', 'Adige', 'River_Adige', 'Aire', 'River_Aire', 'Aire_River', 'Alabama', 'Alabama_River', 'Allegheny', 'Allegheny_River', 'Amazon', 'Amazon_River', 'Amur', 'Amur_River', 'Heilong_Jiang', 'Heilong', 'Angara', 'Angara_River', 'Tunguska', 'Upper_Tunguska', 'Apalachicola', 'Apalachicola_River', 'Araguaia', 'Araguaia_River', 'Araguaya', 'Araguaya_River', 'Aras', 'Araxes', 'Arauca', 'Argun', 'Argun_River', 'Ergun_He', 'Arkansas', 'Arkansas_River', 'Arno', 'Arno_River', 'River_Arno', 'Avon', 'River_Avon', 'Upper_Avon', 'Upper_Avon_River', 'Avon', 'River_Avon', 'bar', 'Bighorn', 'Bighorn_River', 'Big_Sioux_River', 'billabong', 'bluff', 'body_of_water', 'water', 'bottomland', 'bottom', 'Brahmaputra', 'Brahmaputra_River', 'branch', 'Brazos', 'Brazos_River', 'brook', 'creek', 'Caloosahatchee', 'Caloosahatchee_River', 'Cam', 'River_Cam', 'Cam_River', 'Canadian', 'Canadian_River', 'canyon', 'canon', 'Cape_Fear_River', 'channel', 'Chao_Phraya', 'Charles', 'Charles_River', 'Chattahoochee', 'Chattahoochee_River', 'Cimarron', 'Cimarron_River', 'Clinch_River', 'Clyde', 'Cocytus', 'River_Cocytus', 'Colorado', 'Colorado_River', 'Colorado', 'Colorado_River', 'Columbia', 'Columbia_River', 'Congo', 'Congo_River', 'Zaire_River', 'Connecticut', 'Connecticut_River', 'Coosa', 'Coosa_River', 'Cumberland', 'Cumberland_River', 'dale', 'Danube', 'Danube_River', 'Danau', 'Darling', 'Darling_River', 'Delaware', 'Delaware_River', 'delta', 'Demerara', 'Detroit_River', 'distributary', 'Dnieper', 'Dnieper_River', 'Don', 'Don_River', 'Ebro', 'Ebro_River', 'Elbe', 'Elbe_River', 'Elizabeth_River', 'estuary', 'Euphrates', 'Euphrates_River', 'Flint', 'Flint_River', 'floodplain', 'flood_plain', 'Forth', 'Forth_River', 'Fox_River', 'Ganges', 'Ganges_River', 'Gan_Jiang', 'Kan_River', 'Garonne', 'Garonne_River', 'Gila', 'Gila_River', 'gorge', 'Grand_River', 'Green', 'Green_River', 'headstream', 'Housatonic', 'Housatonic_River', 'Huang_He', 'Hwang_Ho', 'Yellow_River', 'Hudson', 'Hudson_River', 'IJssel', 'IJssel_river', 'Illinois_River', 'Indigirka', 'Indigirka_River', 'Indus', 'Indus_River', 'Irrawaddy', 'Irrawaddy_River', 'Irtish', 'Irtish_River', 'Irtysh', 'Irtysh_River', 'Isere', 'Isere_River', 'James', 'James_River', 'James', 'James_River', 'Jordan', 'Jordan_River', 'Kansas', 'Kansas_River', 'Kaw_River', 'Kasai', 'Kasai_River', 'River_Kasai', 'Kissimmee', 'Kissimmee_River', 'Klamath', 'Klamath_River', 'Kura', 'Kura_River', 'Lake_Chad', 'Chad', 'Lehigh_River', 'Lena', 'Lena_River', 'Lethe', 'River_Lethe', 'liman', 'Limpopo', 'Crocodile_River', 'Little_Bighorn', 'Little_Bighorn_River', 'Little_Horn', 'Little_Missouri', 'Little_Missouri_River', 'Little_Sioux_River', 'Little_Wabash', 'Little_Wabash_River', 'Loire', 'Loire_River', 'Mackenzie', 'Mackenzie_River', 'Madeira', 'Madeira_River', 'Magdalena', 'Magdalena_River', 'meander', 'Mekong', 'Mekong_River', 'Merrimack', 'Merrimack_River', 'Meuse', 'Meuse_River', 'Milk', 'Milk_River', 'Mississippi', 'Mississippi_River', 'Missouri', 'Missouri_River', 'Mobile', 'Mobile_River', 'Mohawk_River', 'Monongahela', 'Monongahela_River', 'Moreau_River', 'Murray', 'Murray_River', 'Murrumbidgee', 'Murrumbidgee_River', 'Namoi', 'Namoi_River', 'Nan', 'Nan_River', 'Neckar', 'Neckar_River', 'Neosho', 'Neosho_River', 'Neva', 'Neva_River', 'New_River', 'Niagara', 'Niagara_River', 'Niger', 'Niger_River', 'Nile', 'Nile_River', 'North_Platte', 'North_Platte_River', 'Ob', 'Ob_River', 'Oder', 'Oder_River', 'Ohio', 'Ohio_River', 'Orange', 'Orange_River', 'Orinoco', 'Orinoco_River', 'Osage', 'Osage_River', 'Outaouais', 'Ottawa', 'Ottawa_river', 'Ouachita', 'Ouachita_River', 'Ouse', 'Ouse_River', 'oxbow', 'oxbow_lake', 'Parana', 'Parana_River', 'Parnaiba', 'Parnahiba', 'Pearl_River', 'Pee_Dee', 'Pee_Dee_River', 'Penobscot', 'Penobscot_River', 'Ping', 'Ping_River', 'Platte', 'Platte_River', 'Po', 'Po_River', 'Potomac', 'Potomac_River', 'Purus', 'Purus_River', 'rapid', 'Rappahannock', 'Rappahannock_River', 'Rhine', 'Rhine_River', 'Rhein', 'Rhone', 'Rhone_River', 'Rio_Grande', 'Rio_Bravo', 'riparian_forest', 'riverbank', 'riverside', 'riverbed', 'river_bottom', 'river_boulder', 'Russian_River', 'Saale', 'Saale_River', 'Sabine', 'Sabine_River', 'Sacramento_River', 'Saint_John', 'Saint_John_River', 'St._John', 'St._John_River', 'Saint_Johns', 'Saint_Johns_River', 'St._Johns', 'St._Johns_River', 'Saint_Lawrence', 'Saint_Lawrence_River', 'St._Lawrence', 'St._Lawrence_River', 'Sambre', 'Sambre_River', 'sandbank', 'San_Joaquin_River', 'Sao_Francisco', 'Saone', 'Saone_River', 'Savannah', 'Savannah_River', 'Scheldt', 'Scheldt_River', 'Seine', 'Seine_River', 'Severn', 'River_Severn', 'Severn_River', 'Severn', 'Severn_River', 'Seyhan', 'Seyhan_River', 'Shari', 'Shari_River', 'Chari', 'Chari_River', 'Shenandoah_River', 'Styx', 'River_Styx', 'Sun_River', 'Suriname_River', 'Surinam_River', 'Susquehanna', 'Susquehanna_River', 'Tagus', 'Tagus_River', 'Tallapoosa', 'Tallapoosa_River', 'Tennessee', 'Tennessee_River', 'Thames', 'River_Thames', 'Thames_River', 'Tiber', 'Tevere', 'Tigris', 'Tigris_River', 'Tocantins', 'Tocantins_River', 'Tombigbee', 'Tombigbee_River', 'Trent', 'River_Trent', 'Trent_River', 'Trinity_River', 'Tunguska', 'Lower_Tunguska', 'Tunguska', 'Stony_Tunguska', 'Tyne', 'River_Tyne', 'Tyne_River', 'Urubupunga', 'Urubupunga_Falls', 'Uruguay_River', 'valley', 'vale', 'Vetluga', 'Vetluga_River', 'Vistula', 'Vistula_River', 'Volga', 'Volga_River', 'Volkhov', 'Volkhov_River', 'Volta', 'waterfall', 'falls', 'water_system', 'Weser', 'Weser_River', 'Willamette', 'Willamette_River', 'Yalu', 'Yalu_River', 'Chang_Jiang', 'Changjiang', 'Chang', 'Yangtze', 'Yangtze_River', 'Yangtze_Kiang', 'Yazoo', 'Yazoo_River', 'Yenisei', 'Yenisei_River', 'Yenisey', 'Yenisey_River', 'Yukon', 'Yukon_River', 'Zambezi', 'Zambezi_River', 'Zhu_Jiang', 'Canton_River', 'Chu_Kiang', 'Pearl_River', 'Charon', 'naiad', 'Achilles', 'finisher', 'Algonkian', 'Algonkin', 'Arikara', 'Aricara', 'Chinook', 'Conoy', 'Halchidhoma', 'Hidatsa', 'Gros_Ventre', 'Kansa', 'Kansas', 'Karok', 'Maidu', 'Maricopa', 'Missouri', 'Mohave', 'Mojave', 'Ofo', 'Omaha', 'Maha', 'Osage', 'Oto', 'Otoe', 'Pamlico', 'Ponca', 'Ponka', 'Quapaw', 'Shahaptian', 'Sahaptin', 'Sahaptino', 'Shawnee', 'Tsimshian', 'Walapai', 'Hualapai', 'Hualpai', 'Yeniseian', 'Yakut', 'charioteer', 'driver', 'honker', 'lasher', 'mahout', 'nondriver', 'road_hog', 'roadhog', 'speeder', 'speed_demon', 'tailgater', 'teamster', 'test_driver', 'wagoner', 'waggoner', 'Cartier', 'Jacques_Cartier', 'Oldfield', 'Barney_Oldfield', 'Berna_Eli_Oldfield', 'debacle', 'bald_cypress', 'swamp_cypress', 'pond_bald_cypress', 'southern_cypress', 'Taxodium_distichum', 'Montezuma_cypress', 'Mexican_swamp_cypress', 'Taxodium_mucronatum', 'pistia', 'water_lettuce', 'water_cabbage', 'Pistia_stratiotes', 'Pistia_stratoites', 'great_yellowcress', 'Rorippa_amphibia', 'Nasturtium_amphibium', 'giant_reed', 'Arundo_donax', 'Phragmites', 'genus_Phragmites', 'black_birch', 'river_birch', 'red_birch', 'Betula_nigra', 'river_red_gum', 'river_gum', 'Eucalyptus_camaldulensis', 'Eucalyptus_rostrata', 'false_indigo', 'bastard_indigo', 'Amorpha_fruticosa', 'thermal_pollution', 'water_pollution', 'alluvial_soil', 'Senegal_gum', 'silt']

But it seems like there are still some noise from the "brute force" method. 但是,“蛮力”方法似乎仍然有一些噪音。 Let's try to assume that if it is the name of the river it should start with an uppercase, so let's try: 让我们尝试假设,如果它是河流的名字,那么它应该以大写字母开头,所以让我们尝试:

list(chain(*[ [j for j in i.lemma_names if j[0].isupper()] for i in wn.all_synsets() if "river" in i.definition]))

[out]: [OUT]:

['Acheronian', 'Acherontic', 'Stygian', 'Lao', 'Aegospotami', 'Aegospotamos', 'Yalu_River', 'Gasterosteus_aculeatus', 'Gasterosteus_pungitius', 'Cryptobranchus_alleganiensis', 'Scaphiopus_bombifrons', 'Pseudemys_concinna', 'Trionyx_spiniferus', 'Trionyx_muticus', 'Anas_acuta', 'Ancylus', 'Palaemon_australis', 'Platanistidae', 'Hippopotamus_amphibius', 'Australian_lungfish', 'Queensland_lungfish', 'Neoceratodus_forsteri', 'Alosa_pseudoharengus', 'Pomolobus_pseudoharengus', 'Oncorhynchus_nerka', 'Salmo_trutta', 'Australian_arowana', 'Dawson_River_salmon', 'Scleropages_leichardti', 'Australian_bonytongue', 'Scleropages_jardinii', 'Roccus_saxatilis', 'Tilapia_nilotica', 'Chinese_paddlefish', 'Psephurus_gladis', 'Augean_stables', 'Lake_Volta', 'L-plate', 'Phillips_screwdriver', 'Copehan', 'Volgaic', 'Teamsters_Union', 'Mammoth_Cave_National_Park', 'Zion_National_Park', 'Yangon', 'Rangoon', "N'Djamena", 'Ndjamena', 'Fort-Lamy', 'Kinshasa', 'Leopoldville', 'Saxony', 'Sachsen', 'Saxe', 'Cologne', 'Koln', 'Mannheim', 'Rhineland', 'Rheinland', 'Ruhr', 'Ruhr_Valley', 'West_Bank', 'Pennines', 'Pennine_Chain', 'Ottawa', 'Canadian_capital', 'Antwerpen', 'Antwerp', 'Anvers', 'Orleans', 'Rhone-Alpes', 'Friesland', 'Timbuktu', 'Bydgoszcz', 'Bromberg', 'Novosibirsk', 'Tbilisi', 'Tiflis', 'Toledo', 'Selma', 'Denver', 'Mile-High_City', 'Hartford', 'Savannah', 'Topeka', 'Louisville', 'New_Orleans', 'Detroit', 'Motor_City', 'Motown', 'Minneapolis', 'Saint_Paul', 'St._Paul', 'Jefferson_City', 'Saint_Louis', 'St._Louis', 'Gateway_to_the_West', 'Billings', 'Great_Falls', 'Omaha', 'Concord', 'Manchester', 'Trenton', 'Albuquerque', 'New_Netherland', 'Albany', 'Erie_Canal', 'New_York', 'New_York_City', 'Greater_New_York', 'West_Point', 'Niagara_Falls', 'Schenectady', 'Bismarck', 'Fargo', 'Cincinnati', 'Tulsa', 'Chester', 'Philadelphia', 'City_of_Brotherly_Love', 'Pierre', 'Mount_Vernon', 'Charleston', 'Huntington', 'Morgantown', 'Parkersburg', 'Wheeling', 'Casper', 'Ciudad_Bolivar', 'Aare', 'Aar', 'Aare_River', 'Acheron', 'River_Acheron', 'Adige', 'River_Adige', 'Aire', 'River_Aire', 'Aire_River', 'Alabama', 'Alabama_River', 'Allegheny', 'Allegheny_River', 'Amazon', 'Amazon_River', 'Amur', 'Amur_River', 'Heilong_Jiang', 'Heilong', 'Angara', 'Angara_River', 'Tunguska', 'Upper_Tunguska', 'Apalachicola', 'Apalachicola_River', 'Araguaia', 'Araguaia_River', 'Araguaya', 'Araguaya_River', 'Aras', 'Araxes', 'Arauca', 'Argun', 'Argun_River', 'Ergun_He', 'Arkansas', 'Arkansas_River', 'Arno', 'Arno_River', 'River_Arno', 'Avon', 'River_Avon', 'Upper_Avon', 'Upper_Avon_River', 'Avon', 'River_Avon', 'Bighorn', 'Bighorn_River', 'Big_Sioux_River', 'Brahmaputra', 'Brahmaputra_River', 'Brazos', 'Brazos_River', 'Caloosahatchee', 'Caloosahatchee_River', 'Cam', 'River_Cam', 'Cam_River', 'Canadian', 'Canadian_River', 'Cape_Fear_River', 'Chao_Phraya', 'Charles', 'Charles_River', 'Chattahoochee', 'Chattahoochee_River', 'Cimarron', 'Cimarron_River', 'Clinch_River', 'Clyde', 'Cocytus', 'River_Cocytus', 'Colorado', 'Colorado_River', 'Colorado', 'Colorado_River', 'Columbia', 'Columbia_River', 'Congo', 'Congo_River', 'Zaire_River', 'Connecticut', 'Connecticut_River', 'Coosa', 'Coosa_River', 'Cumberland', 'Cumberland_River', 'Danube', 'Danube_River', 'Danau', 'Darling', 'Darling_River', 'Delaware', 'Delaware_River', 'Demerara', 'Detroit_River', 'Dnieper', 'Dnieper_River', 'Don', 'Don_River', 'Ebro', 'Ebro_River', 'Elbe', 'Elbe_River', 'Elizabeth_River', 'Euphrates', 'Euphrates_River', 'Flint', 'Flint_River', 'Forth', 'Forth_River', 'Fox_River', 'Ganges', 'Ganges_River', 'Gan_Jiang', 'Kan_River', 'Garonne', 'Garonne_River', 'Gila', 'Gila_River', 'Grand_River', 'Green', 'Green_River', 'Housatonic', 'Housatonic_River', 'Huang_He', 'Hwang_Ho', 'Yellow_River', 'Hudson', 'Hudson_River', 'IJssel', 'IJssel_river', 'Illinois_River', 'Indigirka', 'Indigirka_River', 'Indus', 'Indus_River', 'Irrawaddy', 'Irrawaddy_River', 'Irtish', 'Irtish_River', 'Irtysh', 'Irtysh_River', 'Isere', 'Isere_River', 'James', 'James_River', 'James', 'James_River', 'Jordan', 'Jordan_River', 'Kansas', 'Kansas_River', 'Kaw_River', 'Kasai', 'Kasai_River', 'River_Kasai', 'Kissimmee', 'Kissimmee_River', 'Klamath', 'Klamath_River', 'Kura', 'Kura_River', 'Lake_Chad', 'Chad', 'Lehigh_River', 'Lena', 'Lena_River', 'Lethe', 'River_Lethe', 'Limpopo', 'Crocodile_River', 'Little_Bighorn', 'Little_Bighorn_River', 'Little_Horn', 'Little_Missouri', 'Little_Missouri_River', 'Little_Sioux_River', 'Little_Wabash', 'Little_Wabash_River', 'Loire', 'Loire_River', 'Mackenzie', 'Mackenzie_River', 'Madeira', 'Madeira_River', 'Magdalena', 'Magdalena_River', 'Mekong', 'Mekong_River', 'Merrimack', 'Merrimack_River', 'Meuse', 'Meuse_River', 'Milk', 'Milk_River', 'Mississippi', 'Mississippi_River', 'Missouri', 'Missouri_River', 'Mobile', 'Mobile_River', 'Mohawk_River', 'Monongahela', 'Monongahela_River', 'Moreau_River', 'Murray', 'Murray_River', 'Murrumbidgee', 'Murrumbidgee_River', 'Namoi', 'Namoi_River', 'Nan', 'Nan_River', 'Neckar', 'Neckar_River', 'Neosho', 'Neosho_River', 'Neva', 'Neva_River', 'New_River', 'Niagara', 'Niagara_River', 'Niger', 'Niger_River', 'Nile', 'Nile_River', 'North_Platte', 'North_Platte_River', 'Ob', 'Ob_River', 'Oder', 'Oder_River', 'Ohio', 'Ohio_River', 'Orange', 'Orange_River', 'Orinoco', 'Orinoco_River', 'Osage', 'Osage_River', 'Outaouais', 'Ottawa', 'Ottawa_river', 'Ouachita', 'Ouachita_River', 'Ouse', 'Ouse_River', 'Parana', 'Parana_River', 'Parnaiba', 'Parnahiba', 'Pearl_River', 'Pee_Dee', 'Pee_Dee_River', 'Penobscot', 'Penobscot_River', 'Ping', 'Ping_River', 'Platte', 'Platte_River', 'Po', 'Po_River', 'Potomac', 'Potomac_River', 'Purus', 'Purus_River', 'Rappahannock', 'Rappahannock_River', 'Rhine', 'Rhine_River', 'Rhein', 'Rhone', 'Rhone_River', 'Rio_Grande', 'Rio_Bravo', 'Russian_River', 'Saale', 'Saale_River', 'Sabine', 'Sabine_River', 'Sacramento_River', 'Saint_John', 'Saint_John_River', 'St._John', 'St._John_River', 'Saint_Johns', 'Saint_Johns_River', 'St._Johns', 'St._Johns_River', 'Saint_Lawrence', 'Saint_Lawrence_River', 'St._Lawrence', 'St._Lawrence_River', 'Sambre', 'Sambre_River', 'San_Joaquin_River', 'Sao_Francisco', 'Saone', 'Saone_River', 'Savannah', 'Savannah_River', 'Scheldt', 'Scheldt_River', 'Seine', 'Seine_River', 'Severn', 'River_Severn', 'Severn_River', 'Severn', 'Severn_River', 'Seyhan', 'Seyhan_River', 'Shari', 'Shari_River', 'Chari', 'Chari_River', 'Shenandoah_River', 'Styx', 'River_Styx', 'Sun_River', 'Suriname_River', 'Surinam_River', 'Susquehanna', 'Susquehanna_River', 'Tagus', 'Tagus_River', 'Tallapoosa', 'Tallapoosa_River', 'Tennessee', 'Tennessee_River', 'Thames', 'River_Thames', 'Thames_River', 'Tiber', 'Tevere', 'Tigris', 'Tigris_River', 'Tocantins', 'Tocantins_River', 'Tombigbee', 'Tombigbee_River', 'Trent', 'River_Trent', 'Trent_River', 'Trinity_River', 'Tunguska', 'Lower_Tunguska', 'Tunguska', 'Stony_Tunguska', 'Tyne', 'River_Tyne', 'Tyne_River', 'Urubupunga', 'Urubupunga_Falls', 'Uruguay_River', 'Vetluga', 'Vetluga_River', 'Vistula', 'Vistula_River', 'Volga', 'Volga_River', 'Volkhov', 'Volkhov_River', 'Volta', 'Weser', 'Weser_River', 'Willamette', 'Willamette_River', 'Yalu', 'Yalu_River', 'Chang_Jiang', 'Changjiang', 'Chang', 'Yangtze', 'Yangtze_River', 'Yangtze_Kiang', 'Yazoo', 'Yazoo_River', 'Yenisei', 'Yenisei_River', 'Yenisey', 'Yenisey_River', 'Yukon', 'Yukon_River', 'Zambezi', 'Zambezi_River', 'Zhu_Jiang', 'Canton_River', 'Chu_Kiang', 'Pearl_River', 'Charon', 'Achilles', 'Algonkian', 'Algonkin', 'Arikara', 'Aricara', 'Chinook', 'Conoy', 'Halchidhoma', 'Hidatsa', 'Gros_Ventre', 'Kansa', 'Kansas', 'Karok', 'Maidu', 'Maricopa', 'Missouri', 'Mohave', 'Mojave', 'Ofo', 'Omaha', 'Maha', 'Osage', 'Oto', 'Otoe', 'Pamlico', 'Ponca', 'Ponka', 'Quapaw', 'Shahaptian', 'Sahaptin', 'Sahaptino', 'Shawnee', 'Tsimshian', 'Walapai', 'Hualapai', 'Hualpai', 'Yeniseian', 'Yakut', 'Cartier', 'Jacques_Cartier', 'Oldfield', 'Barney_Oldfield', 'Berna_Eli_Oldfield', 'Taxodium_distichum', 'Montezuma_cypress', 'Mexican_swamp_cypress', 'Taxodium_mucronatum', 'Pistia_stratiotes', 'Pistia_stratoites', 'Rorippa_amphibia', 'Nasturtium_amphibium', 'Arundo_donax', 'Phragmites', 'Betula_nigra', 'Eucalyptus_camaldulensis', 'Eucalyptus_rostrata', 'Amorpha_fruticosa', 'Senegal_gum']

Let's go crazy and say that only if the word "River" appears in the lemma, it is a river: 让我们疯狂地说,只有在引理中出现“河”这个词时,它才是一条河:

>>> list(chain(*[ [j for j in i.lemma_names if j[0].isupper() and "River" in j] for i in wn.all_synsets() if "river" in i.definition]))

[out]: [OUT]:

['Yalu_River', 'Dawson_River_salmon', 'Aare_River', 'River_Acheron', 'River_Adige', 'River_Aire', 'Aire_River', 'Alabama_River', 'Allegheny_River', 'Amazon_River', 'Amur_River', 'Angara_River', 'Apalachicola_River', 'Araguaia_River', 'Araguaya_River', 'Argun_River', 'Arkansas_River', 'Arno_River', 'River_Arno', 'River_Avon', 'Upper_Avon_River', 'River_Avon', 'Bighorn_River', 'Big_Sioux_River', 'Brahmaputra_River', 'Brazos_River', 'Caloosahatchee_River', 'River_Cam', 'Cam_River', 'Canadian_River', 'Cape_Fear_River', 'Charles_River', 'Chattahoochee_River', 'Cimarron_River', 'Clinch_River', 'River_Cocytus', 'Colorado_River', 'Colorado_River', 'Columbia_River', 'Congo_River', 'Zaire_River', 'Connecticut_River', 'Coosa_River', 'Cumberland_River', 'Danube_River', 'Darling_River', 'Delaware_River', 'Detroit_River', 'Dnieper_River', 'Don_River', 'Ebro_River', 'Elbe_River', 'Elizabeth_River', 'Euphrates_River', 'Flint_River', 'Forth_River', 'Fox_River', 'Ganges_River', 'Kan_River', 'Garonne_River', 'Gila_River', 'Grand_River', 'Green_River', 'Housatonic_River', 'Yellow_River', 'Hudson_River', 'Illinois_River', 'Indigirka_River', 'Indus_River', 'Irrawaddy_River', 'Irtish_River', 'Irtysh_River', 'Isere_River', 'James_River', 'James_River', 'Jordan_River', 'Kansas_River', 'Kaw_River', 'Kasai_River', 'River_Kasai', 'Kissimmee_River', 'Klamath_River', 'Kura_River', 'Lehigh_River', 'Lena_River', 'River_Lethe', 'Crocodile_River', 'Little_Bighorn_River', 'Little_Missouri_River', 'Little_Sioux_River', 'Little_Wabash_River', 'Loire_River', 'Mackenzie_River', 'Madeira_River', 'Magdalena_River', 'Mekong_River', 'Merrimack_River', 'Meuse_River', 'Milk_River', 'Mississippi_River', 'Missouri_River', 'Mobile_River', 'Mohawk_River', 'Monongahela_River', 'Moreau_River', 'Murray_River', 'Murrumbidgee_River', 'Namoi_River', 'Nan_River', 'Neckar_River', 'Neosho_River', 'Neva_River', 'New_River', 'Niagara_River', 'Niger_River', 'Nile_River', 'North_Platte_River', 'Ob_River', 'Oder_River', 'Ohio_River', 'Orange_River', 'Orinoco_River', 'Osage_River', 'Ouachita_River', 'Ouse_River', 'Parana_River', 'Pearl_River', 'Pee_Dee_River', 'Penobscot_River', 'Ping_River', 'Platte_River', 'Po_River', 'Potomac_River', 'Purus_River', 'Rappahannock_River', 'Rhine_River', 'Rhone_River', 'Russian_River', 'Saale_River', 'Sabine_River', 'Sacramento_River', 'Saint_John_River', 'St._John_River', 'Saint_Johns_River', 'St._Johns_River', 'Saint_Lawrence_River', 'St._Lawrence_River', 'Sambre_River', 'San_Joaquin_River', 'Saone_River', 'Savannah_River', 'Scheldt_River', 'Seine_River', 'River_Severn', 'Severn_River', 'Severn_River', 'Seyhan_River', 'Shari_River', 'Chari_River', 'Shenandoah_River', 'River_Styx', 'Sun_River', 'Suriname_River', 'Surinam_River', 'Susquehanna_River', 'Tagus_River', 'Tallapoosa_River', 'Tennessee_River', 'River_Thames', 'Thames_River', 'Tigris_River', 'Tocantins_River', 'Tombigbee_River', 'River_Trent', 'Trent_River', 'Trinity_River', 'River_Tyne', 'Tyne_River', 'Uruguay_River', 'Vetluga_River', 'Vistula_River', 'Volga_River', 'Volkhov_River', 'Weser_River', 'Willamette_River', 'Yalu_River', 'Yangtze_River', 'Yazoo_River', 'Yenisei_River', 'Yenisey_River', 'Yukon_River', 'Zambezi_River', 'Canton_River', 'Pearl_River']

Much better but i think you're better off just crawling the names from http://en.wikipedia.org/wiki/Lists_of_rivers . 更好,但我认为您最好只是从http://en.wikipedia.org/wiki/Lists_of_rivers检索名称。 Have fun! 玩得开心!

To show that the solution for "river" using NLTK wordnet won't scale other entities, and also answer @tripleee's question. 为了说明使用NLTK词网解决“河流”问题的方法不会扩展其他实体,还可以回答@tripleee的问题。

If you're looking for animals , you can simply recursively get all hyponyms of animals, as such: 如果您正在寻找animals ,则可以简单地递归获得所有动物的下义词,例如:

list(set([w for s in vehicle.closure(lambda s:s.hyponyms()) for w in s.lemma_names]))

We can use sentence tagging to get proper nouns. 我们可以使用句子标记来获取专有名词。 Filter them for required output. 过滤它们以获得所需的输出。

Perhaps: 也许:

  from nltk.tag import pos_tag
  sentence = "Amazon is great river. Mississippi is awesome too."
  tagged_sent = pos_tag(sentence.split())

  tagged_sent will yield similar tagged out where NNP is the proper noun.

[('Amazon', 'NNP'), ('is', 'VBZ'), ('great', 'JJ'), ('river.', 'NNP'), ('Mississippi', 'NNP'), ('is', 'VBZ'), ('awesome', 'VBN'), ('too.', '-NONE-')] [('Amazon','NNP'),('is','VBZ'),('great','JJ'),('river。','NNP'),('Mississippi','NNP' ),('is','VBZ'),('awesome','VBN'),('too。','-NONE-')]

   propernouns = [word for word,pos in tagged_sent if pos == 'NNP']
   propernouns would return
   ['Amazon', 'river.', 'Mississippi']

You can set categories for each and use a function to return them. 您可以为每个类别设置类别,并使用函数将其返回。

If you want to get lists of things of a specific type eg rivers http://dbPedia.org , http://freebase.com or http://wikidata.org are the better choice. 如果您想获取特定类型的事物的列表,例如Rivers http : //dbPedia.org,http://freebase.comhttp://wikidata.org是更好的选择。

This dbPedia SPARQL query returns all rivers known to Wikipedia: 此dbPedia SPARQL查询返回Wikipedia已知的所有河流:

SELECT ?name ?description WHERE {
   {?river rdf:type dbpedia-owl:River} .
    ?river foaf:name ?name .
    ?river rdfs:comment ?description .
}
ORDER BY ?name

http://bit.ly/1jc8Ip6 http://bit.ly/1jc8Ip6

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

相关问题 如何使用 Python 或 R 对大型文本语料库(例如职位列表)进行聚类? - How to cluster a large text corpus (e.g. list of job titles) using Python or R? 如何使用Python将电话号码列表(例如202.202.2020)格式化为(202)202-2020? - How to format a list of phone numbers e.g. 202.202.2020 into (202) 202-2020 using Python? 使用容器(例如元组或列表)进行 Numpy 切片 - Numpy slicing with container (e.g. tuple or list) 为什么分配给空列表(例如 [] = "")不是错误? - Why isn't assigning to an empty list (e.g. [] = "") an error? 可以例如pip列出extras_require的选项吗? - Can e.g. pip list the options for extras_require? 如何从 for 循环 output 创建结构(例如列表)? - How to Create a structure (e.g. a list) from a for loop output? Python:如何使用列表在列表之间进行复制,例如 list1[index1] = list2[index2] - Python: how to copy between lists using lists, e.g. list1[index1] = list2[index2] 如何将字符串(例如'A')引用到更大列表的索引(例如['A','B','C','D',...])? - How can I reference a string (e.g. 'A') to the index of a larger list (e.g. ['A', 'B', 'C', 'D', ...])? 关于 python 列表中错误的快速提问 * 常量,例如 [a, b, c, d, e ] * x - Quick question on error in python list * a constant e.g. [a , b, c, d , e ] * x 我们如何 label 矩阵 plot 与预定义的标签列表(例如素数列表)? - How do we label a matrix plot with a predefined list of labels (e.g. a list of prime numbers)?
 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM