Scraping list of links using R

I would like to scrape and extract a list of all dependent links using R. For example, consider the List of cuisines page on Wikipedia. There the cuisines are divided into regions, ethnicities, etc., which are links themselves and are further subdivided into more links and hierarchies. I would like to extract this entire hierarchy in R. Using a general regex to match links would return every link on the page, but I would like a table in which all the dependencies are listed, such as:

  1. List of Cuisines:
    • List of Asian Cuisines
    • List of European Cuisines
      • List of Central European Cuisines
        • Austrian Cuisine
        • Bulgarian Cuisine
        • Czech Cuisine
        • German Cuisine ... and so on.
    • List of Oceanic Cuisine ...

I know how to scrape data from a single web page using R, but I am fairly new to it and would like to know how to go about extracting the dependencies between links.

If this is what you are looking for, you can do the following, for example:

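library(rvest)
library(magrittr)
# Open a browsing session on the list page
# (rvest >= 1.0.0 renames html_session() to session()).
session <- html_session("https://en.wikipedia.org/wiki/List_of_cuisines")
# Text of every link inside the <ul> that is the 13th child of its parent;
# this position depends on the page layout at the time of writing.
session %>% html_nodes("ul:nth-child(13) a") %>% html_text()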
 [1] "Ainu"               "Akan"               "Arab"               "Assyrian"           "Balochi"           
 [6] "Berber"             "Buddhist"           "Bulgarian"          "Cajun"              "Chinese Islamic"   
[11] "Circassian"         "Crimean Tatar"      "Inuit"              "Italian American"   "Jewish"            
[16] "Sephardic"          "Mizrahi"            "Bukharan"           "Syrian Jewish"      "Kurdish"           
[21] "Malayali Food"      "Louisiana Creole"   "Maharashtrian"      "Mordovian"          "Native American"   
[26] "Parsi"              "Pashtun"            "Pennsylvania Dutch" "Peranakan"          "Persian cuisine"   
[31] "Punjabi"            "Rajasthani"         "Romani"             "Sami"               "Sindhi"            
[36] "Tatar"              "Yamal"              "Zanzibari"          "South Indian"    
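
Since the link text alone drops the target of each link, it may help to keep the name and the URL side by side. The following is only a minimal sketch: it parses a small hypothetical HTML fragment (a stand-in for the Wikipedia list, so it runs without network access), and the names `page`, `anchors` and `links` are illustrative rather than part of the answer.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for a fragment of the list page.
page <- read_html('<ul>
  <li><a href="/wiki/Ainu_cuisine">Ainu</a></li>
  <li><a href="/wiki/Akan_cuisine">Akan</a></li>
</ul>')

anchors <- html_nodes(page, "ul a")

# One row per link: visible text plus the absolute URL.
links <- data.frame(
  name = html_text(anchors),
  url  = url_absolute(html_attr(anchors, "href"), "https://en.wikipedia.org/"),
  stringsAsFactors = FALSE
)
print(links)
```

On the real page the same two calls, `html_text()` and `html_attr("href")`, would be applied to the nodes selected from `session`.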

If you want to dig deeper and scrape all of the links you can go on as follows:

# Relative URLs of the same links
cuisine_links <- session %>% html_nodes("ul:nth-child(13) a") %>% html_attr("href")
# Follow each link from the session (session_jump_to() in rvest >= 1.0.0)
articles <- lapply(cuisine_links, jump_to, x = session)
# First paragraph of each article
explanation <- lapply(articles, function(a) {
  a %>% html_node("p") %>% html_text()
})

This gives you a list containing the first Wikipedia explanation of each cuisine (the paragraph above the contents box):

> head(explanation)
[[1]]
[1] "Ainu cuisine is the cuisine of the ethnic Ainu in Japan. The cuisine differs markedly from that of the majority Yamato people of Japan. Raw meat like sashimi, for example, is not served in Ainu cuisine, which instead uses methods such as boiling, roasting and curing to prepare meat. The island of Hokkaidō in northern Japan is where most Ainu live today; however, they once inhabited most of the Kuril islands, the southern half of Sakhalin island, and parts of northern Honshū Island."

[[2]]
[1] "Akan cuisine, the cuisine of the Akan people, includes meat and fish (seafood) grilled over hot coals, wide and varied range of soups, stews, several kinds of starch foods, groundnut, palm, patties (or empanadas), ground corn (maize), sadza, ugali."

[[3]]
[1] "Arab cuisine (Arabic: مطبخ عربي‎) is defined as the various regional cuisines spanning the Arab world, from Mesopotamia to North-Africa. Arab cuisine often incorporates the Levantine and Egyptian culinary traditions."

[[4]]
[1] "The cuisine of the indigenous Assyrian people from northern Iraq, north eastern Syria, north western Iran and south eastern Turkey is similar to other Middle Eastern cuisines. It is rich in grains, meat, tomato, and potato. Rice is usually served with every meal accompanied by a stew which is typically poured over the rice. Tea is typically consumed at all times of the day with or without meals, alone or as a social drink. Cheese, crackers, biscuits, baklawa, or other snacks are often served alongside the tea as appetizers. Dietary restrictions may apply during Lent in which certain types of foods may not be consumed; often meaning animal-derived. Alcohol is rather popular specifically in the form of Arak and Wheat Beer. Unlike in Jewish cuisine and Islamic cuisines in the region, pork is allowed, but it is not widely consumed because of restrictions upon availability imposed by the Muslim majority."

[[5]]
[1] "Balochi cuisine refers to the food and cuisine of the Baloch people from the Balochistan region, comprising the Pakistani Balochistan province as well as Sistan and Baluchestan in Iran and Balochistan, Afghanistan. Baloch food has a regional variance in contrast to many other cuisines of Pakistan[1][2][3][4] and Iran."

[[6]]
[1] "The Amazigh (Berber) cuisine is considered as a traditional cuisine which evolved little in the course of time."
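
For the dependency table asked about in the question, one possible approach is to use the nesting of the lists themselves: each `<li>` that sits inside another `<li>` can be treated as a child of that item. The sketch below is an assumption-laden illustration, not a tested solution for the live page. It uses a small hypothetical HTML fragment with nested `<ul>` lists, and the helper names `link_of` and `parent_of` are made up for this example.

```r
library(rvest)
library(xml2)

# Hypothetical nested list standing in for the real page structure.
page <- read_html('<ul>
  <li><a href="/wiki/List_of_Asian_cuisines">List of Asian cuisines</a></li>
  <li><a href="/wiki/List_of_European_cuisines">List of European cuisines</a>
    <ul>
      <li><a href="/wiki/Central_European_cuisine">Central European cuisine</a>
        <ul>
          <li><a href="/wiki/Austrian_cuisine">Austrian cuisine</a></li>
          <li><a href="/wiki/Czech_cuisine">Czech cuisine</a></li>
        </ul>
      </li>
    </ul>
  </li>
</ul>')

items <- xml_find_all(page, "//li")

# Text of the link that is a direct child of a list item.
link_of <- function(li) xml_text(xml_find_first(li, "./a"))

# Link text of the nearest enclosing list item, or NA at the top level.
parent_of <- function(li) {
  parent_li <- xml_find_first(li, "ancestor::li[1]")
  if (inherits(parent_li, "xml_missing")) NA_character_ else link_of(parent_li)
}

# One row per list item: its parent, its own link text, and its nesting depth.
hierarchy <- data.frame(
  parent = vapply(items, parent_of, character(1)),
  child  = vapply(items, link_of, character(1)),
  depth  = vapply(items, function(li) length(xml_find_all(li, "ancestor::li")), integer(1)),
  stringsAsFactors = FALSE
)
print(hierarchy)
```

This only captures hierarchy that is expressed as nested lists within a single page. Where the sub-levels live on separate pages, the same idea would have to be combined with following each link, as in the answer above, and recording the page it was reached from.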
