
How to scrape data from a website and write it to a CSV in a specified format in R?

I am trying to scrape data from https://www.booking.com/country.html.

The idea is to extract the number of listings for every kind of accommodation listed for each country.

The output needs to have the list of all countries in column A of an Excel file, with the number of listings for each property type (e.g. apartments, hostels, resorts) in separate columns next to each country name.

I need to capture all the details for all the property types for a given country.

[Image: the required output format]

The image above describes the output format required in Excel. I am able to get the countries using the code below, but not the property types and their respective counts.

How can I get the data for all the countries iteratively in a function and write it to a CSV?

library(rvest)
library(reshape2)
library(stringr)

url <- "https://www.booking.com/country.html"

bookingdata <- read_html(url)

#extracting the country
country <- html_nodes(bookingdata, "h2 > a") %>% 
  html_text()
write.csv(country, 'D:\\web scraping\\country.csv' ,row.names = FALSE)
print(country)

#extracting the data inside the inner div
html_nodes(bookingdata, "div > div > div > ul > li > a") %>%
  html_text()

#getting all the data
accomodation <- html_nodes(bookingdata, "ul > li > a") %>%
  html_text()

#looping over the countries (this prints the same full list each time, not per-country data)
for (i in country) {
  print(i)
  print(accomodation)
}

#separating the numbers (allowing thousands separators such as "1,234")
accomodation.num <- str_extract(accomodation, "[0-9,]+")
#separating the characters
accomodation.char <- str_extract(accomodation, "[A-Za-z]+")
#separating unique characters
unique(accomodation.char)

Answer, using Python rather than R:

import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get('https://www.booking.com/country.html')
soup = BeautifulSoup(r.text, 'html.parser')

# Each flag-module block holds one country: the first link is the country
# name, the remaining links read "<count> <property type>".
data = []
for item in soup.find_all('div', attrs={'class': 'block_third block_third--flag-module'}):
    links = [a.text.replace('\n', '') for a in item.find_all('a')]
    data.append(links)

# One row per country; shorter rows are padded with missing values.
df = pd.DataFrame(data)
df.to_csv('output.csv')
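
The rows written by the code above still hold the count and the property type together in one string. A minimal sketch of separating them with the standard library follows. It assumes each link text has the form "<count> <property type>"; the helper name `split_row` and the sample row are hypothetical and not scraped from the site.

```python
import re

# Pattern assumed from the link text format "<count> <property type>",
# e.g. "1,234 apartments"; the sample row below is made up, not scraped.
ENTRY = re.compile(r'^\s*(?P<freq>[\d,]+)\s+(?P<category>.+?)\s*$')


def split_row(row):
    """Split one scraped row into (country, {property type: count})."""
    country, *entries = row
    counts = {}
    for text in entries:
        match = ENTRY.match(text)
        if match:  # skip links that do not look like "<count> <type>"
            counts[match['category']] = int(match['freq'].replace(',', ''))
    return country.strip(), counts


sample_row = ['Exampleland', '1,234 apartments', '56 hostels', 'not a count']
print(split_row(sample_row))
```

Converting the count to an integer drops the thousands separator, which may or may not be what you want in the final spreadsheet.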


Another version, following the requirements the asker gave in chat:

import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get('https://www.booking.com/country.html')
soup = BeautifulSoup(r.text, 'html.parser')

# Each flag-module block holds one country: the first link is the country
# name, the remaining links read "<count> <property type>".
data = []
for item in soup.select('div.block_third.block_third--flag-module'):
    links = [a.text.replace('\n', '') for a in item.select('a')]
    data.append(links)

# Index the rows by country name.
df = pd.DataFrame(data).set_index(0)
df.index.name = 'location'
# Stack to one link text per row, then split it into count and property type.
split = df.stack().str.extract(r'^(?P<freq>[\d,]+)\s+(?P<category>.*)').reset_index(level=1, drop=True)
# Pivot to one row per country and one column per property type.
pvt = split.pivot(columns='category', values='freq')
pvt.sort_index(axis=1, inplace=True)
pvt.reset_index().to_csv('output2.csv', index=False)
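
For comparison, the same wide layout (one row per country, one column per property type) can be produced without pandas. This is only a sketch using the standard `csv` module, not the approach used above; `write_wide_csv` is a hypothetical helper and the counts are invented for illustration.

```python
import csv
import io


def write_wide_csv(rows, out):
    """Write one line per country and one column per property type."""
    # Collect every property type seen, sorted so the column order is stable.
    categories = sorted({cat for _, counts in rows for cat in counts})
    writer = csv.DictWriter(out, fieldnames=['location', *categories], restval='')
    writer.writeheader()
    for country, counts in rows:
        # Property types a country lacks are filled with restval (empty).
        writer.writerow({'location': country, **counts})


# Made-up counts for illustration only; these are not real Booking.com figures.
sample = [
    ('Exampleland', {'apartments': 1234, 'hostels': 56}),
    ('Samplestan', {'apartments': 7, 'resorts': 3}),
]
buffer = io.StringIO()
write_wide_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead of a buffer, pass a file opened with `open('output2.csv', 'w', newline='')`, as the `csv` module documentation recommends.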
