As you can see, I've made my very first novice attempt to scrape this webpage. This is how I found the code: I inspected the page, found the td, and what I want is in the a href inside that td.
import requests
from bs4 import BeautifulSoup
import lxml
# URL for the table
url = 'http://services.runescape.com/m=itemdb_rs/top100?list=2'
#grab the page
html = requests.get(url).text
#import into BS
soup = BeautifulSoup(html, "lxml")
print(soup)
#find data we want, starting with first row
for item_name in soup.find_all("td", {"class": "table-item-link"}):
    print(table-item-link.text)
My objective: scrape the page, grab the name of each item, and then place that name into a table, possibly. I'm not writing to CSV yet, as I'm way too novice for that. I'm just taking it one step at a time. For this step, I'm just trying to figure out how to grab the item name and store it in a table. Next, I will learn how to move on to the next values in the row that I want to grab: the total rise, and then finally the percent change.
End goal: to be able to scrape a table like this, grab everything I need from each row, store it in my own table, and then export it to CSV. But I'm not there yet, so one step at a time!
The following should do what you need. Creating a CSV file using Python's csv library is quite straightforward. It simply takes a list of items and writes them into the file as correctly formatted comma-separated entries:
import requests
from bs4 import BeautifulSoup
import lxml
import csv
header = ['Item', 'Start price', 'End price', 'Total Rise', 'Change']
url = 'http://services.runescape.com/m=itemdb_rs/top100?list=2'
html = requests.get(url).text
soup = BeautifulSoup(html, "lxml")
table = soup.find("a", {"class": "table-item-link"}).parent.parent.parent
with open('prices.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)
    for tr in table.find_all('tr'):
        row = [td.get_text(strip=True) for td in tr.find_all('td')]
        del row[1]
        csv_output.writerow(row)
This gives you an output file prices.csv starting as:
Item,Start price,End price,Total Rise,Change
Opal bolt tips,2,3,1,+50%
Half plain pizza,541,727,186,+34%
Poorly-cooked bird mea...,79,101,22,+27%
I use .parent.parent.parent simply to work backwards from the entry you were looking for to the start of the table that contains it.
The HTML table is composed of a load of <tr> elements, and within each are a load of <td> elements. So the trick is to first find the table, then use find_all() on it to iterate through all of the <tr> elements inside it. Then, for each <td> element within a row, use get_text(strip=True) to extract the text inside the element. strip=True removes any extra newlines or spaces to ensure you just get the text you need.
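To try the navigation on something small, here is a minimal sketch that runs against an inline HTML snippet instead of the live page. The markup is an assumption, made up to resemble the structure described above (the real page may well differ), and it uses Python's built-in html.parser so that lxml is not needed for the demonstration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the table described above.
html = """
<table>
  <tbody>
    <tr>
      <td><a class="table-item-link" href="#"> Opal bolt tips </a></td>
      <td>2</td>
      <td>3</td>
    </tr>
    <tr>
      <td><a class="table-item-link" href="#"> Half plain pizza </a></td>
      <td>541</td>
      <td>727</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Start from the first matching link and walk upwards:
# <a> -> <td> -> <tr> -> <tbody>
first_link = soup.find("a", {"class": "table-item-link"})
container = first_link.parent.parent.parent

# Collect the stripped text of every cell, one list per row.
rows = []
for tr in container.find_all("tr"):
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(container.name)
print(rows)
```

In this snippet the third .parent lands on the <tbody> element, which is enough because every <tr> of interest sits inside it. If the real page nests things differently, the number of .parent steps would need adjusting.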
I used a Python list comprehension to create the list of values in each row. A separate for loop could also be used, and might be easier to understand initially, e.g.:
row = []
for td in tr.find_all('td'):
    row.append(td.get_text(strip=True))
Note, the advantage of using Python's CSV library rather than simply writing the information yourself to the file is that if any of the values were to contain a comma in them, it would automatically correctly enclose the entry in quotes for you.
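The quoting behaviour can be checked without touching the website at all. The following is a small sketch using only the standard library; the row is a made-up example (an item name containing a comma), not data from the page.

```python
import csv
import io

# Hypothetical row: the item name itself contains a comma.
row = ["Bolts, opal", "2", "3"]

# Write the row with csv.writer into an in-memory buffer.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
written = buffer.getvalue()

# Naive approach: joining by hand yields four comma-separated fields.
naive = ",".join(row)

print(written)
print(naive)

# Reading the csv-written line back recovers the original three values.
parsed = next(csv.reader(io.StringIO(written)))
print(parsed)
```

The csv.writer line has the first entry wrapped in double quotes, so a reader splits it into three fields again, whereas the hand-joined line would be read as four.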
I edited your code to get the item names:
import requests
from bs4 import BeautifulSoup
import lxml
# URL for the table
url = 'http://services.runescape.com/m=itemdb_rs/top100?list=2'
#grab the page
html = requests.get(url).text
#import into BS
soup = BeautifulSoup(html, "lxml")
#find data we want, starting with first row
# Tag is <a> not <td> as <td> is just holding the <a> tags
# You were also not using the right var name in your for loop
for item_name in soup.find_all("a", {"class": "table-item-link"}):
    print(item_name.text)
To store your data easily in any format, I suggest tablib, which is well documented and can handle many formats.