
What am I missing from this script to scrape a row of a table from a webpage?

As you can see, I've made my very first novice attempt to scrape this webpage. This is how I found the code: I inspected the page, found the td, and what I want is in the a href inside that td.

import requests
from bs4 import BeautifulSoup
import lxml

# URL for the table
url = 'http://services.runescape.com/m=itemdb_rs/top100?list=2'
#grab the page
html = requests.get(url).text
#import into BS
soup = BeautifulSoup(html, "lxml")
print(soup)

#find data we want, starting with first row
for item_name in soup.find_all("td", {"class": "table-item-link"}):
    print(table-item-link.text)

My objective: scrape the page, grab the name of each item, and then possibly place that name into a table. I'm not writing to CSV yet, as I'm way too novice for that. Just taking it one step at a time. For this step, I'm just trying to figure out how to grab the item name and store it in a table. Next, I will learn how to move on to the other values I want to grab from the table: the total rise, and then finally the percent change.

End goal: to be able to scrape a table like this, grab everything I need from each row, store it in my own table, and then export it to CSV. But I'm not there yet, so one step at a time!

The following should do what you need. Creating a CSV file using Python's csv library is quite straightforward. It simply takes a list of items and writes them into the file for you as correctly formatted comma-separated entries:

import requests
from bs4 import BeautifulSoup
import lxml
import csv

header = ['Item', 'Start price', 'End price', 'Total Rise', 'Change']
url = 'http://services.runescape.com/m=itemdb_rs/top100?list=2'
html = requests.get(url).text
soup = BeautifulSoup(html, "lxml")
table = soup.find("a", {"class": "table-item-link"}).parent.parent.parent

with open('prices.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)

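    for tr in table.find_all('tr'):
        # Collect the stripped text of every cell in this row
        row = [td.get_text(strip=True) for td in tr.find_all('td')]
        # Drop the second cell, which has no matching column in the header
        del row[1]
        csv_output.writerow(row)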

Giving you an output prices.csv starting as:

Item,Start price,End price,Total Rise,Change
Opal bolt tips,2,3,1,+50%
Half plain pizza,541,727,186,+34%
Poorly-cooked bird mea...,79,101,22,+27%

I use .parent.parent.parent simply to work backwards to find the start of the containing table for the entry that you were looking for.
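As a possible alternative to counting .parent hops, BeautifulSoup also provides find_parent(), which walks upwards until it reaches the first enclosing tag with a given name. The following is only a minimal sketch: the HTML is a simplified, hypothetical stand-in for the real page, and it uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the real page's markup.
html = """
<table id="prices">
  <tbody>
    <tr><td><a class="table-item-link" href="#">Opal bolt tips</a></td><td>2</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", {"class": "table-item-link"})

# Three .parent hops from the <a>: <td> -> <tr> -> <tbody>
by_hops = link.parent.parent.parent
# find_parent() climbs until it meets the first enclosing <table>
by_name = link.find_parent("table")

print(by_hops.name)  # tbody (in this snippet)
print(by_name.name)  # table
```

Which container the three hops reach depends on how the page nests its rows, whereas find_parent("table") keeps working if an extra wrapper is added. Note that a whole table may also contain header rows without any td cells, so code that indexes into each row would need to skip empty rows.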

The HTML table is composed of a number of <tr> elements (rows), each of which contains a number of <td> elements (cells). So the trick is to first find the table, then use find_all() to iterate through all of the <tr> elements inside it. Then, for each <td> element within a row, use get_text(strip=True) to extract the text inside it. strip=True removes any surrounding newlines or spaces so that you get just the text you need.

I used a Python list comprehension to create the list of values in each row. A separate for loop could also be used, and might be easier to understand initially, e.g.:

row = []

for td in tr.find_all('td'):
    row.append(td.get_text(strip=True))
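To try the row extraction in isolation, without fetching the live page, it can be run against a small inline table. This is just a sketch: the HTML below is a hypothetical, cut-down table (its values are borrowed from the sample output above), and "html.parser" is used so that nothing beyond BeautifulSoup is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical, cut-down table; the real page has more columns.
html = """
<table>
  <tr><th>Item</th><th>Start price</th><th>End price</th></tr>
  <tr><td> Opal bolt tips </td><td>2</td><td>3</td></tr>
  <tr><td>Half plain pizza</td><td>541</td><td>727</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []

for tr in soup.find("table").find_all("tr"):
    # strip=True trims the spaces around " Opal bolt tips "
    row = [td.get_text(strip=True) for td in tr.find_all("td")]
    if row:  # the header row only has <th> cells, so its list is empty
        rows.append(row)

print(rows)
```

Each entry in rows is a list of strings, which is the shape csv.writer's writerow() expects.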

Note that the advantage of using Python's csv library, rather than simply writing the information to the file yourself, is that if any of the values contains a comma, the library automatically encloses that entry in quotes for you.
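The quoting behaviour can be checked with a small experiment. This sketch writes to an in-memory io.StringIO buffer instead of a file, and the item name containing a comma is made up purely for illustration.

```python
import csv
import io

buffer = io.StringIO()  # in-memory stand-in for a file
writer = csv.writer(buffer)

writer.writerow(["Item", "Start price", "End price"])
# Hypothetical item name that contains a comma
writer.writerow(["Pizza, half plain", 541, 727])

print(buffer.getvalue())
# Item,Start price,End price
# "Pizza, half plain",541,727
```

Only the value that needs it is quoted, so a CSV reader will still see three columns in that row.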

I edited your code to get the item names:

import requests
from bs4 import BeautifulSoup
import lxml

# URL for the table
url = 'http://services.runescape.com/m=itemdb_rs/top100?list=2'
#grab the page
html = requests.get(url).text
#import into BS
soup = BeautifulSoup(html, "lxml")
#find data we want, starting with first row
# Tag is <a> not <td> as <td> is just holding the <a> tags
# You were also not using the right var name in your for loop
for item_name in soup.find_all("a", {"class": "table-item-link"}):
    print(item_name.text)

To store your data easily in any format, I suggest tablib, which is well documented and can handle many formats.
