Scrape part of a table with BeautifulSoup

My code so far looks like this:

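from bs4 import BeautifulSoup
import csv

html = open("Greyhound Race and Breeding1.html").read()
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

output_rows = []
for table_row in table.findAll('tr'):
    columns = table_row.findAll('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    # Append each row once, after all of its cells have been collected
    output_rows.append(output_row)

# newline='' stops the csv module from writing blank lines between rows on Windows
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(output_rows)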

This gets more than I want. I only want the part of the table that follows a td with title="order in which the dogs arrived at the finish". How can I modify my code to do this?

My guess is that table = soup.find("table") should be modified so that it finds the table containing

    <td title="order in which the dogs arrived at the finish">

but I don't know how. Maybe I should somehow set table to be the parent of that td. The HTML looks roughly like this:

<table> 
<tr>
  <td>I don't want this</td>
  <td>Or this</td>
</tr>
</table>

<table> 
<tr>
  <td>I don't want this</td>
  <td>Or this</td>
</tr>
</table>
<table>
<tr>
<td title="order in which the dogs arrived at the finish"> I want this and the rest of  the document</td>
<td> More things I want</td>
</tr> 
</table>

I almost got Jack Fleeting's solution to work:

html = open("Greyhound Race and Breeding1.html").read()
soup = BeautifulSoup(html)
#table = soup.find("table")["title": "order in which the dogs arrived at the finish"]

#table = str(soup.find("table",{"title": "order in which the dogs arrived at the finish"}))
table = soup.find("table")
for table in soup.select('table'):
    if table.select_one('td[title="order in which the dogs arrived at the finish"]') is not None:
        newTable = table
output_rows = []
for table_row in newTable.findAll("tr"):
    columns = table_row.findAll("td")
    output_row = []
    for column in columns:
        output_row.append(column.text)
        output_rows.append(output_row)

with open("output8.csv", "w") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(output_rows)


The problem is that it repeats the same row several times, although it is the correct table. I tried several times to fix this, but with no luck, so I decided to switch to pandas instead:


import pandas as pd

# read_html returns a list of DataFrames, one per table found in the file
df = pd.read_html("Greyhound Race and Breeding1.html")

# This shows how many tables there are
print(len(df))

# To find the right table, I brute-forced it by printing df[i] for each table.
# It turns out the table I was looking for was df[8]
print(df[8])

# Finally, we write the table to a CSV file
df[8].to_csv("Table.csv")
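
For what it's worth, the repeated rows in the BeautifulSoup attempt most likely come from output_rows.append(output_row) sitting inside the inner "for column" loop, so the same row list is appended once per cell. A minimal sketch of the loop with the append moved out one level follows; the inline HTML and the dog names are made-up stand-ins for the real file, and new_table is just a renamed newTable.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for "Greyhound Race and Breeding1.html"
html = """
<table><tr><td>I don't want this</td><td>Or this</td></tr></table>
<table>
<tr><td title="order in which the dogs arrived at the finish">1</td><td>Dog A</td></tr>
<tr><td>2</td><td>Dog B</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep the first table that contains the marker cell
new_table = None
for table in soup.select("table"):
    if table.select_one('td[title="order in which the dogs arrived at the finish"]') is not None:
        new_table = table
        break

output_rows = []
for table_row in new_table.find_all("tr"):
    output_row = [column.text for column in table_row.find_all("td")]
    # Append once per row, not once per cell
    output_rows.append(output_row)

print(output_rows)
```

The resulting output_rows can be passed to csv.writer(...).writerows(...) exactly as before.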


If I understand you correctly, you can use CSS selectors to do this:

for table in soup.select('table'):
    target = table.select('td[title="order in which the dogs arrived at the finish"]')
    if len(target)>0:
        print(table)

If you know that only one table meets the requirement, you can use:

target = soup.select_one('td[title="order in which the dogs arrived at the finish"]')
# find_parent("table") climbs past the enclosing <tr> to the <table> itself
print(target.find_parent("table"))

Output:

<table>
<tr>
<td title="order in which the dogs arrived at the finish"> I want this and the rest of  the document</td>
<td> More things I want</td>
</tr>
</table>
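
Since the question also asks for "the rest of the document", here is a rough sketch that combines the selector with the CSV-writing step. It assumes the tables are not nested inside one another, uses made-up inline HTML instead of the real file, and writes to an in-memory buffer so it can be run as-is. The names first_table, wanted_tables and buffer are only illustrative.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the saved page
html = """
<table><tr><td>I don't want this</td><td>Or this</td></tr></table>
<table>
<tr><td title="order in which the dogs arrived at the finish">Place</td><td>Name</td></tr>
<tr><td>1</td><td>Dog A</td></tr>
</table>
<table><tr><td>Later table</td><td>Also wanted</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
target = soup.select_one('td[title="order in which the dogs arrived at the finish"]')

# The table holding the marker cell, plus every table that comes after it
first_table = target.find_parent("table")
wanted_tables = [first_table] + list(first_table.find_all_next("table"))

output_rows = []
for table in wanted_tables:
    for table_row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in table_row.find_all("td")]
        output_rows.append(cells)

# Swap the buffer for open("output.csv", "w", newline="") to write a real file
buffer = io.StringIO()
csv.writer(buffer).writerows(output_rows)
print(buffer.getvalue())
```

If only the one table is needed, dropping the find_all_next part and keeping first_table alone should be enough.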
