
Pandas read_html() returns 'nan' on a specific column

I am using pandas to scrape a website, but one whole column comes back as NaN instead of the actual values. I have tried changing several read_html() parameters, such as flavor, converters, and na_values, without success. I noticed that the HTML of the problem column differs from the others: the other cells are written as 'td class=', while the cells that are not read properly are written as 'td data-behavior='. When I simply copy and paste the table into Excel, everything comes through correctly. I would appreciate any help.

I have also tried to get the table using lxml/XPath, and that did not work either.

import pandas as pd

week_data = pd.read_html('https://www.espn.co.uk/nfl/fixtures/_/week/2/seasontype/1',
                         converters={'time': str})

The column should have strings containing the time of the match.

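They embed the date and time in the cell's data-date attribute rather than in its text, and read_html() only reads cell text, which is why the column comes back empty. So rather than resorting to Selenium, another option is to pull that attribute out and write it into the td element with BeautifulSoup before handing the HTML to read_html().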

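from io import StringIO

from bs4 import BeautifulSoup
import requests
import pandas as pd
from dateutil import parser, tz

espn_page = requests.get('https://www.espn.co.uk/nfl/fixtures/_/week/2/seasontype/1')
soup = BeautifulSoup(espn_page.content, 'html.parser')
espn_schedule = soup.find('div', {'class': 'main-content'})

# Each kickoff cell keeps its timestamp in the data-date attribute, not in its text.
for td in espn_schedule.find_all('td', {'data-behavior': 'date_time'}):
    utc = parser.parse(td.get('data-date'))
    # tz.gettz() with no argument returns the machine's local time zone.
    localtime = utc.astimezone(tz.gettz())
    # Write the local time into the cell so read_html() can pick it up.
    td.string = localtime.strftime("%I:%M")

# Recent pandas versions expect a file-like object rather than a literal HTML string.
df = pd.read_html(StringIO(str(espn_schedule)))
print(df[0].columns)
print(df[0][df[0].columns[2]])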
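
To see the effect without hitting the live site, here is a minimal offline sketch of the same idea. The HTML snippet, the column names and the timestamps are made up for illustration and are not ESPN's real markup; it only assumes that the problem cells carry their value in data-date and have no text. It also assumes pandas has a parser backend available (lxml, or bs4 with html5lib).

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup imitating the problem column:
# the time lives only in the data-date attribute, the cell text is empty.
HTML = """
<table>
  <tr><th>match</th><th>time</th></tr>
  <tr><td>A v B</td><td data-behavior="date_time" data-date="2019-09-15T17:00Z"></td></tr>
  <tr><td>C v D</td><td data-behavior="date_time" data-date="2019-09-15T20:25Z"></td></tr>
</table>
"""

# Parsed as-is, the empty cells end up as NaN.
before = pd.read_html(StringIO(HTML))[0]

# Copy the attribute value into the cell text, then parse again.
soup = BeautifulSoup(HTML, "html.parser")
for td in soup.find_all("td", {"data-behavior": "date_time"}):
    td.string = td["data-date"]

after = pd.read_html(StringIO(str(soup)))[0]

print(before["time"].isna().all())
print(after["time"].tolist())
```

The same two-step pattern (patch the soup, then call read_html() on the patched HTML) is what the code above does against the real page.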

Your code works perfectly, but what I actually need is the text that follows the 'href' attribute inside the link, which is '6:00 PM'.

So I modified your code like this:

for td in espn_schedule.find_all('a', {'data-dateformat': 'time1'}):
    td.string = td.get('href')

And I successfully get to the element I want, but I don't know how to extract the text that follows it (which is '6:00 PM'). How can I do that?
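
One possible way to do this, sketched on a made-up snippet: in BeautifulSoup, href is an attribute of the a tag, while '6:00 PM' is the tag's text, so the two are read differently. The HTML and the href value below are hypothetical stand-ins for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical cell: an <a> with an href attribute and the visible time as its text.
HTML = (
    '<td data-behavior="date_time" data-date="2019-09-15T17:00Z">'
    '<a data-dateformat="time1" href="/nfl/game?gameId=1">6:00 PM</a>'
    '</td>'
)

soup = BeautifulSoup(HTML, "html.parser")
link = soup.find("a", {"data-dateformat": "time1"})

href = link.get("href")            # attribute value
label = link.get_text(strip=True)  # text between <a ...> and </a>

print(href)
print(label)
```

Note that assigning to .string replaces the element's text, so read the text before overwriting it. If get_text() comes back empty on the HTML that requests downloads, that would suggest the time is filled in by JavaScript in the browser, which would also fit the NaN column; in that case the data-date attribute is the value to work from.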

