
Scraping for a span using Python and BeautifulSoup returns nothing

I'm trying to extract a specific piece of text from this link:

http://www1.folha.uol.com.br/fsp/mercado/index-20121030.shtml

I wrote this function to find and extract a piece of text:

def manchete_11112011_30102012(b):
    soup = make_soup(b)
    data = [span.string for span in soup.find("font")]
    noticias = [b.text for b in soup.findAll("a")]
    return {"noticias": noticias,
            "data": data}

My problem is with the "data" line. When it runs, it returns nothing useful: when I write "span.string" it returns "[None]", and when I write "span.text" it returns "[u'']".

Here is the HTML code I'm looking for. I need the text content inside <span id="spanLongDate"> :

<td width="430" align="right"><font size="1"><span id="spanLongDate">São Paulo, terça-feira, 30 de outubro de 2012</span></font><img src="images/mercado.gif" hspace="10" alt="Mercado"></td>

Is there any other way I could extract the text? I mean, am I writing the code wrong, or is the text format not compatible? And what does "[u'']" mean?

To find the element with id="spanLongDate", use the following fragment:

# get the span you are looking for
span = soup.find("span", attrs={"id": "spanLongDate"})

# get the text out of the span
data = span.get_text()

Please note that .find returns only the first match. If you need to find multiple instances, use .find_all.
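
As a small offline illustration of the difference between .find and .find_all, here is a sketch that runs against an inline HTML string instead of the live page. The markup is made up for the example (the second span and its id "spanOther" are hypothetical), and "html.parser" is simply the parser that ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; only spanLongDate comes from the question.
html = """
<table><tr>
<td><font size="1"><span id="spanLongDate">São Paulo, terça-feira, 30 de outubro de 2012</span></font></td>
<td><font size="1"><span id="spanOther">Mercado</span></font></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag, or None when nothing matches.
span = soup.find("span", attrs={"id": "spanLongDate"})
data = span.get_text() if span is not None else None

# find_all returns a list with every matching tag.
all_spans = [s.get_text() for s in soup.find_all("span")]

print(data)
print(all_spans)
```

Checking for None before calling get_text() avoids an AttributeError when the span is missing from the page.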

ETA:

Based on your comment below, I looked at the page source and ran the code on my machine. Here is a function that dumps out what BeautifulSoup is seeing. This is helpful because it sometimes doesn't see what you see when you view the source in the browser.

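import urllib.request

from bs4 import BeautifulSoup


def dumpPage():
    url = "http://www1.folha.uol.com.br/fsp/mercado/index-20121030.shtml"
    print("url is: " + url)
    page = urllib.request.urlopen(url)

    # Name the parser explicitly so the result doesn't depend on what is installed.
    soup = BeautifulSoup(page.read(), "html.parser")
    print("read soup")
    # Print the document as BeautifulSoup parsed it.
    print(soup)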

When I printed it out and searched for "spanLongDate" I got the following fragment of interest.

<td align="right" width="430"><font size="1"><span id="spanLongDate"></span></font><img alt="Mercado" hspace="10" src="images/mercado.gif"/></td>

This has no São Paulo text in it. I then hit F12 in my Chrome browser to look at the raw source, and there was also no text in the spanLongDate <span>.

Perhaps the page was updated?
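
An empty span would also account for the two results in the question. The sketch below reproduces the empty fragment offline, assuming the built-in "html.parser": .string is None because the tag has no child at all, and get_text() (or .text) is an empty string. Under Python 2 an empty unicode string is displayed as u'', which is most likely what the "[u'']" output was.

```python
from bs4 import BeautifulSoup

# The fragment as BeautifulSoup saw it: the span exists but is empty.
empty_html = '<font size="1"><span id="spanLongDate"></span></font>'

soup = BeautifulSoup(empty_html, "html.parser")
span = soup.find("span", attrs={"id": "spanLongDate"})

string_value = span.string    # None, because the tag has no child
text_value = span.get_text()  # empty string, because there is no text to join

print(repr(string_value))
print(repr(text_value))
```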

If you only want the date, you should look for it in other places. If you dump out the soup and then search for 2012, you will see it in a number of places. It is easy to get it out of the title with the following code.

import urllib.request

from bs4 import BeautifulSoup

url = "http://www1.folha.uol.com.br/fsp/mercado/index-20121030.shtml"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")
# The page title contains the date.
theDateTag = soup.find("title")
theDateString = theDateTag.get_text()
print(theDateString)
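
Another place that appears to carry the date is the URL itself. The following is only a sketch, and it assumes that the eight digits after "index-" encode the date as YYYYMMDD. That matches the "30 de outubro de 2012" shown in the question, but it is not something the site documents.

```python
import re
from datetime import datetime

url = "http://www1.folha.uol.com.br/fsp/mercado/index-20121030.shtml"

# Assumption: the file name follows the pattern index-YYYYMMDD.shtml.
match = re.search(r"index-(\d{8})\.shtml$", url)
issue_date = datetime.strptime(match.group(1), "%Y%m%d").date() if match else None

print(issue_date)
```

This needs no network request, so it can serve as a cross-check against whatever date you extract from the page.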
