简体   繁体   中英

Beautifulsoup can't find text

I'm trying to write a scraper in python using urllib and beautiful soup. I have a csv of URLs for news stories, and for ~80% of the pages the scraper works, but when there is a picture at the top of the story the script no longer pulls the time or the body text. I am mostly confused because soup.find and soup.find_all don't seem to produce different results. I have tried a variety of different tags that should capture the text as well as 'lxml' and 'html.parser.'

Here is the code:

testcount = 0
titles1 = []
bodies1 = []
times1 = []

data = pd.read_csv('URLsALLjun27.csv', header=None)
for url in data[0]:
try:
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")

    titlemess = soup.find(id="title").get_text() #getting the title
    titlestring = str(titlemess) #make it a string
    title = titlestring.replace("\n", "").replace("\r","")
    titles1.append(title)

    bodymess = soup.find(class_="article").get_text() #get the body with markup
    bodystring = str(bodymess) #make body a string
    body = bodystring.replace("\n", "").replace("\u3000","") #scrub markup
    bodies1.append(body) #add to list for export

    timemess = soup.find('span',{"class":"time"}).get_text()
    timestring = str(timemess)
    time = timestring.replace("\n", "").replace("\r","").replace("年", "-").replace("月","-").replace("日", "")
    times1.append(time)

    testcount = testcount +1 #counter
    print(testcount)
except Exception as e:
    print(testcount, e)

And here are some of the results I get (those marked 'nonetype' are the ones where the title was successfully pulled but body/time is empty)

1 http://news.xinhuanet.com/politics/2016-06/27/c_1119122255.htm

2 http://news.xinhuanet.com/politics/2016-05/22/c_129004569.htm 'NoneType' object has no attribute 'get_text'

Any help would be much appreciated! Thanks.

EDIT: I don't have '10 reputation points' so I can't post more links to test but will comment with them if you need more examples of pages.

The issue is that there is no class="article" on the website with the picture in it and same with the "class":"time" . Consequently, it seems that you'll have to detect whether there's a picture on the website or not and then if there is a picture, search for the date and text as follows:

For the date, try:

timemess = soup.find(id="pubtime").get_text()

For the body text, it seems that the article is rather just the caption for the picture. Consequently, you could try the following:

bodymess = soup.find('img').findNext().get_text()

In brief, the soup.find('img') finds the image and findNext() goes to the next block which, coincidentally, contains the text.

Thus, in your code, I would do something as follows:

try:
    bodymess = soup.find(class_="article").get_text()

except AttributeError:
    bodymess = soup.find('img').findNext().get_text()

try:
    timemess = soup.find('span',{"class":"time"}).get_text()

except AttributeError:
    timemess = soup.find(id="pubtime").get_text()

As a general flow in web scraping, I usually go to the website itself using a browser and find the elements in the website backend in the browser first.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM