
Can't fetch the data using Beautiful Soup

I was trying to write a simple script with Beautiful Soup that scrapes just two pieces of information from a website and generates a SQL file.

import mechanize
import urlparse
from bs4 import BeautifulSoup

op = mechanize.Browser()
op.open("https://www.mentalhelp.net/symptoms/")
for link in op.links():
print link.text
print urlparse.urljoin(link.base_url, link.url)
get = BeautifulSoup(urllib2.urlopen("https://www.mentalhelp.net/symptoms/").read()).findAll('p')
print get
print "\n"

Error:

C:\Python27>python symtoms.py
  File "symtoms.py", line 8
    print link.text
        ^
IndentationError: expected an indented block

I just want a script that scrapes those items and their short descriptions and generates a SQL file with only two fields, "name" and "sug". "name" holds the item and "sug" holds its description.

Indentation is significant in Python: it determines which statements belong to a block, such as the body of a for loop, an if statement, a while loop, or a function.

In the code you posted, the statements after the for loop are not indented, but a for loop requires at least one indented statement in its body. I assume you meant the lines below the for loop to be inside it, so you should indent them.
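To see the rule in isolation, here is a minimal sketch that is separate from the scraping script. It keeps two versions of a tiny loop as strings and asks Python to compile each one; the names BAD_SOURCE, GOOD_SOURCE and check_indentation are made up for this illustration.

```python
# Two versions of the same loop, kept as strings so both can be compiled safely.
# These names are illustrative only; they are not part of the original script.
BAD_SOURCE = "for n in [1, 2]:\nprint(n)\n"       # body is not indented
GOOD_SOURCE = "for n in [1, 2]:\n    print(n)\n"  # body is indented

def check_indentation(source):
    """Return 'ok' if the source compiles, else the name of the error raised."""
    try:
        compile(source, "<example>", "exec")
    except IndentationError as exc:
        return type(exc).__name__
    return "ok"

print(check_indentation(BAD_SOURCE))   # IndentationError
print(check_indentation(GOOD_SOURCE))  # ok
```

The only difference between the two strings is the leading whitespace before print, which is exactly the difference between the failing script and the fixed one.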

Code -

for link in op.links():
    print link.text
    print urlparse.urljoin(link.base_url, link.url)
    get = BeautifulSoup(urllib2.urlopen("https://www.mentalhelp.net/symptoms/").read()).findAll('p')
    print get
    print "\n"

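I am not sure this will produce the output you want, but it fixes your current error. Note that the script also calls urllib2.urlopen without importing urllib2, so add import urllib2 at the top to avoid a NameError once the indentation is fixed.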
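For context on what the loop prints for each link, here is a small sketch of how urljoin resolves a link against the page URL. The href values are hypothetical examples, not links taken from the site, and the helper name absolute is made up for this illustration. The import is written so it runs on Python 2 or Python 3.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as used in the script

BASE = "https://www.mentalhelp.net/symptoms/"

def absolute(href, base=BASE):
    """Resolve a link's href against the URL of the page it was found on."""
    return urljoin(base, href)

# The hrefs below are hypothetical examples.
print(absolute("/anxiety/"))              # https://www.mentalhelp.net/anxiety/
print(absolute("depression/"))            # https://www.mentalhelp.net/symptoms/depression/
print(absolute("https://example.com/x"))  # https://example.com/x
```

A root-relative href replaces the whole path, a plain relative href is appended to the base directory, and an absolute URL is returned unchanged.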


For the new requirement to get just the classic symptoms and their descriptions, you can use -

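import urllib2
from bs4 import BeautifulSoup

# Fetch and parse the page once
soup = BeautifulSoup(urllib2.urlopen("https://www.mentalhelp.net/symptoms/").read())
for div in soup.findAll('div',{'id':'page'}):
    # Each entry title is a div with the classes "h4 entry-title"
    for entrydiv in div.findAll('div',{'class':'h4 entry-title'}):
        print(entrydiv.get_text())               # symptom name
        print(entrydiv.next_sibling.get_text())  # the next node holds the description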
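The question also asks for a SQL file with the two fields "name" and "sug". One possible way to get there, as a sketch only: collect each title and description as a (name, sug) pair instead of printing them, then write one INSERT statement per pair. The table name symptoms, the TEXT column types, and the helper names below are assumptions for illustration; the sample rows are placeholders, not data from the site.

```python
def sql_quote(value):
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"

def build_sql(rows, table="symptoms"):
    """Build a SQL script from (name, sug) pairs. The table name is an assumption."""
    lines = ["CREATE TABLE %s (name TEXT, sug TEXT);" % table]
    for name, sug in rows:
        lines.append("INSERT INTO %s (name, sug) VALUES (%s, %s);"
                     % (table, sql_quote(name), sql_quote(sug)))
    return "\n".join(lines) + "\n"

def write_sql_file(rows, path):
    """Write the generated script to the given path."""
    with open(path, "w") as handle:
        handle.write(build_sql(rows))

# Placeholder rows; in the real script these would come from the scraping loop.
sample_rows = [("Anxiety", "It's a feeling of worry"), ("Insomnia", "Trouble sleeping")]
print(build_sql(sample_rows))
```

Calling write_sql_file(sample_rows, "symptoms.sql") would save the same text to a file. Doubling single quotes is the standard SQL way to escape them, but some databases also treat backslashes specially, so if the data goes straight into a database, parameterized queries are the safer choice.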

