Python/BeautifulSoup: Scrape Data from Web Pages

Question

I am a beginner in Python programming, and I am trying to learn how to scrape web pages. I want to scrape data from this web page (the URL is in the code below).

I am trying to scrape the ISSUE DATE from the above page (you can see the ISSUE DATE if you open the web page), but I am having a problem with it.

This is the code I wrote for this.

import BeautifulSoup
import urllib2

url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=0000149.PN.&OS=PN/0000149&RS=PN/0000149"

data = urllib2.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(data)
value1 = soup.findAll('TABLE')

for value in value1:
    date1 = value.find('B').text
    print date1

Answer 1

If you add print value1 before the loop, you can see that the HTML has an error on line 37 at character 27: a closing double quote that is missing its opening double quote.

Answer 2

This is probably not optimal, but here is one way to get the issue date:

import BeautifulSoup
import urllib2

url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=0000149.PN.&OS=PN/0000149&RS=PN/0000149"

data = urllib2.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(data)
issue_date = soup.findAll('b')[5].text
print issue_date

Answer 3

BeautifulSoup needs the tag names to be in lower case. Note also that a few try/except blocks would make this a bit easier to debug. The following code seems to achieve what you want:

import BeautifulSoup
import urllib2

url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=0000149.PN.&OS=PN/0000149&RS=PN/0000149"

data = urllib2.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(data)
value1 = soup.findAll('table')
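for n, value in enumerate(value1):
    # find() returns None when a table contains no <b> tag
    date1 = value.find('b')
    try:
        print n, date1.text
    except AttributeError:
        print n
try:
    print "The winner is:", value1[3].find('b').text
except (IndexError, AttributeError):
    pass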
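The lower-case point can be seen in isolation with a small sketch. Note that this is Python 3 and uses the standard library's html.parser, not BeautifulSoup, so it only illustrates the general behaviour of an HTML parser reporting tag names in lower case. SAMPLE_HTML and TagNameCollector are made-up names for the illustration, and the fragment is not the real page.

```python
from html.parser import HTMLParser


class TagNameCollector(HTMLParser):
    """Record every start-tag name exactly as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over the tag name already converted to lower case.
        self.names.append(tag)


# Hypothetical fragment written with upper-case tags, as on many older pages.
SAMPLE_HTML = "<TABLE><TR><TD><B>March 25, 1837</B></TD></TR></TABLE>"

collector = TagNameCollector()
collector.feed(SAMPLE_HTML)
collector.close()

print(collector.names)
print("TABLE" in collector.names)  # the upper-case name is never reported
print("table" in collector.names)
```

A search for 'TABLE' against names stored this way finds nothing, which would explain why the original loop never runs.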

Answer 4

If the structure is the same across the pages, you can do this:

import BeautifulSoup
import urllib2

url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=0000149.PN.&OS=PN/0000149&RS=PN/0000149"

data = urllib2.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(data)
for td in soup.findAll('td'):
    if td.get('width') == '80%':
        print td.text
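The same idea, selecting cells by their width attribute, can be tried offline. The sketch below is Python 3 and swaps in the standard library's html.parser for BeautifulSoup; CellsByWidth and SAMPLE_HTML are hypothetical names, and the fragment is only an assumption about what such a row might look like, not a copy of the real page.

```python
from html.parser import HTMLParser


class CellsByWidth(HTMLParser):
    """Collect the text of every <td> whose width attribute matches.

    Assumes matching cells do not contain nested tables.
    """

    def __init__(self, width):
        super().__init__()
        self.width = width
        self.cells = []
        self._buffer = None  # text chunks while inside a matching <td>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a missing width gives None.
        if tag == "td" and dict(attrs).get("width") == self.width:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            self.cells.append("".join(self._buffer).strip())
            self._buffer = None


# Hypothetical fragment for illustration only.
SAMPLE_HTML = """
<table>
  <tr>
    <td width="20%">Issue Date:</td>
    <td width="80%"><b>March 25, 1837</b></td>
  </tr>
</table>
"""

parser = CellsByWidth("80%")
parser.feed(SAMPLE_HTML)
parser.close()

print(parser.cells)
```

Filtering on a presentational attribute such as width is fragile: it keeps working only as long as every page uses the same layout, which is the condition stated above.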

Answer 5

Use find() with the text argument to locate the "Issue Date:" label, then step forward to the following elements with .next, like this:

import requests
from bs4 import BeautifulSoup

url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/PTO/srchnum.htm&r=1&f=G&l=50&s1=0000149.PN.&OS=PN/0000149&RS=PN/0000149"
html = requests.get(url).content
# Locate the label text, then move two elements forward to reach the date
issue_date_zone = BeautifulSoup(html).find(text='Issue Date:')
date_str = issue_date_zone.next.next.text
print(date_str)

The result is:

March 25, 1837
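The "find the label, then take what follows" approach can also be sketched without a network call. This is a Python 3 illustration built on the standard library's html.parser rather than bs4, so it is not the code from this answer; ValueAfterLabel and SAMPLE_HTML are hypothetical, and the fragment only assumes that the date is the first non-blank text after the label.

```python
from html.parser import HTMLParser


class ValueAfterLabel(HTMLParser):
    """Capture the first non-blank text that follows a given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.value = None
        self._label_seen = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self.value is not None:
            return  # skip whitespace, and stop once a value is captured
        if self._label_seen:
            self.value = text  # first text after the label
        elif text == self.label:
            self._label_seen = True


# Hypothetical fragment for illustration only.
SAMPLE_HTML = """
<table>
  <tr>
    <td>Issue Date:</td>
    <td><b>March 25, 1837</b></td>
  </tr>
</table>
"""

finder = ValueAfterLabel("Issue Date:")
finder.feed(SAMPLE_HTML)
finder.close()

print(finder.value)
```

Anchoring on the label text instead of on a position such as the sixth b tag tends to survive layout changes better, provided the label wording itself stays the same.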

5 answers

solution1
0 2012-04-09 18:41:00

solution2
0 2012-04-09 18:41:53

solution3
0 2012-04-09 18:48:08

solution4
0 2013-11-07 21:19:03

solution5
0 2016-09-27 23:54:11
