This is what I have so far, but I can't get the plain text out of the div.
from BeautifulSoup import BeautifulSoup
import urllib2

get = BeautifulSoup(urllib2.urlopen("https://example/com/").read()).findAll('div', {'class': 'h4 entry-title'})
for i in get:
    print i
How can I scrape data from this HTML? I only need the color names and the paragraphs.
<div class="h4 entry-title">
<a href="https://example/com/01/">RED</a>
</div>
<p>
I am paragraph red
</p>
<div class="h4 entry-title">
<a href="https://example.com/02/">WHITE</a>
</div>
<p>
I am paragraph white
</p>
<div class="h4 entry-title">
<a href="https://example.com/03/">PINK</a>
</div>
<p>
I am paragraph pink
</p>
My Questions:
Output I need in console:
RED I am paragraph red WHITE I am paragraph white PINK I am paragraph pink
Output I want in the database table (name, description):
name: RED,WHITE,PINK description: I am paragraph RED, I am paragraph WHITE, I am paragraph PINK
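To answer the first question: the <p> is a sibling of the div, not a child, so look it up with findNextSibling:
soup = BeautifulSoup(urllib2.urlopen("https://example/com/").read())
for div in soup.findAll('div', {'class': 'h4 entry-title'}):
    # The color name is the text of the link inside the div.
    print div.find('a').text
    # The paragraph follows the div as its next <p> sibling.
    print div.findNextSibling('p').text.strip()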
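In the markup from the question each <p> follows its div as a sibling rather than sitting inside it, so the name and the paragraph have to be paired up across that boundary. Here is a rough sketch of that pairing for Python 3, where the BeautifulSoup and urllib2 modules used above are not available. It swaps in the standard library's html.parser instead of BeautifulSoup. The class name TitleParagraphParser, the helper extract_pairs and the inline HTML string (a copy of the question's markup with the first paragraph closed) are my own assumptions for illustration, not part of the original code.

```python
from html.parser import HTMLParser

# Inline copy of the question's markup, assumed here so the sketch runs offline.
HTML = """
<div class="h4 entry-title">
<a href="https://example/com/01/">RED</a>
</div>
<p>
I am paragraph red
</p>
<div class="h4 entry-title">
<a href="https://example.com/02/">WHITE</a>
</div>
<p>
I am paragraph white
</p>
<div class="h4 entry-title">
<a href="https://example.com/03/">PINK</a>
</div>
<p>
I am paragraph pink
</p>
"""


class TitleParagraphParser(HTMLParser):
    """Pair each title link with the first <p> that follows its div."""

    def __init__(self):
        super().__init__()
        self.pairs = []         # finished (name, description) tuples
        self._in_title = False  # True while inside <div class="h4 entry-title">
        self._capture = None    # "name" or "description" while collecting text
        self._name = None       # title waiting for its paragraph
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "h4 entry-title":
            self._in_title = True
        elif tag == "a" and self._in_title:
            self._capture, self._buffer = "name", []
        elif tag == "p" and self._name is not None:
            self._capture, self._buffer = "description", []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Collapse newlines and repeated spaces into single spaces.
        text = " ".join("".join(self._buffer).split())
        if tag == "a" and self._capture == "name":
            self._name, self._capture = text, None
        elif tag == "div":
            self._in_title = False
        elif tag == "p" and self._capture == "description":
            self.pairs.append((self._name, text))
            self._name, self._capture = None, None


def extract_pairs(html):
    """Return a list of (name, description) tuples found in html."""
    parser = TitleParagraphParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


pairs = extract_pairs(HTML)
for name, description in pairs:
    print(name, description)
```

With BeautifulSoup installed, the library's sibling lookup is shorter. This version is only meant to show the pairing logic with no extra dependencies.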
Try this solution:
from BeautifulSoup import BeautifulSoup
import urllib2
(...)
connection = ...
cursor = connection.cursor()
(...)
bs = BeautifulSoup(urllib2.urlopen("https://example/com/").read())
names = []
descriptions = []
for title in bs.findAll('div', {'class': 'h4 entry-title'}):
    name = title.find('a').text
    # The <p> is a sibling of the div, not a child, so find('p') on the div would return None.
    description = title.findNextSibling('p').text.strip()
    # Upper-case the last word so "red" becomes "RED", as in the requested table output.
    sdesc = description.split()
    sdesc[-1] = sdesc[-1].upper()
    names.append(name)
    descriptions.append(' '.join(sdesc))
    print name, description
# Store all names and all descriptions as two comma-separated strings in a single row.
cursor.execute("INSERT INTO table (name, description) VALUES (%s, %s)", (','.join(names), ', '.join(descriptions)))
connection.commit()
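The snippet above writes everything into a single row of comma-separated strings. If one row per color is closer to what you want, a sketch along these lines might help. It assumes Python 3 and the standard library's sqlite3 module with an in-memory database. The table name colors, the helper save_pairs and the sample pairs list are hypothetical stand-ins for whatever your scraper and real database use.

```python
import sqlite3

# Hypothetical sample data standing in for the scraped (name, description) pairs.
pairs = [
    ("RED", "I am paragraph red"),
    ("WHITE", "I am paragraph white"),
    ("PINK", "I am paragraph pink"),
]


def save_pairs(connection, pairs):
    """Insert one row per (name, description) pair and return the stored rows."""
    cursor = connection.cursor()
    cursor.execute(
        "CREATE TABLE IF NOT EXISTS colors (name TEXT, description TEXT)"
    )
    # sqlite3 uses "?" placeholders; drivers such as MySQLdb use "%s" instead.
    cursor.executemany(
        "INSERT INTO colors (name, description) VALUES (?, ?)", pairs
    )
    connection.commit()
    cursor.execute("SELECT name, description FROM colors ORDER BY rowid")
    return cursor.fetchall()


connection = sqlite3.connect(":memory:")
rows = save_pairs(connection, pairs)
for name, description in rows:
    print(name, description)
```

Passing the values as query parameters instead of formatting them into the SQL string lets the driver handle quoting, which matters when the text comes from a scraped page.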