Beautiful soup webscrape into mysql

The code so far downloads and prints to the screen, but how do I get that printed material into a SQL database? If I wanted to get the data into CSV files, it seems that Python (on a good day) creates the file automatically. Obviously, to transfer into MySQL I assume I would have to create a database beforehand to receive the data. My question is: how would I get the data from the scrape into the database, omitting the CSV step altogether? In anticipation I have already downloaded the PyMySQL library. Any suggestions much appreciated.. looknow

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.officialcharts.com/charts/singles-chart/19800203/7501/")

bsObj = BeautifulSoup(html)
nameList = bsObj.findAll("div", {"class": "artist"})
for name in nameList:
    print(name.get_text())

html = urlopen("http://www.officialcharts.com/charts/singles-chart/19800203/7501/")
bsObj = BeautifulSoup(html)
nameList = bsObj.findAll("div", {"class": "title"})
for name in nameList:
    print(name.get_text())

So there are a couple of things to address here.

The docs on PyMySQL are pretty good at getting you up and running.

Before you can put these things into a database, though, you need to grab them in a way that associates each artist with the corresponding song name. Right now you are getting separate lists of artists and songs, with no way to pair them. You will want to iterate over the title-artist class to do this.

I would do this like so -

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pymysql.cursors

# Webpage connection
html = urlopen("http://www.officialcharts.com/charts/singles-chart/19800203/7501/")

# Grab title-artist classes and iterate
bsObj = BeautifulSoup(html)
recordList = bsObj.findAll("div", {"class": "title-artist"})

# Now iterate over recordList to grab title and artist
for record in recordList:
    title = record.find("div", {"class": "title"}).get_text().strip()
    artist = record.find("div", {"class": "artist"}).get_text().strip()
    print(artist + ': ' + title)

This will print the title and artist for each iteration of the recordList loop.

To insert these values into a MySQL DB, I created a table called artist_song with the following:

CREATE TABLE `artist_song` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `artist` varchar(255) COLLATE utf8_bin NOT NULL,
  `song` varchar(255) COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1;
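For reference, each pass through the scraping loop issues one parameterized INSERT against this table; with the placeholders filled in, it is equivalent to a statement like this (the values are illustrative, not from the actual chart):

```sql
INSERT INTO `artist_song` (`artist`, `song`)
VALUES ('Some Artist', 'Some Song');
```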

This isn't the cleanest way to go about it, but the idea is sound. We want to open a connection to the MySQL DB (I have called my DB top_40) and insert an artist/title pair for each iteration of the recordList loop:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pymysql.cursors


# Webpage connection
html = urlopen("http://www.officialcharts.com/charts/singles-chart/19800203/7501/")

# Grab title-artist classes and store in recordList
bsObj = BeautifulSoup(html)
recordList = bsObj.findAll("div", {"class": "title-artist"})

# Create the database connection. pymysql does not have autocommit
# enabled by default, so the transaction is committed explicitly below.
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='password',
                             db='top_40',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)

# Create a pymysql cursor and iterate over each title-artist record.
# This issues an INSERT statement for each artist/title pair, commits
# the transaction after reaching the end of the list, then closes the
# database connection.
try:
    with connection.cursor() as cursor:
        for record in recordList:
            title = record.find("div", {"class": "title"}).get_text().strip()
            artist = record.find("div", {"class": "artist"}).get_text().strip()
            sql = "INSERT INTO `artist_song` (`artist`, `song`) VALUES (%s, %s)"
            cursor.execute(sql, (artist, title))
    connection.commit()
finally:
    connection.close()

Edit: Per my comment, I think it is clearer to iterate over the table rows instead:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pymysql.cursors


# Webpage connection
html = urlopen("http://www.officialcharts.com/charts/singles-chart/19800203/7501/")

bsObj = BeautifulSoup(html)

# Each chart entry is a table row; only rows containing a 'position'
# span are actual chart records.
rows = bsObj.findAll('tr')
for row in rows:
    if row.find('span', {'class': 'position'}):
        position = row.find('span', {'class': 'position'}).get_text().strip()
        artist = row.find('div', {'class': 'artist'}).get_text().strip()
        track = row.find('div', {'class': 'title'}).get_text().strip()
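Feeding these row-level values into MySQL uses the same INSERT pattern as before. As a sketch, a small helper like the hypothetical `build_insert` below (the function name, the extra `position` column, and the tuple layout are my assumptions, not part of the original answer) shows how the scraped rows could be batched for `cursor.executemany`:

```python
# Hypothetical helper: turn scraped (position, artist, track) tuples into
# a parameterized SQL statement plus the list of parameter tuples that
# cursor.executemany() expects. Assumes the artist_song table has been
# extended with a `position` column.
def build_insert(rows):
    sql = ("INSERT INTO `artist_song` (`position`, `artist`, `song`) "
           "VALUES (%s, %s, %s)")
    params = [(position, artist, track) for position, artist, track in rows]
    return sql, params

# Example with made-up chart rows:
sql, params = build_insert([("1", "Artist A", "Track A"),
                            ("2", "Artist B", "Track B")])
print(sql)
print(params)
```

With `params` built this way, a single `cursor.executemany(sql, params)` call inside the `with connection.cursor()` block inserts the whole chart before committing.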
