Empty CSV in web scraping - Python

I am trying to create a CSV file for every table that appears in each link on this page: http://www.admision.unmsm.edu.pe/admisionsabado/A.html

That page contains 36 links, so 36 CSV files should be generated. When I run my code, the 36 CSV files are created, but they are all empty. My code is below:

import csv
import urllib2
from bs4 import BeautifulSoup




first=urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado/A.html").read()
soup=BeautifulSoup(first)
w=[]
for q in soup.find_all('tr'):
    for link in q.find_all('a'):
        w.append(link["href"])



l=[]

for t in w:
    l.append(t.replace(".","",1))





def record (part) :


        url="http://www.admision.unmsm.edu.pe/admisionsabado".format(part)
        u=urllib2.urlopen(url)
        try:
            html=u.read()
        finally:
            u.close()
        soup=BeautifulSoup(html)
        c=[]
        for n in soup.find_all('center'):
            for b in n.find_all('a')[2:]:
                c.append(b.text)

        t=(len(c))/2
        part=part[:-6]
        name=part.replace("/","")


        with open('{}.csv'.format(name), 'wb') as f:
            writer = csv.writer(f)
            for i in range(t):
                url = "http://www.admision.unmsm.edu.pe/admisionsabado{}{}.html".format(part,i)
                u = urllib2.urlopen(url)
                try:
                    html = u.read()
                finally:
                    u.close()
                soup=BeautifulSoup(html)
                for tr in soup.find_all('tr')[1:]:
                    tds = tr.find_all('td')
                    row = [elem.text.encode('utf-8') for elem in tds[:6]]
                    writer.writerow(row)

With this for loop, I run the function above to create one CSV per link:

for n in l:
    record(n)

EDIT: Following alecxe's advice, I changed the code, and it works, but only for the first two links. After that I get HTTP Error 404: Not Found. I checked the directory and only two CSV files were created correctly. (A note on skipping such pages follows the code below.)

Here's the code:

import csv
import urllib2
from bs4 import BeautifulSoup



def record(part):
    soup = BeautifulSoup(urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado".format(part)))
    c=[]
    for n in soup.find_all('center'):
        for b in n.find_all('a')[1:]:
            c.append(b.text)

    t = (len(links)) / 2
    part = part[:-6]
    name = part.replace("/", "")

    with open('{}.csv'.format(name), 'wb') as f:
        writer = csv.writer(f)
        for i in range(t):
            url = "http://www.admision.unmsm.edu.pe/admisionsabado{}{}.html".format(part, i)
            soup = BeautifulSoup(urllib2.urlopen(url))
            for tr in soup.find_all('tr')[1:]:
                tds = tr.find_all('td')
                row = [elem.text.encode('utf-8') for elem in tds[:6]]
                writer.writerow(row)


soup = BeautifulSoup(urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado/A.html"))
links = [tr.a["href"].replace(".", "", 1) for tr in soup.find_all('tr')]

for link in links:
    record(link)
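One likely source of the HTTP Error 404 mentioned in the edit is that some of the generated page URLs simply do not exist on the server. A minimal way to skip such pages instead of aborting the whole run is to catch urllib2.HTTPError around each request; this is only a sketch and was not part of the original code:

import urllib2
from bs4 import BeautifulSoup

def fetch(url):
    # return parsed HTML, or None if the server answers with an
    # HTTP error such as 404 Not Found
    try:
        return BeautifulSoup(urllib2.urlopen(url))
    except urllib2.HTTPError as e:
        print "skipping", url, "-", e
        return None

# inside the per-page loop one would then do:
#     soup = fetch(url)
#     if soup is None:
#         continue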

Alecxe's answer: soup.find_all('center') finds nothing.

Replace:

c=[]
for n in soup.find_all('center'):
    for b in n.find_all('a')[2:]:
        c.append(b.text)

with:

c = [link.text for link in soup.find('table').find_all('a')[2:]]

Also, you can pass urllib2.urlopen(url) directly to the BeautifulSoup constructor:

soup = BeautifulSoup(urllib2.urlopen(url))
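If your version of BeautifulSoup warns that no parser was specified, you can also name one explicitly; the standard-library parser used below is just one option and an optional detail, not something the answer requires:

soup = BeautifulSoup(urllib2.urlopen(url), "html.parser")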

Also, since you have only one link per row, you can simplify the way you get the list of links. Instead of:

w=[]
for q in soup.find_all('tr'):
    for link in q.find_all('a'):
        w.append(link["href"])

do this:

links = [tr.a["href"] for tr in soup.find_all('tr')]
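If some row happened to contain no a tag, tr.a would be None and the subscript would raise a TypeError. A guarded variant (an extra safety check, not something the page above necessarily needs) would be:

links = [tr.a["href"] for tr in soup.find_all('tr') if tr.a is not None]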

Also, pay attention to how you name variables and format your code.
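Putting the suggestions together, record() might look roughly like the sketch below. It keeps the question's own URL pattern, page-count logic, and name handling, and only swaps in the extraction suggested above; the {} placeholder in the first URL is an assumption (the question's version formats a string with no placeholder), so treat this as a starting point rather than a verified fix:

import csv
import urllib2
from bs4 import BeautifulSoup

def record(part):
    # index page for one faculty; the {} placeholder is assumed here
    index_url = "http://www.admision.unmsm.edu.pe/admisionsabado{}".format(part)
    soup = BeautifulSoup(urllib2.urlopen(index_url))

    # extraction suggested in the answer above
    entries = [link.text for link in soup.find('table').find_all('a')[2:]]

    pages = len(entries) / 2      # the question assumes two links per result page
    part = part[:-6]              # drop the trailing "X.html" piece, as in the question
    name = part.replace("/", "")

    with open('{}.csv'.format(name), 'wb') as f:
        writer = csv.writer(f)
        for i in range(pages):
            url = "http://www.admision.unmsm.edu.pe/admisionsabado{}{}.html".format(part, i)
            page = BeautifulSoup(urllib2.urlopen(url))
            for tr in page.find_all('tr')[1:]:
                tds = tr.find_all('td')
                writer.writerow([td.text.encode('utf-8') for td in tds[:6]])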
