
web scraping data to csv file on python, and the code to scrape a link

1 - When I check the csv file I only find data from the last link (Tugende), but when I print the data I get everything I want. How can I get all the data into the csv file?

2 - For the 'source' variable, how can I get only the article link from it and add it to the csv file?

import requests
from bs4 import BeautifulSoup as bs
import csv

url = "https://digestafrica.com/companies/{}"
startups = ['OBM-Education', 'Crafty-Workshop', 'Planet42', 'Paylend', 'Tugende']
for startup in startups:
    u = url.format(startup)
    html_text = requests.get(u).text
    soup = bs(html_text, 'lxml')

    list1 = soup.find_all('div', class_='d-flex flex-wrap content mt-24 border p-2 border-dark')
    source1 = soup.find_all('div', class_='col-md-2 mt-3 mt-lg-0')
    file = open('funding.csv', 'w', newline='')
    writer = csv.writer(file)
    mama = ['Name', 'Type', 'date', 'amount', 'investors']
    writer.writerow(mama)

    for L in list1:
        name1 = L.find('span', class_="line-height-1").text
        amount1 = L.find('div', class_='p-0').text.replace('Amount', '').strip()
        date1 = L.find('span', class_="pt-0").text
        funding_type1 = L.find('div', class_="col-md-2 mt-2 mt-lg-0").text.replace('Funding Round', '')
        investor1 = L.find('div', class_='col-md-3 mt-3 mt-lg-0').text.replace('investors', '')
        source = L.find('div', class_="col-md-2 mt-3 mt-lg-0")

        print(name1, funding_type1, date1, amount1, investor1)

        writer.writerow([name1, funding_type1, date1, amount1, investor1])
    file.close()

You only get data for the final startup because of how you are opening your output file:

    file = open('funding.csv', 'w', newline='')

This opens the file for writing, as requested, but mode 'w' places the "start of file" pointer at the very beginning and truncates any existing contents. That is fine the first time you go through the loop, but not subsequently: each later iteration wipes out what the previous one wrote.

If you really want to open the file in the loop, you'll need to use mode 'a' (for "open for writing, appending to the file if it already exists").
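A minimal sketch of the difference between the two modes (demo.txt is just a throwaway name for illustration):

import os

with open('demo.txt', 'w') as f:   # 'w' truncates the file on open
    f.write('first\n')
with open('demo.txt', 'w') as f:   # reopening with 'w' discards 'first'
    f.write('second\n')
with open('demo.txt', 'a') as f:   # 'a' keeps existing contents and appends
    f.write('third\n')
print(open('demo.txt').read())     # prints "second" then "third"; "first" is gone
os.remove('demo.txt')              # clean up the throwaway file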

Reopening the file on every pass is not efficient, however. I suggest opening the file for writing before starting your for loop, and creating the writer object there too:

file = open('funding.csv', 'w', newline='')
writer = csv.writer(file)
for startup in startups:
    ...  # do the loop operations here
file.close()

And do the close() operation on the file handle after the loop ends.
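Equivalently, a with block closes the file for you when it ends; a sketch of the same structure:

with open('funding.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'Type', 'date', 'amount', 'investors'])
    for startup in startups:
        ...  # scrape the page, then writer.writerow(...) one row per funding round
# the file is closed automatically when the with block exits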

There will be a difference between the result you see when you print(element.find()) and what you actually save from your element.
Actually, element.find() returns a bs4.element.Tag, not a str.
In your case you don't notice it, because Python applies str(element.find()) when it prints something.
You need to do the cast yourself, or it can lead to unwanted results.
Example:

from bs4 import BeautifulSoup

element = BeautifulSoup('<div></div>', 'lxml')
print(type(element.find()))       # <class 'bs4.element.Tag'>
print(type(str(element.find())))  # <class 'str'>
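The practical consequence for the csv file here (a sketch; csv.writer falls back to str() for values that are not already strings):

import csv
import io
from bs4 import BeautifulSoup

tag = BeautifulSoup('<div><a href="https://example.com">article</a></div>', 'lxml').find('a')
buf = io.StringIO()
csv.writer(buf).writerow([tag, tag['href']])
# the first field ends up as the Tag's raw HTML markup; the second is just the link
print(buf.getvalue())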

1: You should use a context manager to handle the csv file when you write to it. I've fixed your code below: first I write the header in 'w' mode (so the file is created when you first run the code), then I open it in append mode ('a') to add the data as I scrape each page.

2: You need to find the 'a' tag where the source link is, then get its href attribute like this: find('a')['href']. See below.

import requests
from bs4 import BeautifulSoup as bs
import csv

# write the header once
with open('funding.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    mama = ['Name', 'Type', 'date', 'amount', 'investors', 'source']
    writer.writerow(mama)

url = "https://digestafrica.com/companies/{}"
startups = ['OBM-Education', 'Crafty-Workshop', 'Planet42', 'Paylend', 'Tugende']

for startup in startups:

    html_text = requests.get(url.format(startup))
    soup = bs(html_text.text, 'lxml')

    for list1 in soup.find_all('div', class_='d-flex flex-wrap content mt-24 border p-2 border-dark'):
        name1 = list1.find('span', class_="line-height-1").text
        amount1 = list1.find('div', class_='p-0').text.replace('Amount', '').strip()
        date1 = list1.find('span', class_="pt-0").text
        funding_type1 = list1.find('div', class_="col-md-2 mt-2 mt-lg-0").text.replace('Funding Round', '')
        investor1 = list1.find('div', class_='col-md-3 mt-3 mt-lg-0').text.replace('investors', '')
        source = list1.find('div', class_="col-md-2 mt-3 mt-lg-0").find('a')['href']

        print(name1, funding_type1, date1, amount1, investor1, source)

        # append this row to the csv
        with open('funding.csv', 'a', newline='') as file:
            writer = csv.writer(file)
            writer.writerow([name1, funding_type1, date1, amount1, investor1, source])
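One caveat: find('a')['href'] will raise an exception on any funding round that has no source link. A defensive variant, assuming some rows may lack the link (otherwise the one-liner above is fine):

source_div = list1.find('div', class_="col-md-2 mt-3 mt-lg-0")
source_a = source_div.find('a') if source_div else None
# fall back to an empty string when the link is missing
source = source_a['href'] if source_a and source_a.has_attr('href') else ''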
