
How to write web-scraped data to a CSV file?

I have written the following code to extract the table data using BeautifulSoup:

import requests
from bs4 import BeautifulSoup

website = requests.get('https://memeburn.com/2010/09/the-100-most-influential-news-media-twitter-accounts/').text

soup = BeautifulSoup(website, 'lxml')

table = soup.find('table')
table_rows = table.findAll('tr')

for tr in table_rows:
    td = tr.findAll('td')
    rows = [i.text for i in td]
    print(rows)

This is my output:

['Number', '@name', 'Name', 'Followers', 'Influence Rank']
[]
['1', '@mashable', 'Pete Cashmore', '2037840', '59']
[]
['2', '@cnnbrk', 'CNN Breaking News', '3224475', '71']
[]
['3', '@big_picture', 'The Big Picture', '23666', '92']
[]
['4', '@theonion', 'The Onion', '2289939', '116']
[]
['5', '@time', 'TIME.com', '2111832', '143']
[]
['6', '@breakingnews', 'Breaking News', '1795976', '147']
[]
['7', '@bbcbreaking', 'BBC Breaking News', '509756', '168']
[]
['8', '@espn', 'ESPN', '572577', '187']
[]

Please help me write this data into a .csv file (I am new to this kind of task).

Use csv.writer and write each row to the CSV file:

import requests
import csv
from bs4 import BeautifulSoup

website = requests.get('https://memeburn.com/2010/09/the-100-most-influential-news-media-twitter-accounts/').text

soup = BeautifulSoup(website, 'lxml')

table = soup.find('table')
table_rows = table.findAll('tr')

csvfile = 'twitterusers2.csv'

# Python 2: open(csvfile, 'wb')
# Python 3: newline='' prevents blank lines between rows
with open(csvfile, 'w', newline='', encoding='utf-8') as outfile:
    wr = csv.writer(outfile)

    for tr in table_rows:
        td = tr.findAll('td')
        # In Python 2 you may need i.text.encode("utf8") when handling Twitter data;
        # in Python 3 the encoding is handled by open() above
        rows = [i.text for i in td]
        # skip the empty rows and any row whose td count is not 5
        if len(rows) == 5:
            print(rows)
            wr.writerow(rows)
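
To confirm the file was written correctly, you can read it back with csv.reader (a quick optional check, reusing csvfile from above):

with open(csvfile, newline='', encoding='utf-8') as infile:
    for row in csv.reader(infile):
        print(row)  # each row comes back as a list of strings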

A better solution is to use pandas, which is built for tabular data. Here is the entire code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

website = requests.get('https://memeburn.com/2010/09/the-100-most-influential-news-media-twitter-accounts/').text

soup = BeautifulSoup(website, 'lxml')

table = soup.find('table')
table_rows = table.findAll('tr')

first = True       # the first row holds the column headers
details_dict = {}  # maps each header to its list of column values
count = 0          # index of the current column

for tr in table_rows:
    td = tr.findAll('td')
    rows = [i.text for i in td]

    for i in rows:
        if first:
            # create a dict key for each column header
            details_dict[i] = []
        else:
            # append the cell value to its matching column
            key = list(details_dict.keys())[count]
            details_dict[key].append(i)
            count += 1
    count = 0
    first = False

df = pd.DataFrame(details_dict)
df.to_csv('D:\\Output.csv', index=False)
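
For reference, the same table can be built more compactly. This is a sketch under the assumption that the first non-empty row holds the column headers:

cells = [[td.text for td in tr.findAll('td')] for tr in table_rows]
cells = [c for c in cells if c]                 # drop the empty <tr> rows
df = pd.DataFrame(cells[1:], columns=cells[0])  # first row is the header
df.to_csv('Output.csv', index=False)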

[Output screenshot]

Hope that this helps!

The easiest way is to use pandas:

# pip install pandas lxml beautifulsoup4

import pandas as pd

URI = 'https://memeburn.com/2010/09/the-100-most-influential-news-media-twitter-accounts/'

# read the first table on the page and drop empty rows
data = pd.read_html(URI, flavor='lxml', skiprows=0, header=0)[0].dropna()

# save to a CSV file named data.csv
data.to_csv('data.csv', index=False, encoding='utf-8')
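
To sanity-check the parse before relying on the file, you can preview the DataFrame:

print(data.shape)  # (rows, columns) parsed from the table
print(data.head())  # first few rows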
