I have written the following code to extract the table data using BeautifulSoup:
import requests
from bs4 import BeautifulSoup

website = requests.get('https://memeburn.com/2010/09/the-100-most-influential-news-media-twitter-accounts/').text
soup = BeautifulSoup(website, 'lxml')
table = soup.find('table')
table_rows = table.findAll('tr')
for tr in table_rows:
    td = tr.findAll('td')
    rows = [i.text for i in td]
    print(rows)
This is my output:
['Number', '@name', 'Name', 'Followers', 'Influence Rank']
[]
['1', '@mashable', 'Pete Cashmore', '2037840', '59']
[]
['2', '@cnnbrk', 'CNN Breaking News', '3224475', '71']
[]
['3', '@big_picture', 'The Big Picture', '23666', '92']
[]
['4', '@theonion', 'The Onion', '2289939', '116']
[]
['5', '@time', 'TIME.com', '2111832', '143']
[]
['6', '@breakingnews', 'Breaking News', '1795976', '147']
[]
['7', '@bbcbreaking', 'BBC Breaking News', '509756', '168']
[]
['8', '@espn', 'ESPN', '572577', '187']
[]
Please help me write this data into a .csv file (I am new to this kind of task).
Use the csv writer and write each row to the CSV file.
import requests
import csv
from bs4 import BeautifulSoup

website = requests.get('https://memeburn.com/2010/09/the-100-most-influential-news-media-twitter-accounts/').text
soup = BeautifulSoup(website, 'lxml')
table = soup.find('table')
table_rows = table.findAll('tr')

csvfile = 'twitterusers2.csv'
# Python 2: open(csvfile, 'wb')
# Python 3: newline='' omits the extra newline character between rows
with open(csvfile, 'w', newline='') as outfile:
    wr = csv.writer(outfile)
    for tr in table_rows:
        td = tr.findAll('td')
        # In Python 2, .encode("utf8") is sometimes mandatory when playing with Twitter data;
        # in Python 3 the plain .text is fine.
        rows = [i.text for i in td]
        # ignore the empty rows and any row whose td count is not 5
        if len(rows) == 5:
            print(rows)
            wr.writerow(rows)
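As a quick sanity check (not part of the original answer), you can read the file back with the same csv module; this is a minimal sketch, assuming twitterusers2.csv was written as above:

import csv

# Print the first few rows of the file we just wrote.
with open('twitterusers2.csv', newline='') as infile:
    reader = csv.reader(infile)
    for i, row in enumerate(reader):
        print(row)
        if i >= 2:  # header plus two data rows is enough to verify
            break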
A better solution is to use pandas, as it is faster than other libraries. Here is the entire code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

website = requests.get('https://memeburn.com/2010/09/the-100-most-influential-news-media-twitter-accounts/').text
soup = BeautifulSoup(website, 'lxml')
table = soup.find('table')
table_rows = table.findAll('tr')

first = True        # True while we are reading the header row
details_dict = {}   # column name -> list of values
count = 0           # index of the column the current cell belongs to
final_rows = []
for tr in table_rows:
    td = tr.findAll('td')
    rows = [i.text for i in td]
    # print(rows)
    for i in rows:
        if first:
            # header row: each cell becomes a dictionary key
            details_dict[i] = []
        else:
            # data row: append the cell to the matching column
            key = list(details_dict.keys())[count]
            details_dict[key].append(i)
            count += 1
    count = 0
    first = False
# print(details_dict)
df = pd.DataFrame(details_dict)
df.to_csv('D:\\Output.csv', index=False)
(Output screenshot not reproduced here.)
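Since the screenshot is not included, one way to check the result yourself (a minimal sketch, assuming the code above ran and wrote D:\Output.csv) is to reload the file and look at the first few rows:

import pandas as pd

# Reload the file written above and confirm the five expected columns are there.
check = pd.read_csv('D:\\Output.csv')
print(check.columns.tolist())
print(check.head())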
Hope that this helps!
The easiest way is to use pandas:
# pip install pandas lxml beautifulsoup4
import pandas as pd
URI = 'https://memeburn.com/2010/09/the-100-most-influential-news-media-twitter-accounts/'
# read the first table on the page and drop empty rows
data = pd.read_html(URI, flavor='lxml', skiprows=0, header=0)[0].dropna()
# save to a csv file called data.csv
data.to_csv('data.csv', index=False, encoding='utf-8')
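If the page ever exposes more than one table, read_html's match argument can select the right one; a minimal sketch, assuming the word 'Followers' appears only in the header of the table we want:

import pandas as pd

URI = 'https://memeburn.com/2010/09/the-100-most-influential-news-media-twitter-accounts/'
# match= keeps only tables whose text contains the given string,
# so [0] is the followers table rather than an unrelated one.
data = pd.read_html(URI, flavor='lxml', header=0, match='Followers')[0].dropna()
data.to_csv('data.csv', index=False, encoding='utf-8')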