I want to scrape from https://id.wikipedia.org/wiki/Demografi_Indonesia . There is a table that I need to extract.
I use this script:
#import the required libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from urllib.request import urlopen
#make a request to the website
url = 'https://id.wikipedia.org/wiki/Demografi_Indonesia'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
#get the table with class 'wikitable sortable'
soup = soup.find("table",{"class":"wikitable sortable"})
#find the data with tag 'td'
cells = soup.find_all('td')
#create empty lists
bps = []
nama = []
ibu_kota = []
populasi = []
luas = []
pulau = []
#put the data into the lists based on the HTML pattern
if len(cells) > 0:
    bps = cells[0]
    bps.append(int(bps.text))
    nama = cells[2]
    nama.append(nama.text.strip())
    ibu_kota = cells[4]
    ibu_kota.append(ibu_kota.text.strip())
    populasi = cells[5]
    populasi.append(process_num(populasi.text.strip()))
    luas = cells[6]
    luas.append(process_num(luas.text.strip()))
    pulau = cells[8]
    pulau.append(pulau.text.strip())
#create a DataFrame and write it to CSV
df = pd.DataFrame(bps)
But it raises an error:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-51-6130f70f1b21> in <module>
     31 if len(cells) > 0:
     32     bps = cells[0]
---> 33     bps.append(int(bps.text))
     34
     35     nama = cells[2]
~\anaconda3\lib\site-packages\bs4\element.py in append(self, tag)
412 :param tag: A PageElement.
413 """
--> 414 self.insert(len(self.contents), tag)
415
416 def extend(self, tags):
~\anaconda3\lib\site-packages\bs4\element.py in insert(self, position, new_child)
364 new_child.extract()
365
--> 366 new_child.parent = self
367 previous_child = None
368 if position == 0:
AttributeError: 'int' object has no attribute 'parent'
The output I want is a DataFrame with the columns: BPS code (bps), Name (nama), Capital City (ibu_kota), Population (populasi), Area (luas), Island (pulau).
How can I work around this?
You can use read_html: extract the third DataFrame from the returned list with [2], select columns by position with DataFrame.iloc, and set the column names with a list:
url = 'https://id.wikipedia.org/wiki/Demografi_Indonesia'
pos = [0,2,4,5,6,8]
df = pd.read_html(url)[2].iloc[:, pos]
df.columns = ['bps','nama','ibu_kota','populasi','luas','pulau']
print (df.head())
bps nama ibu_kota populasi luas pulau
0 11 Aceh Banda Aceh 4.494.410 56.50051 Sumatra
1 12 Sumatra Utara Medan 12.982.204 72.42781 Sumatra
2 13 Sumatra Barat Padang 4.846.909 42.22465 Sumatra
3 14 Riau Pekanbaru 5.538.367 87.84423 Sumatra
4 15 Jambi Jambi 3.092.265 45.34849 Sumatra
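Note that populasi comes back as strings with '.' as the Indonesian thousands separator, so it still needs cleaning before any numeric work, and you can then write the CSV you wanted. A minimal sketch on sample rows in the same shape as the output above (the filename demografi.csv is just an example):

```python
import pandas as pd

# Sample rows shaped like the scraped table (values taken from the output above)
df = pd.DataFrame({
    'bps': [11, 12],
    'nama': ['Aceh', 'Sumatra Utara'],
    'populasi': ['4.494.410', '12.982.204'],
})

# Indonesian number formatting uses '.' as the thousands separator,
# so strip the dots before converting to integers
df['populasi'] = df['populasi'].str.replace('.', '', regex=False).astype(int)

# Write the cleaned data to CSV without the index column
df.to_csv('demografi.csv', index=False)
```

The same str.replace/astype pattern applies to the other numeric columns if you need them as numbers.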