I want to scrape from https://id.wikipedia.org/wiki/Demografi_Indonesia . There is a table that I need to extract.
I use this script:
#import the required libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from urllib.request import urlopen
#make a request to the website
url = 'https://id.wikipedia.org/wiki/Demografi_Indonesia'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
#get the table with class 'wikitable sortable'
soup = soup.find("table",{"class":"wikitable sortable"})
#find the data with tag 'td'
cells = soup.find_all('td')
#create empty lists
bps = []
nama = []
ibu_kota = []
populasi = []
luas = []
pulau = []
#put the data into the lists based on the HTML pattern
if len(cells) > 0:
    bps = cells[0]
    bps.append(int(bps.text))
    nama = cells[2]
    nama.append(nama.text.strip())
    ibu_kota = cells[4]
    ibu_kota.append(ibu_kota.text.strip())
    populasi = cells[5]
    populasi.append(process_num(populasi.text.strip()))
    luas = cells[6]
    luas.append(process_num(luas.text.strip()))
    pulau = cells[8]
    pulau.append(pulau.text.strip())
#create a DataFrame and write it to CSV
df = pd.DataFrame(bps)
But it raises an error:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-51-6130f70f1b21> in <module>
     31 if len(cells) > 0:
     32     bps = cells[0]
---> 33     bps.append(int(bps.text))
     34
     35     nama = cells[2]
~\anaconda3\lib\site-packages\bs4\element.py in append(self, tag)
412 :param tag: A PageElement.
413 """
--> 414 self.insert(len(self.contents), tag)
415
416 def extend(self, tags):
~\anaconda3\lib\site-packages\bs4\element.py in insert(self, position, new_child)
364 new_child.extract()
365
--> 366 new_child.parent = self
367 previous_child = None
368 if position == 0:
AttributeError: 'int' object has no attribute 'parent'
The output I want is a DataFrame with the columns: BPS code (bps), Name (nama), Capital City (ibu_kota), Population (populasi), Area (luas), Island (pulau).
How can I work around this?
You can use read_html: extract the third DataFrame from the returned list with [2], select columns by position with DataFrame.iloc, and set the column names with a list:
url = 'https://id.wikipedia.org/wiki/Demografi_Indonesia'
pos = [0,2,4,5,6,8]
df = pd.read_html(url)[2].iloc[:, pos]
df.columns = ['bps','nama','ibu_kota','populasi','luas','pulau']
print (df.head())
bps nama ibu_kota populasi luas pulau
0 11 Aceh Banda Aceh 4.494.410 56.50051 Sumatra
1 12 Sumatra Utara Medan 12.982.204 72.42781 Sumatra
2 13 Sumatra Barat Padang 4.846.909 42.22465 Sumatra
3 14 Riau Pekanbaru 5.538.367 87.84423 Sumatra
4 15 Jambi Jambi 3.092.265 45.34849 Sumatra
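Note that populasi comes back as strings with '.' as the Indonesian thousands separator, so it still needs cleaning before any numeric work, and you can then write the CSV you wanted. A minimal sketch on sample rows in the same shape as the output above (the filename demografi.csv is just an example):

```python
import pandas as pd

# Sample rows shaped like the scraped table (values taken from the output above)
df = pd.DataFrame({
    'bps': [11, 12],
    'nama': ['Aceh', 'Sumatra Utara'],
    'populasi': ['4.494.410', '12.982.204'],
})

# Indonesian number formatting uses '.' as the thousands separator,
# so strip the dots before converting to integers
df['populasi'] = df['populasi'].str.replace('.', '', regex=False).astype(int)

# Write the cleaned data to CSV without the index column
df.to_csv('demografi.csv', index=False)
```

The same str.replace/astype pattern applies to the other numeric columns if you need them as numbers.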