
AttributeError: 'int' object has no attribute 'parent' when scraping wikipedia

I want to scrape https://id.wikipedia.org/wiki/Demografi_Indonesia. There is a table on that page that I need to extract.

I use this script:

# import the required libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from urllib.request import urlopen

# make a request to the website
url = 'https://id.wikipedia.org/wiki/Demografi_Indonesia'
html = urlopen(url) 
soup = BeautifulSoup(html, 'html.parser')

# get the table with class 'wikitable sortable'
soup = soup.find("table",{"class":"wikitable sortable"})

# find the data in 'td' tags
cells = soup.find_all('td')

# create empty lists
bps = []
nama = []
ibu_kota = []
populasi = []
luas = []
pulau = []

# put the data into the lists based on the HTML pattern
if len(cells) > 0:
    bps = cells[0]
    nama = cells[2]
    ibu_kota = cells[4]
    populasi = cells[5]
    luas = cells[6]
    pulau = cells[8]

# create a DataFrame and write it to CSV

df = pd.DataFrame(bps)

But it raises an error:

AttributeError                            Traceback (most recent call last)
<ipython-input-51-6130f70f1b21> in <module>
     31 if len(cells) > 0:
     32     bps = cells[0]
---> 33     bps.append(int(bps.text))
     35     nama = cells[2]

~\anaconda3\lib\site-packages\bs4\element.py in append(self, tag)
    412         :param tag: A PageElement.
    413         """
--> 414         self.insert(len(self.contents), tag)
    416     def extend(self, tags):

~\anaconda3\lib\site-packages\bs4\element.py in insert(self, position, new_child)
    364             new_child.extract()
--> 366         new_child.parent = self
    367         previous_child = None
    368         if position == 0:

AttributeError: 'int' object has no attribute 'parent'

The output I want has these columns: BPS code, name (Nama), capital city (Ibu Kota), population (Populasi), area (Luas), island (Pulau).

How can I work around this?
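
The traceback points at the cause: after `bps = cells[0]`, the name `bps` no longer refers to a list but to a BeautifulSoup `Tag`, so `bps.append(...)` calls `Tag.append`, which expects another page element and fails when given an `int`. If you want to keep the BeautifulSoup approach, one option is to loop over the table rows and collect one record per row. The following is only a minimal sketch: `SAMPLE_HTML` and `parse_rows` are hypothetical names, the markup is a two-row stand-in for the real page, and the column positions 0, 2, 4, 5, 6, 8 are taken from the question.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the Wikipedia table ("-" marks unused columns).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>c0</th><th>c1</th><th>c2</th><th>c3</th><th>c4</th>
      <th>c5</th><th>c6</th><th>c7</th><th>c8</th></tr>
  <tr><td>11</td><td>-</td><td>Aceh</td><td>-</td><td>Banda Aceh</td>
      <td>4.494.410</td><td>56.50051</td><td>-</td><td>Sumatra</td></tr>
  <tr><td>12</td><td>-</td><td>Sumatra Utara</td><td>-</td><td>Medan</td>
      <td>12.982.204</td><td>72.42781</td><td>-</td><td>Sumatra</td></tr>
</table>
"""

def parse_rows(html):
    """Collect one dict per table row instead of rebinding the lists."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable sortable"})
    records = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 9:  # header rows contain <th>, not <td>
            continue
        records.append({
            "bps": int(cells[0].get_text(strip=True)),
            "nama": cells[2].get_text(strip=True),
            "ibu_kota": cells[4].get_text(strip=True),
            "populasi": cells[5].get_text(strip=True),
            "luas": cells[6].get_text(strip=True),
            "pulau": cells[8].get_text(strip=True),
        })
    return pd.DataFrame(records)

df = parse_rows(SAMPLE_HTML)
print(df)
```

On the real page you would pass the downloaded HTML to `parse_rows` instead of `SAMPLE_HTML`; the row-length check may need adjusting if the live table has a different layout.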

You can use read_html and index the result with [2] to extract the third DataFrame from the returned list, select columns by position with DataFrame.iloc, and set the column names from a list:

url = 'https://id.wikipedia.org/wiki/Demografi_Indonesia'

pos = [0,2,4,5,6,8]
df = pd.read_html(url)[2].iloc[:, pos]
df.columns = ['bps','nama','ibu_kota','populasi','luas','pulau']
print (df.head())
   bps           nama    ibu_kota    populasi      luas    pulau
0   11           Aceh  Banda Aceh   4.494.410  56.50051  Sumatra
1   12  Sumatra Utara       Medan  12.982.204  72.42781  Sumatra
2   13  Sumatra Barat      Padang   4.846.909  42.22465  Sumatra
3   14           Riau   Pekanbaru   5.538.367  87.84423  Sumatra
4   15          Jambi       Jambi   3.092.265  45.34849  Sumatra
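
Note that populasi in the output above uses "." as a thousands separator, so that column is most likely stored as strings rather than numbers. A small sketch of converting it afterwards, assuming the values look like the ones printed above (the rows below are copied from that output, not fetched from the page):

```python
import pandas as pd

# Rows copied from the printed output; populasi is text such as "4.494.410".
df = pd.DataFrame({
    "nama": ["Aceh", "Sumatra Utara", "Sumatra Barat"],
    "populasi": ["4.494.410", "12.982.204", "4.846.909"],
})

# Remove the dots (literal match, not a regex), then convert to integers.
df["populasi"] = df["populasi"].str.replace(".", "", regex=False).astype(int)
print(df)
```

read_html also accepts `thousands` and `decimal` arguments, which may be a cleaner way to handle this at load time. Because "," is the default thousands separator, the luas values are worth checking against the page as well.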

