
How to use pd.read_html() to extract table data from a website without error in Python?

I wrote a program that collects the table data from the URL below. Extracting the table with BeautifulSoup works fine, but when I pass the resulting HTML to pandas with pd.read_html(tables), I get an error and I don't know why.

Here is the code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

header = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

url = "https://www.worldometers.info/coronavirus/#countries"
req = requests.get(url, headers=header)

# check the server response (client permission):
# it returns 200, so access is not forbidden

# BeautifulSoup: extract the table from the HTML page at the link
soup = BeautifulSoup(req.content, 'lxml')
tables = soup.find('table',{'id':'main_table_countries_today'})
df = pd.read_html(tables)
print(df)

After running it:

Traceback (most recent call last):
  File "c:\Users\pc\Desktop\manual scrape\scrape 1\covid find all data.py", line 18, in <module>
    df = pd.read_html(tables)
  File "C:\Users\pc\AppData\Roaming\Python\Python39\site-packages\pandas\util\_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\pc\AppData\Roaming\Python\Python39\site-packages\pandas\io\html.py", line 1085, in read_html
    return _parse(
  File "C:\Users\pc\AppData\Roaming\Python\Python39\site-packages\pandas\io\html.py", line 893, in _parse
    tables = p.parse_tables()
  File "C:\Users\pc\AppData\Roaming\Python\Python39\site-packages\pandas\io\html.py", line 213, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "C:\Users\pc\AppData\Roaming\Python\Python39\site-packages\pandas\io\html.py", line 717, in _build_doc
    r = parse(self.io, parser=parser)
  File "C:\Users\pc\AppData\Roaming\Python\Python39\site-packages\lxml\html\__init__.py", line 939, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src\lxml\etree.pyx", line 3521, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1875, in lxml.etree._parseDocument
TypeError: 'NoneType' object is not callable
PS C:\Users\pc\Desktop\manual scrape\scrape 1>

The program should print the table data at run time using pd.read_html(), like this:

[screenshot of the table]

pd.read_html() expects a string, a path, or a file-like object, not a BeautifulSoup Tag. Convert the tag to a string before passing it to pd.read_html():

import requests
from bs4 import BeautifulSoup
import pandas as pd

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
}

url = "https://www.worldometers.info/coronavirus/#countries"
req = requests.get(url, headers=header)

# check the server response (client permission):
# it returns 200, so access is not forbidden

# BeautifulSoup: extract the table from the HTML page at the link
soup = BeautifulSoup(req.content, "lxml")
tables = soup.find("table", {"id": "main_table_countries_today"})
df = pd.read_html(str(tables).upper())[0]  # <-- convert the soup to str first
print(df)

Prints (all in capitals, because of the .upper() call):

         #           COUNTRY,OTHER  TOTALCASES  NEWCASES  TOTALDEATHS  NEWDEATHS  TOTALRECOVERED  NEWRECOVERED  ACTIVECASES  SERIOUS,CRITICAL  TOT CASES/1M POP  DEATHS/1M POP   TOTALTESTS  TESTS/ 1M POP    POPULATION          CONTINENT  1 CASEEVERY X PPL  1 DEATHEVERY X PPL  1 TESTEVERY X PPL  NEW CASES/1M POP  NEW DEATHS/1M POP  ACTIVE CASES/1M POP
0      NaN           NORTH AMERICA    48516635  142956.0    1000760.0     2587.0      37927770.0       60945.0    9588105.0           32950.0               NaN            NaN          NaN            NaN           NaN      NORTH AMERICA                NaN                 NaN                NaN               NaN                NaN                  NaN
1      NaN                    ASIA    70629576  250387.0    1043218.0     3825.0      65875219.0      257280.0    3711139.0           40409.0               NaN            NaN          NaN            NaN           NaN               ASIA                NaN                 NaN                NaN               NaN                NaN                  NaN
2      NaN           SOUTH AMERICA    36987244   27316.0    1132455.0      757.0      34786052.0        1198.0    1068737.0           15195.0               NaN            NaN          NaN            NaN           NaN      SOUTH AMERICA                NaN                 NaN                NaN               NaN                NaN                  NaN
3      NaN                  EUROPE    55607330  135394.0    1176585.0     1575.0      50739008.0      129944.0    3691737.0           11644.0               NaN            NaN          NaN            NaN           NaN             EUROPE                NaN                 NaN                NaN               NaN                NaN                  NaN
4      NaN                  AFRICA     7902675   11584.0     197645.0      305.0       7027951.0       16775.0     677079.0            4464.0               NaN            NaN          NaN            NaN           NaN             AFRICA                NaN                 NaN                NaN               NaN                NaN                  NaN
5      NaN                 OCEANIA      166231    1768.0       2196.0        8.0        118215.0        1108.0      45820.0             237.0               NaN            NaN          NaN            NaN           NaN  AUSTRALIA/OCEANIA                NaN                 NaN                NaN               NaN                NaN                  NaN
6      NaN                     NaN         721       NaN         15.0        NaN           706.0           NaN          0.0               0.0               NaN            NaN          NaN            NaN           NaN                NaN                NaN                 NaN                NaN               NaN                NaN                  NaN
7      NaN                   WORLD   219810412  569405.0    4552874.0     9057.0     196474921.0      467250.0   18782617.0          104899.0           28200.0          584.1          NaN            NaN           NaN                ALL                NaN                 NaN                NaN               NaN                NaN                  NaN
8      1.0                     USA    40449279  115125.0     661200.0     1237.0      31175601.0       37665.0    8612478.0           25675.0          121372.0         1984.0  588098779.0      1764643.0  3.332678e+08      NORTH AMERICA                8.0               504.0                1.0            345.00               4.00              25843.0
9      2.0                   INDIA    32902293   45430.0     439900.0      341.0      32056062.0       34965.0     406331.0            8944.0           23572.0          315.0  524868734.0       376027.0  1.395828e+09               ASIA               42.0              3173.0                3.0             33.00               0.20                291.0
10     3.0                  BRAZIL    20830495   26280.0     581914.0      686.0      19775873.0           NaN     472708.0            8318.0           97192.0         2715.0   56897224.0       265475.0  2.143224e+08      SOUTH AMERICA               10.0               368.0                4.0            123.00               3.00               2206.0
11     4.0                  RUSSIA     6956318   18985.0     184812.0      798.0       6218048.0       18669.0     553458.0            2300.0           47644.0         1266.0  179500000.0      1229389.0  1.460075e+08             EUROPE               21.0               790.0                1.0            130.00               5.00               3791.0

...and so on.
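
As a variation on the fix above, the sketch below wraps the string in io.StringIO, since recent pandas releases warn when a literal HTML string is passed to pd.read_html(). It also guards against soup.find() returning None. The helper name table_to_frame and the SAMPLE_HTML snippet are made up for illustration, so that the example runs without a network request; they are not part of the original code.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
# The numbers are placeholders, not real statistics.
SAMPLE_HTML = """
<html><body>
<table id="main_table_countries_today">
  <thead><tr><th>Country</th><th>TotalCases</th></tr></thead>
  <tbody>
    <tr><td>USA</td><td>100</td></tr>
    <tr><td>India</td><td>80</td></tr>
  </tbody>
</table>
</body></html>
"""


def table_to_frame(html, table_id):
    """Return the <table> with the given id as a DataFrame (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": table_id})
    if table is None:
        # soup.find() returns None when nothing matches; fail with a clear message.
        raise ValueError(f"no <table> with id={table_id!r} found")
    # str(table) serialises the Tag back to HTML text; StringIO makes it file-like,
    # which is one of the input types pd.read_html() accepts.
    return pd.read_html(StringIO(str(table)))[0]


df = table_to_frame(SAMPLE_HTML, "main_table_countries_today")
print(df)
```

With the real page, req.text from the code above would take the place of SAMPLE_HTML.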
