
Extract text from within parentheses into a pandas DataFrame

I am fairly new to scraping data with Python and am trying to pull the data from this page into a pandas DataFrame, using the column headers shown on that page.

Right now I have the following code, which pulls the data off the page, but I can't figure out the next steps to get the data into the format I need.

import requests

url = 'https://mspotrace.org.my/Opmc_list/getCBbyfilters'

# Fetch the endpoint once and keep the response body as text
r = requests.get(url)
page = r.text

You can read the tables from the URL directly using the pandas API.

>>> import pandas as pd
>>> url = 'https://mspotrace.org.my/Opmc_list'
>>> df = pd.read_html(url)
>>> df[0]

pandas.read_html reads every table on the page and returns a list of DataFrames. In your case there is only one table at that URL, so the desired DataFrame is at index 0.

EDIT

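The response stored in page (from the code in the question) is JSON, so parse it with json.loads and pass the result to pandas:

>>> import json
>>> import pandas as pd
>>> data = json.loads(page)
>>> df = pd.DataFrame(data)
>>> df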
      draw  recordsTotal  recordsFiltered                                               data
0        0          2654             2654  [OPMC31001, Apave Malaysia Sdn Bhd, Part 3, Ka...
1        0          2654             2654  [OPMC31002, Apave Malaysia Sdn Bhd, Part 3, Ko...
2        0          2654             2654  [OPMC31003, Apave Malaysia Sdn Bhd, Part 3, Ko...
3        0          2654             2654  [OPMC31004, Apave Malaysia Sdn Bhd, Part 3, Ko...
4        0          2654             2654  [OPMC31005, Apave Malaysia Sdn Bhd, Part 3, Ko...
...    ...           ...              ...                                                ...
2649     0          2654             2654  [SCCS2333, Trans Certification Interntional Sd...
2650     0          2654             2654  [SCCS2351, TUV Rheinland Malaysia Sdn. Bhd., S...
2651     0          2654             2654  [SCCS1636, DQS Certification (M) Sdn Bhd, SCCS...
2652     0          2654             2654  [SCCS2906, TUV NORD (MALAYSIA) SDN BHD, SCCS, ...
2653     0          2654             2654  [SCCS02085, BSI Services Malaysia Sdn Bhd, SCC...

[2654 rows x 4 columns]
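
In the output above, every record still sits inside a single list in the data column. One possible next step is to build the DataFrame from the "data" key alone, so that each list element becomes its own column. This is a minimal sketch and not a tested scraper for that site: the helper name rows_to_frame, the sample payload, and the column names are all made up for illustration, and the real headers would have to be copied from the page by hand.

```python
import pandas as pd


def rows_to_frame(payload, columns=None):
    """Build a DataFrame with one column per field from a payload
    shaped like {"data": [[...], [...], ...]}.

    `columns` is an optional list of header names (hypothetical here);
    if omitted, pandas numbers the columns 0, 1, 2, ...
    """
    rows = payload["data"]  # list of lists, one inner list per table row
    frame = pd.DataFrame(rows)
    if columns is not None:
        frame.columns = columns  # raises ValueError if the count differs
    return frame


# Made-up payload with the same keys as the response shown above.
sample = {
    "draw": 0,
    "recordsTotal": 2,
    "recordsFiltered": 2,
    "data": [
        ["ID001", "Example Body A", "Part 3"],
        ["ID002", "Example Body B", "Part 4"],
    ],
}

# Placeholder header names; replace them with the ones shown on the page.
df = rows_to_frame(sample, columns=["id", "body", "part"])
print(df)
```

With the real response, the call would presumably be rows_to_frame(json.loads(page), columns=[...]), assuming every inner list has the same length as the list of headers.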


 