
Reading data from a website into Pandas, but the data is not in a typical table or csv format

I am trying to read data from a website into Pandas, but the data on the website does not come in a standard table or CSV format. Here is the link to the data:

http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data

Note that the "rows" you see in the link are not the actual rows for the input dataset. Instead, each set of 10 "rows" on the webpage is a single row in the input dataset. Each space in the data is supposed to indicate a delimiter for a new column. The input dataset has 294 rows and 76 columns.

Here are the first two rows of the input dataset as they appear on the webpage. Note that each row of the input dataset ends with the word "name" as its last value:

1254 0 40 1 1 0 0
-9 2 140 0 289 -9 -9 -9
0 -9 -9 0 12 16 84 0
0 0 0 0 150 18 -9 7
172 86 200 110 140 86 0 0
0 -9 26 20 -9 -9 -9 -9
-9 -9 -9 -9 -9 -9 -9 12
20 84 0 -9 -9 -9 -9 -9
-9 -9 -9 -9 -9 1 1 1
1 1 -9. -9. name
1255 0 49 0 1 0 0
-9 3 160 1 180 -9 -9 -9
0 -9 -9 0 11 16 84 0
0 0 0 0 -9 10 9 7
156 100 220 106 160 90 0 0
1 2 14 13 -9 -9 -9 -9
-9 -9 -9 -9 -9 -9 -9 11
20 84 1 -9 -9 2 -9 -9
-9 -9 -9 -9 -9 1 1 1
1 1 -9. -9. name

When I read the data in using pd.read_csv, Pandas thinks each row on the webpage is a single row in a dataset so I get one long column with each of these rows as strings. Instead of getting 294 rows with 76 columns, I get 2940 rows with 1 column of strings.

My desired output dataframe would put each set of 10 rows into a single row and then split all of the values by whitespace as a delimiter.

Unfortunately, pd.read_csv isn't very flexible when it comes to custom line endings (its lineterminator argument accepts only a single character). I would suggest defining your own function that reads from the file and yields one "row" at a time, where a row is everything between two occurrences of 'name'. For example:

def my_data_file_reader(file_name):
    with open(file_name) as f:      # read from your datafile
        row = []                    # store incomplete rows here
        for line in f:              # iterate through each line
            tokens = line.split()   # split the line on whitespace
            row.extend(tokens)      # add each line to row (flattened)
            # yield row and reset it if a line ends with 'name';
            # the `tokens and` guard skips blank lines instead of raising IndexError
            if tokens and tokens[-1] == 'name':
                yield row
                row = []

Then build your dataframe with pd.DataFrame instead of pd.read_csv:

import pandas as pd

df = pd.DataFrame(my_data_file_reader('datafile.data'))

If your 'datafile.data' only contains the two rows given in your example then you can expect df to look something like:

print(df)
     0  1   2  3  4  5  6   7  8    9   ...   66  67 68 69 70 71 72   73   74  \
0  1254  0  40  1  1  0  0  -9  2  140  ...   -9  -9  1  1  1  1  1  -9.  -9.
1  1255  0  49  0  1  0  0  -9  3  160  ...   -9  -9  1  1  1  1  1  -9.  -9.

     75
0  name
1  name

[2 rows x 76 columns]
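
The generator yields lists of strings, so every column of df has object dtype and the last column holds only the 'name' marker. A minimal sketch of one possible cleanup follows, assuming the marker column can be discarded and every remaining value is numeric (true for the two sample records, not verified for the whole file). The helper name iter_records and the in-memory SAMPLE are illustrative and not part of the answer above; the helper takes any iterable of lines so it can be tried without a file on disk.

```python
import io

import pandas as pd

# The two sample records from the question, kept in memory for illustration.
SAMPLE = """\
1254 0 40 1 1 0 0
-9 2 140 0 289 -9 -9 -9
0 -9 -9 0 12 16 84 0
0 0 0 0 150 18 -9 7
172 86 200 110 140 86 0 0
0 -9 26 20 -9 -9 -9 -9
-9 -9 -9 -9 -9 -9 -9 12
20 84 0 -9 -9 -9 -9 -9
-9 -9 -9 -9 -9 1 1 1
1 1 -9. -9. name
1255 0 49 0 1 0 0
-9 3 160 1 180 -9 -9 -9
0 -9 -9 0 11 16 84 0
0 0 0 0 -9 10 9 7
156 100 220 106 160 90 0 0
1 2 14 13 -9 -9 -9 -9
-9 -9 -9 -9 -9 -9 -9 11
20 84 1 -9 -9 2 -9 -9
-9 -9 -9 -9 -9 1 1 1
1 1 -9. -9. name
"""


def iter_records(lines):
    """Yield one flattened record per 'name' terminator (hypothetical helper)."""
    row = []
    for line in lines:
        tokens = line.split()
        row.extend(tokens)
        if tokens and tokens[-1] == 'name':
            yield row
            row = []


raw = pd.DataFrame(iter_records(io.StringIO(SAMPLE)))

# Drop the trailing 'name' column, then convert each column from str to a number.
clean = raw.iloc[:, :-1].apply(pd.to_numeric)
print(clean.dtypes.value_counts())
```

Passing an open file object to iter_records instead of io.StringIO(SAMPLE) should work the same way, since both iterate line by line.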
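
Alternatively, download the file, join all of its lines into one string, split that string on 'name', and let pd.read_csv parse the result:

import io
import urllib.request

import pandas as pd

link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data'

# Download the file, decode the bytes, and join all physical lines into one string.
with urllib.request.urlopen(link) as response:
    ln1 = response.read().decode().replace('\n', ' ')

# Each record ends with 'name', so split there and put one record per line.
records = '\n'.join(ln1.split('name'))

# sep=r'\s+' splits on any run of whitespace (replaces the deprecated delim_whitespace=True).
# No header=None is passed, so the first record (id 1254) becomes the column names, as shown below.
df = pd.read_csv(io.StringIO(records), sep=r'\s+')

df.head()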

   1254  0  40  1  1.1  0.1  0.2  -9  2  140  ...    -9.26  -9.27  -9.28  1.2  1.3  1.4  1.5  1.6  -9.  -9..1      
0  1255  0  49  0    1    0    0  -9  3  160  ...       -9     -9     -9    1    1    1    1    1 -9.0   -9.0      
1  1256  0  37  1    1    0    0  -9  2  130  ...       -9     -9     -9    1    1    1    1    1 -9.0   -9.0      
2  1257  0  48  0    1    1    1  -9  4  138  ...       -9      2     -9    1    1    1    1    1 -9.0   -9.0      
3  1258  0  54  1    1    0    1  -9  3  150  ...       -9      1     -9    1    1    1    1    1 -9.0   -9.0      
4  1259  0  39  1    1    0    1  -9  3  120  ...       -9     -9     -9    1    1    1    1    1 -9.0   -9.0 
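
In the output above, the first record (id 1254) has been consumed as the column header, so the frame starts at 1255 and has one row fewer than expected. A sketch of one way to keep that record as data is to pass header=None to pd.read_csv. To stay runnable without network access, the example below uses a shortened toy string rather than the real file: two records of five values each, spread over two physical lines apiece. The toy values and variable names are assumptions made for illustration only.

```python
import io

import pandas as pd

# Toy stand-in for the downloaded text (NOT the real file): two records,
# each split over two physical lines and terminated by 'name'.
text = "1254 0 40\n1 1 name\n1255 0 49\n0 1 name\n"

# Join the physical lines, then split into logical records on 'name'.
flat = text.replace('\n', ' ')
records = '\n'.join(r.strip() for r in flat.split('name') if r.strip())

# header=None keeps the first record as data; columns are numbered from 0.
df = pd.read_csv(io.StringIO(records), sep=r'\s+', header=None)
print(df.shape)
```

Because 'name' is removed by the split, the resulting frame has one column fewer than the 76 tokens per record described in the question.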
