I'm reading data from a website into Pandas, but the data does not come in a standard table or CSV format. Here is the link to the data:
http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data
Note that the "rows" you see in the link are not the actual rows for the input dataset. Instead, each set of 10 "rows" on the webpage is a single row in the input dataset. Each space in the data is supposed to indicate a delimiter for a new column. The input dataset has 294 rows and 76 columns.
Here are the first two rows of the input dataset, as they appear on the webpage. Note that each row of the input dataset ends with the word "name" as its last value:
1254 0 40 1 1 0 0
-9 2 140 0 289 -9 -9 -9
0 -9 -9 0 12 16 84 0
0 0 0 0 150 18 -9 7
172 86 200 110 140 86 0 0
0 -9 26 20 -9 -9 -9 -9
-9 -9 -9 -9 -9 -9 -9 12
20 84 0 -9 -9 -9 -9 -9
-9 -9 -9 -9 -9 1 1 1
1 1 -9. -9. name
1255 0 49 0 1 0 0
-9 3 160 1 180 -9 -9 -9
0 -9 -9 0 11 16 84 0
0 0 0 0 -9 10 9 7
156 100 220 106 160 90 0 0
1 2 14 13 -9 -9 -9 -9
-9 -9 -9 -9 -9 -9 -9 11
20 84 1 -9 -9 2 -9 -9
-9 -9 -9 -9 -9 1 1 1
1 1 -9. -9. name
When I read the data in using pd.read_csv, Pandas treats each line on the webpage as a single row of the dataset, so I end up with one long column of strings: instead of 294 rows with 76 columns, I get 2940 rows with 1 column.
My desired output dataframe would combine each set of 10 lines into a single row and then split the values on whitespace.
Unfortunately pd.read_csv isn't very flexible when it comes to custom line endings (the line terminator can only be a single character). I would suggest defining your own generator function that reads the file and yields one "row" at a time, where a row is everything up to and including 'name'. For example:
def my_data_file_reader(file_name):
    with open(file_name) as f:       # read from your data file
        row = []                     # accumulate one incomplete row here
        for line in f:               # iterate through each physical line
            line = line.split()
            row.extend(line)         # add the line's values to the row (flattened)
            if line and line[-1] == 'name':  # a line ending in 'name' completes a row
                yield row
                row = []
Then build your dataframe using pd.DataFrame instead of pd.read_csv:

import pandas as pd

df = pd.DataFrame(my_data_file_reader('datafile.data'))
If your 'datafile.data' contains only the two rows from your example, then you can expect df to look something like:
print(df)
0 1 2 3 4 5 6 7 8 9 ... 66 67 68 69 70 71 72 73 74 \
0 1254 0 40 1 1 0 0 -9 2 140 ... -9 -9 1 1 1 1 1 -9. -9.
1 1255 0 49 0 1 0 0 -9 3 160 ... -9 -9 1 1 1 1 1 -9. -9.
75
0 name
1 name
[2 rows x 76 columns]
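As a quick self-contained sanity check of the reader (repeating the function from above, and feeding it a hypothetical two-record sample with much shorter rows than the real file):

```python
import os
import tempfile

def my_data_file_reader(file_name):
    # same reader as above: accumulate tokens until a line ends with 'name'
    with open(file_name) as f:
        row = []
        for line in f:
            tokens = line.split()
            row.extend(tokens)
            if tokens and tokens[-1] == 'name':
                yield row
                row = []

# hypothetical two-record sample, each record spread over two physical lines
sample = "1 2 3\n4 5 name\n6 7 8\n9 10 name\n"
with tempfile.NamedTemporaryFile('w', suffix='.data', delete=False) as tmp:
    tmp.write(sample)

rows = list(my_data_file_reader(tmp.name))
os.unlink(tmp.name)
print(rows)
# each record is flattened into one list ending with 'name'
```

Each yielded list is one logical row, regardless of how many physical lines it spanned in the file.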
Alternatively, you can fetch and reshape the file directly from the URL:

import io, re
import urllib.request
import pandas as pd

link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data'
raw = urllib.request.urlopen(link).read().decode('utf-8')  # fetch the whole file as one string
flat = raw.replace('\n', ' ')                              # join each 10-line record into one stream
df = pd.read_csv(io.StringIO('\n'.join(re.split('name', flat))), sep=r'\s+')
df.head()
1254 0 40 1 1.1 0.1 0.2 -9 2 140 ... -9.26 -9.27 -9.28 1.2 1.3 1.4 1.5 1.6 -9. -9..1
0 1255 0 49 0 1 0 0 -9 3 160 ... -9 -9 -9 1 1 1 1 1 -9.0 -9.0
1 1256 0 37 1 1 0 0 -9 2 130 ... -9 -9 -9 1 1 1 1 1 -9.0 -9.0
2 1257 0 48 0 1 1 1 -9 4 138 ... -9 2 -9 1 1 1 1 1 -9.0 -9.0
3 1258 0 54 1 1 0 1 -9 3 150 ... -9 1 -9 1 1 1 1 1 -9.0 -9.0
4 1259 0 39 1 1 0 1 -9 3 120 ... -9 -9 -9 1 1 1 1 1 -9.0 -9.0
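Note that the output above starts at record 1255: read_csv used the first record (1254 ...) as the header row. A hedged sketch of the fix is to pass header=None, shown here on a tiny inline sample (shortened, hypothetical records) rather than the live URL:

```python
import io
import re

import pandas as pd

# two flattened records as they look after joining the physical lines
# (shortened to three values each for illustration)
flat = "1254 0 40 name 1255 0 49 name"
records = [r for r in re.split(r'\s*name\s*', flat) if r.strip()]
# header=None keeps the first record as data instead of column labels
df = pd.read_csv(io.StringIO('\n'.join(records)), sep=r'\s+', header=None)
print(df.shape)
```

With header=None the full file should yield all 294 rows instead of 293.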