
Reading data from a website into Pandas, but the data is not in a typical table or csv format

I am trying to read data from a website into Pandas, but the data on the website does not come in a standard table or csv format. Here is the link to the data:

http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data

Note that the "rows" you see at the link are not the actual rows of the input dataset. Instead, each set of 10 "rows" on the webpage forms a single row of the input dataset. Each space in the data is a delimiter marking a new column. The input dataset has 294 rows and 76 columns.

Here are the first two rows of the input dataset as they appear on the webpage. Note that each row of the input dataset ends with the word "name" as its last value:

1254 0 40 1 1 0 0
-9 2 140 0 289 -9 -9 -9
0 -9 -9 0 12 16 84 0
0 0 0 0 150 18 -9 7
172 86 200 110 140 86 0 0
0 -9 26 20 -9 -9 -9 -9
-9 -9 -9 -9 -9 -9 -9 12
20 84 0 -9 -9 -9 -9 -9
-9 -9 -9 -9 -9 1 1 1
1 1 -9. -9. name
1255 0 49 0 1 0 0
-9 3 160 1 180 -9 -9 -9
0 -9 -9 0 11 16 84 0
0 0 0 0 -9 10 9 7
156 100 220 106 160 90 0 0
1 2 14 13 -9 -9 -9 -9
-9 -9 -9 -9 -9 -9 -9 11
20 84 1 -9 -9 2 -9 -9
-9 -9 -9 -9 -9 1 1 1
1 1 -9. -9. name

When I read the data with pd.read_csv, Pandas treats each line on the webpage as a separate row, so I get one long column with each of these lines as a string. Instead of 294 rows with 76 columns, I get 2940 rows with 1 column of strings.

My desired output dataframe would combine each set of 10 lines into a single row and then split all of the values using whitespace as the delimiter.

Unfortunately, pd.read_csv isn't very flexible when it comes to custom line endings (they can only be a single character). I would suggest defining your own function that reads from the file and yields one "row" at a time, where a row is everything between one 'name' and the next. For example:

def my_data_file_reader(file_name):
    with open(file_name) as f:      # read from your datafile
        row = []                    # store incomplete rows here
        for line in f:              # iterate through each line
            values = line.split()   # split the line on whitespace
            row.extend(values)      # add the values to the current row (flattened)
            # a line ending with 'name' completes the row; blank lines are skipped safely
            if values and values[-1] == 'name':
                yield row
                row = []

Then build your dataframe using pd.DataFrame instead of pd.read_csv:

import pandas as pd

df = pd.DataFrame(my_data_file_reader('datafile.data'))

If your 'datafile.data' contains only the two rows given in your example, you can expect df to look something like:

print(df)
     0  1   2  3  4  5  6   7  8    9   ...   66  67 68 69 70 71 72   73   74  \
0  1254  0  40  1  1  0  0  -9  2  140  ...   -9  -9  1  1  1  1  1  -9.  -9.
1  1255  0  49  0  1  0  0  -9  3  160  ...   -9  -9  1  1  1  1  1  -9.  -9.

     75
0  name
1  name

[2 rows x 76 columns]
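
Note that every value produced this way is still a string, and the last column holds only the 'name' marker. Below is a minimal sketch of one possible cleanup step, assuming the marker column can simply be dropped and every remaining value is numeric. The function name clean_frame and the small sample frame are hypothetical stand-ins used only for illustration:

```python
import pandas as pd

def clean_frame(raw):
    """Drop the trailing 'name' marker column and convert the rest to numbers."""
    # Assumption: the last column contains only the literal 'name' marker
    without_marker = raw.iloc[:, :-1]
    # pd.to_numeric parses strings such as '-9.' as floats
    return without_marker.apply(pd.to_numeric)

# Hypothetical stand-in for the string-valued frame built above
sample = pd.DataFrame([['1254', '0', '-9.', 'name'],
                       ['1255', '1', '-9.', 'name']])
cleaned = clean_frame(sample)
print(cleaned.dtypes)
```

With the real data, clean_frame(df) would be applied to the frame returned by pd.DataFrame(my_data_file_reader(...)).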
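
Alternatively, you can read the file straight from the URL, join all of its lines into one string, split that string on 'name' so each record sits on its own line, and let pd.read_csv parse the result:

import io
import urllib.request

import pandas as pd

link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data'

# Download the file and collapse all physical lines into one long string
with urllib.request.urlopen(link) as response:
    text = response.read().decode('utf-8')
flat = ' '.join(text.split())

# Each record ends with 'name', so splitting on it puts one record per line
records = '\n'.join(flat.split('name'))

# header=None keeps the first record as data instead of using it as column labels
df = pd.read_csv(io.StringIO(records), sep=r'\s+', header=None)

df.head()

      0  1   2  3  4  5  6   7  8    9  ...  65  66  67  68  69  70  71  72    73    74
0  1254  0  40  1  1  0  0  -9  2  140  ...  -9  -9  -9   1   1   1   1   1  -9.0  -9.0
1  1255  0  49  0  1  0  0  -9  3  160  ...  -9  -9  -9   1   1   1   1   1  -9.0  -9.0
2  1256  0  37  1  1  0  0  -9  2  130  ...  -9  -9  -9   1   1   1   1   1  -9.0  -9.0
3  1257  0  48  0  1  1  1  -9  4  138  ...  -9   2  -9   1   1   1   1   1  -9.0  -9.0
4  1258  0  54  1  1  0  1  -9  3  150  ...  -9   1  -9   1   1   1   1   1  -9.0  -9.0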
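
Since the question states that every record has exactly 76 whitespace-separated values, the line breaks could also be ignored altogether. Here is a rough sketch of that idea, assuming the record length really is constant throughout the file. The helper records_from_text and the tiny inline sample are hypothetical and serve only as an illustration:

```python
import pandas as pd

def records_from_text(text, n_fields):
    """Split text on any whitespace and regroup the tokens into fixed-size records."""
    tokens = text.split()
    # Guard against files whose records are not all the same length
    if len(tokens) % n_fields != 0:
        raise ValueError("token count is not a multiple of the record length")
    rows = [tokens[i:i + n_fields] for i in range(0, len(tokens), n_fields)]
    return pd.DataFrame(rows)

# Made-up sample: two records of 5 values each, spread over several lines
sample_text = "1 0 40\n-9 name\n2 1\n49 -9 name\n"
df_small = records_from_text(sample_text, n_fields=5)
print(df_small.shape)
```

For the real file, one would pass the downloaded text and n_fields=76. As with the generator approach, all values come back as strings.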
