
Extracting data from dataset Online

Dataset link

I have a dataset available as CSV in bupa.data at the link given above, and the attribute information is given in point 7 of the bupa.names file.

https://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/

I am confused about how to combine the two links to create a dataframe, since one contains the header information and the other contains the data in CSV format.

I am comfortable with Python and started with the following code:

import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen

soup_link1 = BeautifulSoup(urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data'))
soup_link2 = BeautifulSoup(urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.names'))
table_data = soup_link1.find('p')
table_header = soup_link2.find('p')

Please help further.

To make the code robust against different attribute names, you can use a regular expression to pull the data out of the files. In your particular case:

import pandas as pd
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen

soup_link1 = BeautifulSoup(urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data'))
soup_link2 = BeautifulSoup(urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.names'))
table_data = soup_link1.find('p')
table_header = soup_link2.find('p')

# Column names are the lowercase words that follow "1. ", "2. ", ... in bupa.names
p = re.compile(r'(?<=\d\.\s)[a-z]+')
columns = p.findall(table_header.text)

# Split rows on newlines and fields on commas; skip the trailing empty line
data = [row.split(',') for row in table_data.text.split('\n') if row]
df = pd.DataFrame(data, columns=columns).apply(pd.to_numeric, errors='ignore')

The big problem with data like this is that everything arrives as a string, so we have to do a lot of converting between strings and floats.
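Much of that conversion can be delegated to pandas: once the column names are extracted, `pd.read_csv` will parse the numeric fields directly. A minimal sketch of the idea, with short inline samples standing in for the downloaded bupa.names / bupa.data text (the values here are illustrative assumptions, not the full dataset):

```python
import re
from io import StringIO

import pandas as pd

# Inline stand-ins for the text fetched from bupa.names / bupa.data
names_text = """7. Attribute information:
   1. mcv
   2. alkphos
   3. sgpt
"""
data_text = "85,92,45\n85,64,59\n"

# Same lookbehind pattern as above: a lowercase word following "N. "
columns = re.findall(r'(?<=\d\.\s)[a-z]+', names_text)

# read_csv converts the numeric fields itself, so no manual str -> float work
df = pd.read_csv(StringIO(data_text), names=columns)
print(df.dtypes)
```

With the real files you would pass the downloaded data text through `StringIO` in the same way; the resulting columns come back as numeric dtypes without a separate `pd.to_numeric` pass.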

You can combine the data from the two URLs into a virtual CSV file (you can then feed the created CSV file to a dataframe or process it in some other way):

import requests
import re
import csv
from io import StringIO # for Python2 use: from StringIO import StringIO

data_1 = requests.get('https://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data').text
data_2 = requests.get('https://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.names').text

row_names = re.findall(r'\d+\.\s+([a-z]+)', data_2)
f = StringIO(','.join('"{}"'.format(v) for v in row_names) + '\n' + data_1)

cr = csv.reader(f, delimiter=',')  # cr is your created csv file

for row in cr:
    print(row)

Outputs:

['mcv', 'alkphos', 'sgpt', 'sgot', 'gammagt', 'drinks', 'selector']
['85', '92', '45', '27', '31', '0.0', '1']
['85', '64', '59', '32', '23', '0.0', '2']
['86', '54', '33', '16', '54', '0.0', '2']
['91', '78', '34', '24', '36', '0.0', '2']
['87', '70', '12', '28', '10', '0.0', '2']
['98', '55', '13', '17', '17', '0.0', '2']
['88', '62', '20', '17', '9', '0.5', '1']
['88', '67', '21', '11', '11', '0.5', '1']
['92', '54', '22', '20', '7', '0.5', '1']

...and so on.
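As mentioned above, the virtual file can also be handed to a dataframe: its first line is a quoted header row, so `pd.read_csv` needs no extra arguments. A sketch with a short inline sample standing in for the downloaded text (if you have already iterated over `cr`, rewind the buffer first with `f.seek(0)`):

```python
from io import StringIO

import pandas as pd

# Inline stand-ins for the row_names / data_1 values built above
row_names = ['mcv', 'alkphos', 'sgpt']
data_1 = "85,92,45\n85,64,59\n"

# Same construction as above: quoted header line + raw CSV body
f = StringIO(','.join('"{}"'.format(v) for v in row_names) + '\n' + data_1)

# The header line labels the columns, so read_csv picks them up directly
df = pd.read_csv(f)
print(df.head())
```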

