
Extracting data from dataset Online

Dataset link

I have a dataset available as CSV in bupa.data at the link given above, and the attribute information is given in point 7 of the bupa.names file.

https://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/

I am confused about how to combine the two links to create a dataframe, since one contains the header information and the other contains the data in CSV format.

I am comfortable with Python and started with the following code:

import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen

soup_link1 = BeautifulSoup(urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data'))
soup_link2 = BeautifulSoup(urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.names'))
table_data = soup_link1.find('p')
table_header = soup_link2.find('p')

Please help further.

To make the code robust against different attribute names, you can use a regular expression to pull the data out of the files. In your particular case:

import pandas as pd
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen

soup_link1 = BeautifulSoup(urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data'))
soup_link2 = BeautifulSoup(urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.names'))
table_data = soup_link1.find('p')
table_header = soup_link2.find('p')

# Column names are the lowercase words that follow "1. ", "2. ", ... in bupa.names
p = re.compile(r'(?<=\d\.\s)[a-z]+')
columns = p.findall(table_header.text)

# Split rows on newlines and fields on commas; skip the trailing empty line
data = [row.split(',') for row in table_data.text.split('\n') if row]
df = pd.DataFrame(data, columns=columns).apply(pd.to_numeric, errors='ignore')

The big problem with data like this is that everything arrives as a string, so we have to do a lot of converting between strings and floats.
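Much of that conversion can be delegated to pandas: once the column names are extracted, `pd.read_csv` will parse the numeric fields directly. A minimal sketch of the idea, with short inline samples standing in for the downloaded bupa.names / bupa.data text (the values here are illustrative assumptions, not the full dataset):

```python
import re
from io import StringIO

import pandas as pd

# Inline stand-ins for the text fetched from bupa.names / bupa.data
names_text = """7. Attribute information:
   1. mcv
   2. alkphos
   3. sgpt
"""
data_text = "85,92,45\n85,64,59\n"

# Same lookbehind pattern as above: a lowercase word following "N. "
columns = re.findall(r'(?<=\d\.\s)[a-z]+', names_text)

# read_csv converts the numeric fields itself, so no manual str -> float work
df = pd.read_csv(StringIO(data_text), names=columns)
print(df.dtypes)
```

With the real files you would pass the downloaded data text through `StringIO` in the same way; the resulting columns come back as numeric dtypes without a separate `pd.to_numeric` pass.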

You can combine the data from the two URLs into a virtual CSV file (you can then feed the created CSV file to a dataframe or process it in some other way):

import requests
import re
import csv
from io import StringIO # for Python2 use: from StringIO import StringIO

data_1 = requests.get('https://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data').text
data_2 = requests.get('https://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.names').text

row_names = re.findall(r'\d+\.\s+([a-z]+)', data_2)
f = StringIO(','.join('"{}"'.format(v) for v in row_names) + '\n' + data_1)

cr = csv.reader(f, delimiter=',')  # cr is your created csv file

for row in cr:
    print(row)

Outputs:

['mcv', 'alkphos', 'sgpt', 'sgot', 'gammagt', 'drinks', 'selector']
['85', '92', '45', '27', '31', '0.0', '1']
['85', '64', '59', '32', '23', '0.0', '2']
['86', '54', '33', '16', '54', '0.0', '2']
['91', '78', '34', '24', '36', '0.0', '2']
['87', '70', '12', '28', '10', '0.0', '2']
['98', '55', '13', '17', '17', '0.0', '2']
['88', '62', '20', '17', '9', '0.5', '1']
['88', '67', '21', '11', '11', '0.5', '1']
['92', '54', '22', '20', '7', '0.5', '1']

...and so on.
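As mentioned above, the virtual file can also be handed to a dataframe: its first line is a quoted header row, so `pd.read_csv` needs no extra arguments. A sketch with a short inline sample standing in for the downloaded text (if you have already iterated over `cr`, rewind the buffer first with `f.seek(0)`):

```python
from io import StringIO

import pandas as pd

# Inline stand-ins for the row_names / data_1 values built above
row_names = ['mcv', 'alkphos', 'sgpt']
data_1 = "85,92,45\n85,64,59\n"

# Same construction as above: quoted header line + raw CSV body
f = StringIO(','.join('"{}"'.format(v) for v in row_names) + '\n' + data_1)

# The header line labels the columns, so read_csv picks them up directly
df = pd.read_csv(f)
print(df.head())
```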

