Collecting data from .txt url in Python with strange encoding

I am trying to use Python to collect datasets from a list of URLs. The data is in an ASCII .txt format. It consistently comes back in an unreadable form like "\xff\xfeO\x00B\x00S\x00", where it should be a set of tab-delimited numbers with a header. As an example, the URL in the code below is one of the simplest pages I'm trying to scrape. The data are from a statistics textbook, and I want to use them to run through the exercises without downloading individual Excel files.

I have tried both requests and urllib/urllib2, but they both return the same data. It seems to be coming in as ISO-8859-1, but attempts to change the encoding to UTF-8, UTF-16, or Latin-1 have all ended up the same. Here is my example code, which at least returns the data structure I'm going for:

import urllib2

url = 'http://wps.aw.com/wps/media/objects/8992/9208383/Data_Sets/Ascii/Chapter1/HTWT1.txt'
raw = urllib2.urlopen(url)

data = []

for row in raw:
    rawData = row.split("\t")
    data.append(rawData)

And my output from this code looks like this:

>>> print(data)
[['\xff\xfeO\x00B\x00S\x00', '\x00X\x00', '\x00Y\x00\r\x00\n'], ['\x001\x00', '\x005\x00', '\x001\x004\x000\x00\r\x00\n'], ['\x002\x00', '\x009\x00', '\x001\x005\x007\x00\r\x00\n'], ['\x003\x00', '\x001\x003\x00', '\x002\x000\x005\x00\r\x00\n'], ['\x004\x00', '\x001\x002\x00', '\x001\x009\x008\x00\r\x00\n'], ['\x005\x00', '\x001\x000\x00', '\x001\x006\x002\x00\r\x00\n'], ['\x006\x00', '\x001\x001\x00', '\x001\x007\x004\x00\r\x00\n'], ['\x007\x00', '\x008\x00', '\x001\x005\x000\x00\r\x00\n'], ['\x008\x00', '\x009\x00', '\x001\x006\x005\x00\r\x00\n'], ['\x009\x00', '\x001\x000\x00', '\x001\x007\x000\x00\r\x00\n'], ['\x001\x000\x00', '\x001\x002\x00', '\x001\x008\x000\x00\r\x00\n'], ['\x001\x001\x00', '\x001\x001\x00', '\x001\x007\x000\x00\r\x00\n'], ['\x001\x002\x00', '\x009\x00', '\x001\x006\x002\x00\r\x00\n'], ['\x001\x003\x00', '\x001\x000\x00', '\x001\x006\x005\x00\r\x00\n'], ['\x001\x004\x00', '\x001\x002\x00', '\x001\x008\x000\x00\r\x00\n'], ['\x001\x005\x00', '\x008\x00', '\x001\x006\x000\x00\r\x00\n'], ['\x001\x006\x00', '\x009\x00', '\x001\x005\x005\x00\r\x00\n'], ['\x001\x007\x00', '\x001\x000\x00', '\x001\x006\x005\x00\r\x00\n'], ['\x001\x008\x00', '\x001\x005\x00', '\x001\x009\x000\x00\r\x00\n'], ['\x001\x009\x00', '\x001\x003\x00', '\x001\x008\x005\x00\r\x00\n'], ['\x002\x000\x00', '\x001\x001\x00', '\x001\x005\x005\x00\r\x00\n'], ['\x00']]

How can I get the data in a usable format? Using curl seems to return the right content format, but I'd prefer to keep things Pythonic as much as possible.

For reference, I'm using Python 2.7.9 out of habit (working on moving to 3), but can use 3 if that makes things easier.

I don't know if this is the best way to do it, but it gets the results you want. If anyone has a better approach, please share it.

Here it is:

import requests

URL = "http://wps.aw.com/wps/media/objects/8992/9208383/Data_Sets/Ascii/Chapter1/HTWT1.txt"

response = requests.get(URL)

data = dict()

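# Python 2 workaround: the payload is really UTF-16, so every other byte is NUL.
# Decoding as ISO-8859-1 and re-encoding as UTF-8 turns the 2-byte marker
# '\xff\xfe' into 4 bytes; the two [2:] slices below drop them.
text = response.content.decode('ISO-8859-1').encode('utf-8').replace('\x00', '').strip()[2:]
for row in text[2:].splitlines()[1:]:  # [1:] skips the header row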
    OBS, x, y = row.split('\t')
    data[int(OBS)] = dict(x=int(x), y=int(y))

print data

Output:

{
    1: {
        'y': 140,
        'x': 5
    },
    2: {
        'y': 157,
        'x': 9
    },
    3: {
        'y': 205,
        'x': 13
    },
    4: {
        'y': 198,
        'x': 12
    },
    5: {
        'y': 162,
        'x': 10
    },
    6: {
        'y': 174,
        'x': 11
    },
    7: {
        'y': 150,
        'x': 8
    },
    8: {
        'y': 165,
        'x': 9
    },
    9: {
        'y': 170,
        'x': 10
    },
    10: {
        'y': 180,
        'x': 12
    },
    11: {
        'y': 170,
        'x': 11
    },
    12: {
        'y': 162,
        'x': 9
    },
    13: {
        'y': 165,
        'x': 10
    },
    14: {
        'y': 180,
        'x': 12
    },
    15: {
        'y': 160,
        'x': 8
    },
    16: {
        'y': 155,
        'x': 9
    },
    17: {
        'y': 165,
        'x': 10
    },
    18: {
        'y': 190,
        'x': 15
    },
    19: {
        'y': 185,
        'x': 13
    },
    20: {
        'y': 155,
        'x': 11
    }
}
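
The leading `\xff\xfe` in the question's output is the UTF-16 little-endian byte order mark, so the interleaved `\x00` bytes are most likely the high bytes of UTF-16 text rather than noise. A minimal sketch under that assumption, written for Python 3, decodes the payload as UTF-16 instead of stripping NULs by hand. The names `parse_table` and `SAMPLE` are hypothetical, and the sample bytes stand in for `response.content` so the snippet runs without network access.

```python
# Sample bytes standing in for response.content: a UTF-16 LE byte order
# mark followed by the first rows of HTWT1.txt as shown in the question.
SAMPLE = b"\xff\xfe" + "OBS\tX\tY\r\n1\t5\t140\r\n2\t9\t157\r\n".encode("utf-16-le")


def parse_table(raw):
    """Decode UTF-16 bytes and return {OBS: {'x': ..., 'y': ...}}."""
    # The 'utf-16' codec reads the byte order mark and drops it.
    text = raw.decode("utf-16")
    data = {}
    # Skip the header row, then split each record on tabs.
    for row in text.splitlines()[1:]:
        if not row.strip():
            continue  # ignore blank trailing lines
        obs, x, y = row.split("\t")
        data[int(obs)] = {"x": int(x), "y": int(y)}
    return data


print(parse_table(SAMPLE))
```

With requests, the equivalent call would be `parse_table(response.content)`. Setting `response.encoding = 'utf-16'` and reading `response.text` should give the same decoded text.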

ADDED:

If you want some code to parse that specific txt format, you can use a more generic script like the one below. You would only need to change the headers list to match the headers of the txt file (without OBS):

import requests

def wrapper(thelist):
    return thelist[0], thelist[1:]

# URL = "http://wps.aw.com/wps/media/objects/8992/9208383/Data_Sets/Ascii/Chapter1/HTWT1.txt"
URL = "http://wps.aw.com/wps/media/objects/8992/9208383/Data_Sets/Ascii/Chapter7/CARS7.txt"

response = requests.get(URL)

data = dict()

# headers = ['X', 'Y']
headers = ['Make', 'Model', 'Time', 'Speed', 'Top', 'Weight', 'HP'] # Must be in order and without OBS

text = response.content.decode('ISO-8859-1').encode('utf-8').replace('\x00', '').strip()[2:]
for row in text[2:].splitlines()[1:]:
    OBS, extras = wrapper(row.split('\t'))
    helper_dict = dict()

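    # Pair each header with the value in the same position; looking the
    # value up with extras.index() would mislabel repeated values in a row.
    for header, extra in zip(headers, extras):
        helper_dict[header] = extra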
    data[int(OBS)] = helper_dict

print data

Output:

{
    1: {
        'Weight': '1335',
        'Make': 'Audi',
        'Time': '8.9',
        'HP': '150',
        'Model': 'TT Roadster',
        'Speed': '133',
        'Top': '0'
    },
    2: {
        'Weight': '1240',
        'Make': 'Mini ',
        'Time': '7.4',
        'HP': '168',
        'Model': 'Cooper S',
        'Speed': '134',
        'Top': '0'
    },
    3: {
        'Weight': '1711',
        'Make': 'Volvo',
        'Time': '7.4',
        'HP': '220',
        'Model': 'C70 T5 Sport',
        'Speed': '150',
        'Top': '0'
    },
    4: {
        'Weight': '1680',
        'Make': 'Saab',
        'Time': '7.9',
        'HP': '247',
        'Model': ' Nine-Three ',
        'Speed': '149',
        'Top': '0'
    },
    5: {
        'Weight': '1825',
        'Make': 'Mercedes-Benz',
        'Time': '6.6',
        'HP': '268',
        'Model': 'SL350',
        'Speed': '155',
        'Top': '0'
    },
    6: {
        'Weight': '1703',
        'Make': 'Jaguar',
        'Time': '6.7',
        'HP': '290',
        'Model': 'XK8',
        'Speed': '154',
        'Top': '0'
    },
    7: {
        'Weight': '1950',
        'Make': 'Bugatti',
        'Time': '2.4',
        'HP': '1000',
        'Model': 'Veyron 16.4',
        'Speed': '253',
        'Top': '1'
    },
    8: {
        'Weight': '875',
        'Make': 'Lotus',
        'Time': '4.9',
        'HP': '189',
        'Model': 'Exige',
        'Speed': '147',
        'Top': '1'
    },
    9: {
        'Weight': '1257',
        'Make': 'BMW',
        'Time': '6.7',
        'HP': '220',
        'Model': 'M3 (E30)',
        'Speed': '144',
        'Top': '1'
    },
    10: {
        'Weight': '1510',
        'Make': 'BMW',
        'Time': '5.9',
        'HP': '231',
        'Model': '330i Sport',
        'Speed': '155',
        'Top': '1'
    },
    11: {
        'Weight': '1350',
        'Make': 'Porsche',
        'Time': '5.3',
        'HP': '291',
        'Model': 'Cayman S',
        'Speed': '171',
        'Top': '1'
    },
    12: {
        'Weight': '1560',
        'Make': 'Nissan',
        'Time': '4.7',
        'HP': '276',
        'Model': 'Skyline GT-R (R34)',
        'Speed': '165',
        'Top': '1'
    },
    13: {
        'Weight': '1270',
        'Make': 'Porsche',
        'Time': '4.7',
        'HP': '300',
        'Model': '911 RS',
        'Speed': '172',
        'Top': '1'
    },
    14: {
        'Weight': '1584',
        'Make': 'Ford',
        'Time': '5',
        'HP': '319',
        'Model': 'Shelby GT',
        'Speed': '150',
        'Top': '1'
    },
    15: {
        'Weight': '1260',
        'Make': 'Mitsubishi',
        'Time': '4.4',
        'HP': '320',
        'Model': 'Evo VII RS Sprint',
        'Speed': '150',
        'Top': '1'
    },
    16: {
        'Weight': '1630',
        'Make': 'Aston Martin',
        'Time': '5.2',
        'HP': '380',
        'Model': 'V8 Vantage',
        'Speed': '175',
        'Top': '1'
    },
    17: {
        'Weight': '1540',
        'Make': 'Mercedes-Benz',
        'Time': '4.8',
        'HP': '355',
        'Model': 'SLK55 AMG',
        'Speed': '155',
        'Top': '1'
    },
    18: {
        'Weight': '1930',
        'Make': 'Maserati',
        'Time': '5.1',
        'HP': '394',
        'Model': 'Quattroporte Sport GT',
        'Speed': '171',
        'Top': '1'
    },
    19: {
        'Weight': '1275',
        'Make': 'Spyker',
        'Time': '4.5',
        'HP': '400',
        'Model': 'C8',
        'Speed': '187',
        'Top': '1'
    },
    20: {
        'Weight': '1161',
        'Make': 'Ferrari',
        'Time': '4.9',
        'HP': '400',
        'Model': '288GTO',
        'Speed': '189',
        'Top': '1'
    },
    21: {
        'Weight': '1130',
        'Make': 'Mosler',
        'Time': '3.9',
        'HP': '435',
        'Model': 'MT900',
        'Speed': '190',
        'Top': '1'
    },
    22: {
        'Weight': '1447',
        'Make': 'Lamborghini',
        'Time': '4.9',
        'HP': '455',
        'Model': 'Countach QV',
        'Speed': '180',
        'Top': '1'
    },
    23: {
        'Weight': '1290',
        'Make': 'Chrysler',
        'Time': '4',
        'HP': '460',
        'Model': 'Viper GTS-R',
        'Speed': '190',
        'Top': '1'
    },
    24: {
        'Weight': '2585',
        'Make': 'Bentley',
        'Time': '5.2',
        'HP': '500',
        'Model': 'Arnage T',
        'Speed': '179',
        'Top': '1'
    },
    25: {
        'Weight': '1350',
        'Make': 'Ferrari',
        'Time': '3.5',
        'HP': '503',
        'Model': '430 Scuderia',
        'Speed': '198',
        'Top': '1'
    },
    26: {
        'Weight': '1247',
        'Make': 'Saleen',
        'Time': '3.3',
        'HP': '550',
        'Model': 'S7',
        'Speed': '240',
        'Top': '1'
    },
    27: {
        'Weight': '1650',
        'Make': 'Lamborghini',
        'Time': '4',
        'HP': '570',
        'Model': 'Murcielago',
        'Speed': '205',
        'Top': '1'
    },
    28: {
        'Weight': '1230',
        'Make': 'Pagani',
        'Time': '3.6',
        'HP': '602',
        'Model': 'Zonda F',
        'Speed': '214',
        'Top': '1'
    },
    29: {
        'Weight': '1140',
        'Make': 'McLaren',
        'Time': '3.2',
        'HP': '627',
        'Model': 'F1',
        'Speed': '240',
        'Top': '1'
    },
    30: {
        'Weight': '1180',
        'Make': 'Koenigsegg ',
        'Time': '3.2',
        'HP': '806',
        'Model': 'CCR',
        'Speed': '242',
        'Top': '1'
    }
}
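
The headers list above has to be edited for every file. One possible alternative, sketched here for Python 3, is to let the standard `csv` module take the column names from the file's own first row. This assumes the payload is UTF-16 (suggested by the `\xff\xfe` marker in the question) and that every row has as many fields as the header. `parse_with_header` is a hypothetical name, and the header names in the sample bytes are assumed to match the list used above; the real file may spell them differently.

```python
import csv
import io

# Sample bytes standing in for response.content; the two rows are copied
# from the CARS7 output above, the header names are an assumption.
SAMPLE = b"\xff\xfe" + (
    "OBS\tMake\tModel\tTime\tSpeed\tTop\tWeight\tHP\r\n"
    "1\tAudi\tTT Roadster\t8.9\t133\t0\t1335\t150\r\n"
    "2\tMini \tCooper S\t7.4\t134\t0\t1240\t168\r\n"
).encode("utf-16-le")


def parse_with_header(raw, key="OBS"):
    """Return {int(OBS): {column: value}} using the file's own header row."""
    text = raw.decode("utf-16")
    # newline="" lets the csv module handle the \r\n line endings itself.
    buffer = io.StringIO(text, newline="")
    reader = csv.DictReader(buffer, delimiter="\t", quoting=csv.QUOTE_NONE)
    data = {}
    for record in reader:
        obs = int(record.pop(key))
        # Strip stray padding such as the trailing space in 'Mini '.
        data[obs] = {name: value.strip() for name, value in record.items()}
    return data


cars = parse_with_header(SAMPLE)
print(cars[2]["Make"])
```

The values stay strings, as in the output above; converting `Time`, `Speed`, and the other numeric columns would be a separate step.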

Try Python 3:

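import urllib3

url = 'http://wps.aw.com/wps/media/objects/8992/9208383/Data_Sets/Ascii/Chapter1/HTWT1.txt'

http_pool = urllib3.connection_from_url(url)
# Submit the request and write the raw bytes to a local file
response = http_pool.urlopen('GET', url)

with open('local.txt', 'wb') as f:
    f.write(response.data)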
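
Writing the response in binary mode saves the bytes unchanged, so the file on disk is presumably still UTF-16. A sketch of reading it back as text under that assumption follows; it writes a small stand-in file to a temporary directory first so that it runs without network access.

```python
import os
import tempfile

# Stand-in for the downloaded file: UTF-16 LE bytes with a byte order mark.
sample = b"\xff\xfe" + "OBS\tX\tY\r\n1\t5\t140\r\n".encode("utf-16-le")

path = os.path.join(tempfile.mkdtemp(), "local.txt")
with open(path, "wb") as f:
    f.write(sample)

# Opening in text mode with encoding='utf-16' consumes the byte order mark
# and, with default newline handling, turns \r\n into \n.
with open(path, encoding="utf-16") as f:
    rows = [line.rstrip("\n").split("\t") for line in f]

print(rows)
```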

Python 2 (untested):

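import urllib2

url = 'http://wps.aw.com/wps/media/objects/8992/9208383/Data_Sets/Ascii/Chapter1/HTWT1.txt'
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()  # raw bytes, not yet decoded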
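
The bytes returned by `response.read()` still need decoding before they can be split into rows. A hedged sketch of one way to choose the codec is to check for a byte order mark at the start of the data. `decode_page` and its `fallback` parameter are hypothetical names; the code uses only the standard `codecs` module and should behave the same on Python 2.7 and Python 3. It does not handle UTF-32, whose little-endian mark begins with the same two bytes.

```python
import codecs


def decode_page(the_page, fallback="latin-1"):
    """Decode raw bytes, using a byte order mark to pick the codec if present."""
    if the_page.startswith(codecs.BOM_UTF8):
        return the_page.decode("utf-8-sig")
    if the_page.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The 'utf-16' codec reads the mark to choose the byte order.
        return the_page.decode("utf-16")
    # No mark found: fall back to a caller-supplied guess.
    return the_page.decode(fallback)


sample = codecs.BOM_UTF16_LE + u"OBS\tX\tY\r\n1\t5\t140\r\n".encode("utf-16-le")
print(decode_page(sample).splitlines())
```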
