
Read in txt file with fixed width columns

I am trying to open the dat.txt file from the following website: http://jse.amstat.org/datasets/04cars.dat.txt

I am not sure which delimiter to use to read it into Python, as it is separated by spaces.

I tried pd.read_csv('http://jse.amstat.org/datasets/04cars.dat.txt', delimiter = 'sp') along with several other things, but nothing seems to work. I also tried:

np.genfromtxt("http://jse.amstat.org/datasets/04cars.dat.txt", delimiter= 'sp')

Note that the zeros and ones each represent a separate column.

Use read_fwf instead of read_csv.

[read_fwf reads] a table of fixed-width formatted lines into DataFrame.
https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html
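As a small illustration of how colspecs is interpreted, here is a sketch on an in-memory sample. The rows, the layout and the column names are made up for the example and are not taken from 04cars.dat.txt. Each (start, end) pair is a half-open interval, so the end position is not included.

```python
import io

import pandas as pd

# Made-up rows for illustration only; they are not taken from 04cars.dat.txt.
rows = [("Small Car 4dr", 0, 1, 12345), ("Big Truck", 1, 0, 23456)]

# Layout: name in characters 0-19, flags at 21 and 23, price in 25-29.
sample = "".join(
    f"{name:<20} {is_suv} {is_wagon} {price:>5}\n"
    for name, is_suv, is_wagon, price in rows
)

# Each tuple is a half-open interval [start, end): the end is exclusive.
colspecs = [(0, 20), (21, 22), (23, 24), (25, 30)]

demo = pd.read_fwf(
    io.StringIO(sample),
    colspecs=colspecs,
    header=None,  # the sample has no header line
    names=["name", "is_suv", "is_wagon", "price"],
)
print(demo)
```

Padding spaces inside a field are stripped, so the name column holds "Small Car 4dr" rather than the full 20-character slice.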

import pandas as pd

colspecs = (
    (0, 44),
    (46, 47),
    (48, 49),
    (50, 51),
    (52, 53),
    (54, 55),
    (56, 57),
    (58, 59),
    (60, 66),
    (67, 73),
    (74, 77),
    (78, 80),
    (81, 84),
    (85, 87),
    (88, 90),
    (91, 95),
    (96, 99),
    (100, 103),
    (104, 106),
)
data_url = "http://jse.amstat.org/datasets/04cars.dat.txt"

df = pd.read_fwf(data_url, colspecs=colspecs)
df.columns = (
    "Vehicle Name",
    "Is Sports Car",
    "Is SUV",
    "Is Wagon",
    "Is Minivan",
    "Is Pickup",
    "Is All-Wheel Drive",
    "Is Rear-Wheel Drive",
    "Suggested Retail Price",
    "Dealer Cost",
    "Engine Size (litres)",
    "Number of Cylinders",
    "Horsepower",
    "City Miles Per Gallon",
    "Highway Miles Per Gallon",
    "Weight (pounds)",
    "Wheel Base (inches)",
    "Length (inches)",
    "Width (inches)",
)

And the output for print(df) would be:

                        Vehicle Name  ...  Width (inches)
0        Chevrolet Aveo LS 4dr hatch  ...              66
1             Chevrolet Cavalier 2dr  ...              69
2             Chevrolet Cavalier 4dr  ...              68
3          Chevrolet Cavalier LS 2dr  ...              69
4                  Dodge Neon SE 4dr  ...              67
..                               ...  ...             ...
422         Nissan Titan King Cab XE  ...               *
423                      Subaru Baja  ...               *
424                    Toyota Tacoma  ...               *
425     Toyota Tundra Regular Cab V6  ...               *
426  Toyota Tundra Access Cab V6 SR5  ...               *

[427 rows x 19 columns]
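Two details may be worth checking against your own copy of the file. By default read_fwf treats the first line as a header, so a file without a header line loses its first record when df.columns is assigned afterwards. The "*" entries in the output also suggest that "*" marks missing values. The sketch below uses made-up rows and hypothetical column names to show header=None, names and na_values handling both cases; whether this matches the real file is an assumption.

```python
import io

import pandas as pd

# Made-up rows for illustration; "*" stands in for a missing value.
rows = [("Car A", 0, "66"), ("Car B", 0, "67"), ("Truck C", 1, "*")]
sample = "".join(f"{name:<12} {flag} {width:>2}\n" for name, flag, width in rows)

colspecs = [(0, 12), (13, 14), (15, 17)]

# Default behaviour: the first line is consumed as the header.
with_default_header = pd.read_fwf(io.StringIO(sample), colspecs=colspecs)

# header=None keeps every line as data; names supplies the labels;
# na_values="*" turns the placeholder into NaN so the column is numeric.
with_names = pd.read_fwf(
    io.StringIO(sample),
    colspecs=colspecs,
    header=None,
    names=["name", "is_pickup", "width"],
    na_values="*",
)

print(len(with_default_header), len(with_names))
print(with_names)
```

With the real data this would mean passing the tuple of column names as names= to read_fwf instead of assigning df.columns after the read.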

Column names and specifications retrieved from here:


Note: Don't forget to specify where each column starts and ends. Without colspecs, pandas infers the column boundaries from the first rows of the file (the first 100 by default, controlled by infer_nrows), which leads to data errors here. Below is an extract of a unified diff between the generated CSV files (with specs and without):

[Image: extract of the unified diff]
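To make the risk concrete, here is a minimal sketch of one way inference can go wrong. It uses a made-up two-row sample, not the real file, and the exact errors in the real data may differ. When every sampled row has a blank at the same position inside a text field, the inferred layout splits that field in two, while explicit colspecs keep it whole.

```python
import io

import pandas as pd

# Made-up rows: both names happen to have a space at character 5.
rows = [("Acura MDX", 1), ("Honda CRV", 0)]
sample = "".join(f"{name:<9}   {flag}\n" for name, flag in rows)

# Inferred layout: any position that is blank in all sampled rows is
# treated as a gap between columns, so the name is split in two.
inferred = pd.read_fwf(io.StringIO(sample), header=None)

# Explicit layout: the name spans characters 0-8, the flag sits at 12.
explicit = pd.read_fwf(
    io.StringIO(sample),
    colspecs=[(0, 9), (12, 13)],
    header=None,
    names=["name", "flag"],
)

print(inferred.shape, explicit.shape)
print(explicit)
```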
