
Reading a variable-whitespace-delimited table in Python

I am trying to read a table that uses a variable number of spaces as its delimiter and also has missing (blank) values. I would like to read the table in Python and produce a CSV file. I have tried the NumPy, pandas, and csv libraries, but the combination of variable spacing and missing data makes the table nearly impossible for me to read. The file I am trying to read is attached here: goo.gl/z7S2Mo

This is what the table looks like

I would really appreciate it if anyone could help me with a solution in Python.

You need your delimiter to be two spaces or more (instead of one space or more), so that single spaces inside a field such as "Common Stock" are not treated as separators. Here's a solution:

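import pandas as pd
# The raw string keeps \s from being read as a string escape;
# engine='python' is required because the separator is a regular expression.
df = pd.read_csv('infotable.txt', sep=r'\s{2,}', header=None, engine='python', thousands=',')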

Result:

>>> print(df.head())
                                0             1          2     3      4   5  \
0  ISHARES MORNINGSTAR MID GROWTH           ETP  464288307  3892  41700  SH   
1   ISHARES S&P MIDCAP 400 GROWTH           ETP  464287606  4700  47600  SH   
2               BED BATH & BEYOND  Common Stock  075896100   870  15000  SH   
3              CARBO CERAMICS INC  Common Stock  140781105   950   7700  SH   
4    CATALYST HEALTH SOLUTIONS IN  Common Stock  14888B103  1313  25250  SH   

      6      7  8  9  
0  Sole  41700  0  0  
1  Sole  47600  0  0  
2  Sole  15000  0  0  
3  Sole   7700  0  0  
4  Sole  25250  0  0  

>>> print(df.dtypes)
0    object
1    object
2    object
3     int64
4     int64
5    object
6    object
7     int64
8     int64
9     int64
dtype: object
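Since the question asks for a CSV file, here is a minimal sketch of the full round trip under the same settings. The three sample rows are hypothetical stand-ins typed in by hand (the thousands separators in them are assumed, suggested only by the thousands=',' argument above), and io.StringIO replaces the real infotable.txt so the snippet runs on its own.

```python
import io
import pandas as pd

# Hypothetical sample rows: fields are separated by two or more spaces,
# while single spaces occur inside fields such as "Common Stock".
sample = (
    "BED BATH & BEYOND             Common Stock  075896100   870  15,000  SH\n"
    "CARBO CERAMICS INC            Common Stock  140781105   950   7,700  SH\n"
    "CATALYST HEALTH SOLUTIONS IN  Common Stock  14888B103  1313  25,250  SH\n"
)

df = pd.read_csv(
    io.StringIO(sample),   # stand-in for the path to infotable.txt
    sep=r"\s{2,}",         # split only on runs of two or more whitespace characters
    header=None,
    engine="python",       # regex separators need the Python parsing engine
    thousands=",",         # parse "15,000" as the integer 15000
)

# Serialize to CSV text; pass a file name instead to write to disk.
csv_text = df.to_csv(index=False, header=False)
print(csv_text)
```

To write a file rather than a string, something like df.to_csv('infotable.csv', index=False, header=False) should work, where the output name is only an example. One caveat: this separator cannot represent an empty cell, because a blank field simply widens the gap between its neighbours and the later columns shift left.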

NumPy has a function that can read fixed-width columns, np.genfromtxt (see the last line):

import numpy as np

path = "<insert file path here>/infotable.txt"

# read off column locations from a text editor.
# I used Notepad++ to do that.
column_locations = np.array([1, 38, 52, 61, 70, 78, 98, 111, 120, 127, 132])

# My text editor starts counting at 1, while numpy starts at 0. Fixing that:
column_locations = column_locations - 1

# Get column widths
widths = column_locations[1:] - column_locations[:-1]

data = np.genfromtxt(path, dtype=None, delimiter=widths, autostrip=True)

Depending on your exact use case, you may use a different method to get the column widths but you get the idea. dtype=None ensures that numpy determines the data types for you; this is very different from leaving out the dtype argument. Finally, autostrip=True strips leading and trailing whitespace.

The output (data) is a structured array.
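As a self-contained illustration of the fixed-width approach, the sketch below builds a tiny hypothetical table in memory (three columns starting at editor positions 1, 12 and 21, with the line ending before position 26) and reads it the same way. The sample rows, the column positions, and the explicit encoding="utf-8" argument are assumptions made for the example; the encoding is passed only so that text fields come back as str on older NumPy versions too.

```python
import io
import numpy as np

# Hypothetical rows; the second one has a blank "type" field.
rows = [("ALPHA INC", "Common", 100), ("BETA & CO", "", 7)]

# Lay the fields out in fixed-width columns of 11, 9 and 5 characters.
sample = "".join(f"{name:<11}{kind:<9}{count:>5}\n" for name, kind, count in rows)

# 1-based column start positions as an editor would show them,
# plus the position just past the end of the line.
column_locations = np.array([1, 12, 21, 26]) - 1   # shift to 0-based

# Width of each column = distance between consecutive start positions.
widths = np.diff(column_locations)

data = np.genfromtxt(
    io.StringIO(sample),   # stand-in for the path to the real file
    dtype=None,            # let NumPy infer a type per column
    delimiter=widths,      # a sequence of ints means fixed-width fields
    autostrip=True,        # trim the padding around each field
    encoding="utf-8",
)

print(data.dtype.names)
print(data["f0"], data["f2"])
```

Because the fields are cut by position rather than by whitespace, the blank field in the second row keeps its own column and the count after it does not shift. The fields of the structured array are accessed by the default names f0, f1 and f2.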
