Right now I am trying to read a table which has a variable whitespace delimiter and is also having missing/blank values. I would like to read the table in python and produce a CSV file. I have tried NumPy, Pandas and CSV libraries, but unfortunately both variable space and missing data together are making it near impossible for me to read the table. The file I am trying to read is attached here: goo.gl/z7S2Mo
Would really appreciate if anyone can help me with a solution in python
You need your delimiter to be two spaces or more (instead of one space or more). Here's a solution:
import pandas as pd
df = pd.read_csv('infotable.txt',sep='\s{2,}',header=None,engine='python',thousands=',')
Result:
>>> print(df.head())
0 1 2 3 4 5 \
0 ISHARES MORNINGSTAR MID GROWTH ETP 464288307 3892 41700 SH
1 ISHARES S&P MIDCAP 400 GROWTH ETP 464287606 4700 47600 SH
2 BED BATH & BEYOND Common Stock 075896100 870 15000 SH
3 CARBO CERAMICS INC Common Stock 140781105 950 7700 SH
4 CATALYST HEALTH SOLUTIONS IN Common Stock 14888B103 1313 25250 SH
6 7 8 9
0 Sole 41700 0 0
1 Sole 47600 0 0
2 Sole 15000 0 0
3 Sole 7700 0 0
4 Sole 25250 0 0
>>> print(df.dtypes)
0 object
1 object
2 object
3 int64
4 int64
5 object
6 object
7 int64
8 int64
9 int64
dtype: object
The numpy module has a function to do just that (see last line):
import numpy as np
path = "<insert file path here>/infotable.txt"
# read off column locations from a text editor.
# I used Notepad++ to do that.
column_locations = np.array([1, 38, 52, 61, 70, 78, 98, 111, 120, 127, 132])
# My text editor starts counting at 1, while numpy starts at 0. Fixing that:
column_locations = column_locations - 1
# Get column widths
widths = column_locations[1:] - column_locations[:-1]
data = np.genfromtxt(path, dtype=None, delimiter=widths, autostrip=True)
Depending on your exact use case, you may use a different method to get the column widths but you get the idea. dtype=None
ensures that numpy determines the data types for you; this is very different from leaving out the dtype
argument. Finally, autostrip=True
strips leading and trailing whitespace.
The output ( data
) is a structured array .
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.