Reading fixed-width text file from zipfiles into Pandas dataframe

I'm trying to read text files into Pandas dataframes from inside a zipped archive. The files are formatted like this:

System Time       hh:mm:ss           PPS     Zsec(sec)         Hex Message

Yr=17  Mn= 3 Dy= 3

19:22:59.894      19:22:16        52         69736        7E 32 02 4F 02 00 0C 7F 97 68 10 01 00 11 03 03 13 16 10 34 00 00 00 05 02 00 80 00 83 B1 7E
19:24:12.130      19:23:10       106         69790        7E 32 02 4F 02 00 0C 7F 97 9E 10 01 00 11 03 03 13 17 0A 6A 00 00 00 05 12 00 BA 00 47 DF 7E
19:24:13.241      19:23:11       107         69791        7E 32 02 4F 02 00 0C 7F 97 9F 10 01 00 11 03 03 13 17 0B 6B 00 00 00 05 05 00 BC 00 F3 AC 7E

If the file is extracted outside the archive, I can read it:

import pandas as pd

data = '../data/test1/heartbeat.txt'
# Raw string so the regex backslash is not treated as a string escape
df = pd.read_csv(data, sep=r'\s{2,}', engine='python', skiprows=4, encoding='utf8',
                 names=['System Time','hh:mm:ss','PPS','Zsec(sec)', 'Hex Message'])

But that approach fails if I try to access it inside the zipfile:

import zipfile

zf = zipfile.ZipFile('../data.zip', 'r')
data = zf.open('data/test1/heartbeat.txt')  # binary file object (yields bytes)
df = pd.read_csv(data, sep=r'\s{2,}', engine='python', skiprows=4, encoding='utf8',
                 names=['System Time','hh:mm:ss','PPS','Zsec(sec)', 'Hex Message'])

I see TypeError: cannot use a string pattern on a bytes-like object

If I use delim_whitespace=True instead of sep='\s{2,}', the file is read, so I seem to be using zipfile correctly. However, the 'Hex Message' column contains single spaces, so it gets split into many columns in the dataframe.

I've also tried fixed-width column reading with read_fwf, which also works with the extracted file:

data = '../data/test1/heartbeat.txt'
widths = [13,14,10,13,100]
df = pd.read_fwf(data,widths=widths,skiprows=4,
                 names = ['System Time', 'hh:mm:ss', 'PPS', 'Zsec(sec)','Hex Message'])

But that also fails when the file is inside the zip archive: TypeError: a bytes-like object is required, not 'str'

I'm not sure how to translate these bytes-like objects from the zipfile into something the Pandas reader can parse.

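This is working for me. zf.open() hands pandas a stream of bytes, while the regex separator needs str, so read the member as bytes, decode it, and wrap the resulting text in io.StringIO: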

import io
import zipfile
import pandas as pd

zf = zipfile.ZipFile('../data.zip', 'r')
# Decode the member's bytes to str and expose it as a text file object
data = io.StringIO(zf.read('data/test1/heartbeat.txt').decode('utf_8'))
df = pd.read_csv(data, sep=r'\s{2,}', engine='python', skiprows=4,
                 names=['System Time','hh:mm:ss','PPS','Zsec(sec)', 'Hex Message'])
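
If you would rather not load the whole member into memory first, a text wrapper around the open member should also work, and the same idea applies to read_fwf. The following is only a sketch: it builds a small in-memory archive so it can run on its own, the sample data is a shortened, hypothetical copy of the layout in the question, and the helper name open_member_as_text is made up for illustration. The widths are the ones from the question.

```python
import io
import zipfile

import pandas as pd

NAMES = ['System Time', 'hh:mm:ss', 'PPS', 'Zsec(sec)', 'Hex Message']
MEMBER = 'data/test1/heartbeat.txt'

# Hypothetical sample: four lines to skip, then two data rows
# (hex messages shortened for brevity).
SAMPLE = (
    "System Time       hh:mm:ss           PPS     Zsec(sec)         Hex Message\n"
    "\n"
    "Yr=17  Mn= 3 Dy= 3\n"
    "\n"
    "19:22:59.894      19:22:16        52         69736        7E 32 02 4F\n"
    "19:24:12.130      19:23:10       106         69790        7E 32 02 50\n"
)

# Build a zip archive in memory so the example does not need '../data.zip'.
archive = io.BytesIO()
with zipfile.ZipFile(archive, 'w') as zf:
    zf.writestr(MEMBER, SAMPLE)
archive.seek(0)


def open_member_as_text(zf, member, encoding='utf-8'):
    """Hypothetical helper: open a zip member as a text (str) stream."""
    return io.TextIOWrapper(zf.open(member), encoding=encoding)


with zipfile.ZipFile(archive) as zf:
    # Regex separator: split only on runs of two or more whitespace characters,
    # so the single spaces inside 'Hex Message' are preserved.
    with open_member_as_text(zf, MEMBER) as text:
        df_csv = pd.read_csv(text, sep=r'\s{2,}', engine='python',
                             skiprows=4, names=NAMES)

    # Fixed-width variant, using the widths from the question.
    with open_member_as_text(zf, MEMBER) as text:
        df_fwf = pd.read_fwf(text, widths=[13, 14, 10, 13, 100],
                             skiprows=4, names=NAMES)

print(df_csv)
print(df_fwf)
```

In both cases pandas receives str instead of bytes, which is what the regex separator and the fixed-width reader expect.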
