
Iteratively reading a (TSV) file into a pandas DataFrame

I have some experimental data which looks like this - http://paste2.org/YzJL4e1b (too long to post here). The blocks, which are separated by field-name lines, are different trials of the same experiment. I would like to read everything into a pandas DataFrame, but bin certain trials together (for instance 0, 1, 6, 7 in one group, and 2, 3, 4, 5 in another). This is because different trials have slightly different conditions, and I would like to analyze the difference in results between those conditions. I have a list of trial numbers for the different conditions from another file.

Currently I am doing this:

tracker_data = pd.DataFrame
tracker_data = tracker_data.from_csv(bhpath+i+'_wmet.tsv', sep='\t', header=4)
tracker_data['GazePointXLeft'] = tracker_data['GazePointXLeft'].astype(np.float64) 

but this of course just reads everything in one go (including the field-name lines). It would be great if I could nest the blocks somehow, which would let me easily access them via numeric indices...

Do you have any ideas how I could best do this?

You should use read_csv rather than from_csv *:

tracker_data = pd.read_csv(bhpath+i+'_wmet.tsv', sep='\t', header=4)

If you want to join a list of DataFrames like this you could use concat:

trackers = (pd.read_csv(bhpath+i+'_wmet.tsv', sep='\t', header=4) for i in range(?))
df = pd.concat(trackers)

* which I think is deprecated.
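If the per-trial frames end up in a list, the 0,1,6,7 / 2,3,4,5 binning described in the question can then be done with one concat per condition group. A minimal sketch, using dummy frames to stand in for the real read_csv results (the group names and trial indices are just illustrative):

```python
import pandas as pd

# Hypothetical stand-ins for the per-trial frames; in practice each
# would come from pd.read_csv(...) as shown above.
trials = [pd.DataFrame({'GazePointXLeft': [i, i + 1]}) for i in range(8)]

# Bin trials by condition (the groupings from the question).
groups = {'condA': [0, 1, 6, 7], 'condB': [2, 3, 4, 5]}
binned = {name: pd.concat([trials[i] for i in idxs], ignore_index=True)
          for name, idxs in groups.items()}
```

ignore_index=True gives each binned group a fresh 0..n index; drop it if you want to keep track of the row positions within each original trial.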

I haven't quite got it working, but I think that's because of how I copy/pasted the data. Try this, let me know if it doesn't work.

Using some inspiration from this question

from io import StringIO
import pandas as pd

pat = "TimeStamp\tGazePointXLeft\tGazePointYLeft\tValidityLeft\tGazePointXRight\tGazePointYRight\tValidityRight\tGazePointX\tGazePointY\tEvent\n"
with open('rec.txt') as infile:
    header, names, tail = infile.read().partition(pat)

names = names.split()  # get rid of the tabs here
all_data = tail.split(pat)
res = [pd.read_csv(StringIO(x), sep='\t', names=names) for x in all_data]

We read in the whole file (so this won't work for huge files) and then partition it on the known line giving the column names. tail is just a string with the rest of the data, so we can split that, again on the names line. There may be a better way than using StringIO, but this should work.
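The partition/split idea can be seen end-to-end on a toy string; this is a sketch with a made-up two-column header standing in for the real one:

```python
from io import StringIO
import pandas as pd

# Toy version of the file: two blocks, each preceded by the header line.
pat = "TimeStamp\tGazePointX\n"
raw = pat + "0\t10\n1\t11\n" + pat + "0\t20\n1\t21\n"

header, names, tail = raw.partition(pat)  # header is everything before the first names line
names = names.split()                     # ['TimeStamp', 'GazePointX']
blocks = tail.split(pat)                  # one string per trial block
res = [pd.read_csv(StringIO(b), sep='\t', names=names) for b in blocks]
# res[0] and res[1] are the two trial blocks as separate DataFrames
```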

I'm not sure how you want to join the separate blocks together, but this leaves them as a list; you can concat from there however you desire.

For larger files you might want to write a generator to read until you hit the column names and write a new file until you hit them again. Then read those in separately using something like Andy's answer.
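Such a generator might look like the following sketch: it walks the file line by line and yields one block of data lines each time it hits the column-name line, so only one block is in memory at a time (the two-column header here is a placeholder for the real one):

```python
from io import StringIO
import pandas as pd

HEADER = "TimeStamp\tGazePointX\n"

def iter_blocks(lines):
    """Yield one list of data lines per block, starting a new block at each header line."""
    block = []
    for line in lines:
        if line == HEADER:
            if block:
                yield block
            block = []
        else:
            block.append(line)
    if block:
        yield block

# Simulate a file with two blocks; a real file object iterates the same way.
fake_file = StringIO(HEADER + "0\t10\n" + HEADER + "0\t20\n1\t21\n")
names = HEADER.split()
frames = [pd.read_csv(StringIO(''.join(b)), sep='\t', names=names)
          for b in iter_blocks(fake_file)]
```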

A separate question is how to work with the multiple blocks. Assuming you've got the list of DataFrames, which I've called res, you can use pandas' concat to join them together into a single DataFrame with a MultiIndex (also see the link Andy posted).

In [122]: df = pd.concat(res, axis=1, keys=['a', 'b', 'c'])  # Use whatever makes sense for the keys

In [123]: df.xs('TimeStamp', level=1, axis=1)
Out[123]: 
     a    b    c
0  NaN  NaN  NaN
1  0.0  0.0  0.0
2  3.3  3.3  3.3
3  6.6  6.6  6.6

I ended up doing it iteratively. Very, very iteratively. Nothing else seemed to work.

pat = 'TimeStamp\tGazePointXLeft\tGazePointYLeft\tValidityLeft\tGazePointXRight\tGazePointYRight\tValidityRight\tGazePointX\tGazePointY\tEvent'
with open(bhpath+fileid+'_wmet.tsv') as infile:
    eye_data = infile.read().split(pat)
    eye_data = [trial.split('\r\n') for trial in eye_data]  # split each block into rows
    for idx, trial in enumerate(eye_data):
        trial = [row.split('\t') for row in trial]  # split each row into fields
        eye_data[idx] = trial
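From that nested-list shape, each trial can still be turned into its own DataFrame afterwards. A sketch, with a hypothetical two-column example standing in for the real rows:

```python
import pandas as pd

# Hypothetical nested result in the shape the loop above produces:
# a list of trials, each a list of rows, each row a list of string fields.
names = ['TimeStamp', 'GazePointX']
eye_data = [
    [['0', '10'], ['1', '11']],
    [['0', '20'], ['1', '21']],
]

# Each trial becomes its own DataFrame, accessible by trial number,
# with the string fields converted to numbers.
trials = [pd.DataFrame(rows, columns=names).apply(pd.to_numeric)
          for rows in eye_data]
```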
