
pandas read_csv - Skipping every other row starting from a certain row

I have a huge .csv file (over 1 million rows) that I am trying to parse using the pandas read_csv function. The file is very large because it is measurement data from a sensor with a very high sampling rate, and I want to take downsampled segments from it. I tried implementing this with a lambda function and the skiprows and nrows parameters, but my code currently just reads the same segment over and over again.

segment_amt = 20 # How many segments we want from an individual measurement file
segment_length = 5 # Segment length in seconds
segment_length_idx = fs * segment_length # Segment length in indices
segment_skip_length = 10 # How many seconds between segments
segment_skip_idx = fs * segment_skip_length # The amount of indices to skip between each segment
downsampling = 2 # Factor of downsampling

idx = start_idx
for i in range(segment_amt):

    cond = lambda x: (x+idx) % downsampling != 0
    data = pd.read_csv(filename, skiprows=cond, nrows = segment_length_idx/downsampling,
           usecols=[z_component_idx],names=["z"],engine='python')
    M1_df = M1_df.append(data.T)
    idx += segment_skip_idx

The result is that every segment contains identical data. I assume the behaviour is due to the lambda function, but I don't know how to fix it so that the starting row changes each time based on idx (which is what I thought it would do currently).

Your cond lambda is wrong. You want to skip rows if x < idx or x % downsampling != 0. Just write it that way:

cond = lambda x: x < idx or x % downsampling != 0

But you should also consider passing header=None so that the first line of each segment is not interpreted as a header.
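For completeness, here is a minimal sketch of the corrected loop. The question does not show the values of fs, start_idx, z_component_idx or filename, so the ones below are placeholders. The sketch also makes nrows an integer (read_csv requires an int there) and collects the segments with pd.concat, since DataFrame.append was deprecated and later removed in pandas 2.0:

import pandas as pd

fs = 1000            # sampling rate in Hz (placeholder; not given in the question)
start_idx = 0        # row index where the first segment starts (placeholder)
z_component_idx = 2  # column index of the z component (placeholder)
filename = "measurement.csv"  # placeholder path

segment_amt = 20                             # how many segments to take
segment_length = 5                           # segment length in seconds
segment_length_idx = fs * segment_length     # segment length in rows
segment_skip_length = 10                     # seconds between segment starts
segment_skip_idx = fs * segment_skip_length  # rows between segment starts
downsampling = 2                             # downsampling factor

segments = []
idx = start_idx
for i in range(segment_amt):
    # Skip every row before the current start row, and within the
    # segment keep only every `downsampling`-th row.
    cond = lambda x: x < idx or x % downsampling != 0
    data = pd.read_csv(filename,
                       skiprows=cond,
                       nrows=segment_length_idx // downsampling,  # must be an int
                       usecols=[z_component_idx],
                       names=["z"],
                       header=None,
                       engine='python')  # as in the original code
    segments.append(data.T)
    idx += segment_skip_idx

M1_df = pd.concat(segments)

Note that each read_csv call still scans the file from the beginning in order to skip rows, so if performance matters it may be faster to read the file once and slice the segments out of the resulting DataFrame.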
