Skip multiple rows using pandas.read_csv

Question

I am reading a large CSV file in chunks because I don't have enough memory to load it all at once. I would like to read the first 10 rows (rows 0 to 9), skip the next 10 rows (10 to 19), read the next 10 rows (20 to 29), skip the next 10 (30 to 39), read rows 40 to 49, and so on. This is the code I am using:

# initializing n1 and n2
n1 = 1
n2 = 2
# reading data in chunks
for chunk in pd.read_csv('../input/train.csv', chunksize=10, dtype=dtypes, skiprows=list(range((n1 * 10) + 1, (n2 * 10) + 1))):
    sample_chunk = chunk
    # displaying the sample_chunk
    print(sample_chunk)
    # incrementing n1
    n1 = n1 + 2
    # incrementing n2
    n2 = n2 + 2

However, the code does not work the way I intended. It only skips rows 10 to 19: it reads rows 0 to 9, skips 10 to 19, reads 20 to 29, then also reads 30 to 39, 40 to 49, and every row after that. Please help me identify what I am doing wrong.

Answer 1

With your approach, you need to define all of the rows to skip at the time you call pd.read_csv, which you can do like this:

# +1 because file line 0 is the header, so data row k sits on file line k + 1
rowskips = [i + 1 for x in range(1, lengthOfFile // 10, 2) for i in range(x * 10, (x + 1) * 10)]

where lengthOfFile is the number of data rows in the file.

Then for pd.read_csv

pd.read_csv('../input/train.csv',chunksize=10, dtype=dtypes,skiprows=rowskips)
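
As a rough end-to-end check of this approach, the sketch below runs it on a small in-memory CSV instead of train.csv. The column name value, the 50-row size, and the variable names are made up for illustration. It assumes the file has a single header line, so each data row index is shifted by one to get its file line number.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv: one header line plus data rows 0..49.
n_rows = 50
csv_text = "value\n" + "\n".join(str(i) for i in range(n_rows))

# File line 0 is the header, so data row k sits on file line k + 1.
# Skip data rows 10-19, 30-39, ... by shifting each index by one.
rowskips = [i + 1
            for x in range(1, n_rows // 10, 2)
            for i in range(x * 10, (x + 1) * 10)]

# The skip list is fixed once, when read_csv is called.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=10, skiprows=rowskips)
kept = [chunk["value"].tolist() for chunk in reader]

for values in kept:
    print(values[0], "to", values[-1])
```

Each chunk should then hold one of the blocks that was meant to be read, with the skipped blocks never reaching the DataFrame.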

From the documentation :

skiprows : list-like, int or callable, optional

    Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

    If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

So you can pass a list, an int, or a callable:

int -> skips that many lines at the start of the file
list -> skips the line numbers given in the list
callable -> is called with each line number and the line is skipped if it returns True

You were passing a list, which is evaluated once, when pd.read_csv is called. Updating n1 and n2 inside the loop has no effect on it afterwards. Another way would be to pass a callable, lambda x: x in rowskips, which is evaluated against each row index to decide whether that row is skipped.
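
A callable can also compute the pattern directly, so the file length does not need to be known in advance. The sketch below is one possible way to write it, not the only one. The function name skip_odd_blocks, the block parameter, and the in-memory CSV are hypothetical, and it assumes a single header line on file line 0.

```python
import io
import pandas as pd

def skip_odd_blocks(line_number, block=10):
    """Return True for file lines in the 2nd, 4th, 6th, ... block of data rows."""
    if line_number == 0:
        return False                  # always keep the header line
    data_row = line_number - 1        # 0-based index of the data row
    return (data_row // block) % 2 == 1

# Hypothetical stand-in for train.csv: one header line plus data rows 0..44.
csv_text = "value\n" + "\n".join(str(i) for i in range(45))

reader = pd.read_csv(io.StringIO(csv_text), chunksize=10, skiprows=skip_odd_blocks)
kept = [chunk["value"].tolist() for chunk in reader]

for values in kept:
    print(values[0], "to", values[-1])
```

Compared with lambda x: x in rowskips, this avoids building the list at all and avoids a linear list lookup for every line.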

Answer 2

code:

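ro = list(range(0, lengthOfFile + 10, 10))
# j + 1 shifts each data row index past the header line (file line 0);
# stopping at len(ro) - 1 keeps ro[i + 1] inside the list
d = [j + 1 for i in range(1, len(ro) - 1, 2) for j in range(ro[i], ro[i + 1])]
# print(ro)
print(d)

pd.read_csv('../input/train.csv', chunksize=10, dtype=dtypes, skiprows=d)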

For example, with lengthOfFile = 100 and without the +1 header offset:

lengthOfFile = 100
ro = list(range(0, lengthOfFile + 10, 10))
d = [j for i in range(1, len(ro) - 1, 2) for j in range(ro[i], ro[i + 1])]
print(d)

output: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
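
The same list can be built without the intermediate ro list. The helper below is a hypothetical sketch (the name rows_to_skip and its parameters are not from the answers) that also copes with row counts that are not a multiple of 10 and takes the header offset as an argument.

```python
def rows_to_skip(n_rows, block=10, header_lines=0):
    """List the line numbers of every second block of `block` rows.

    header_lines is added to each index so the result can be passed as
    skiprows for a file that starts with that many header lines.
    """
    skip = []
    # Blocks to skip start at row 10, 30, 50, ... for block=10.
    for start in range(block, n_rows, 2 * block):
        stop = min(start + block, n_rows)   # do not run past the last row
        skip.extend(range(start + header_lines, stop + header_lines))
    return skip

print(rows_to_skip(100))                   # same list as the output above
print(rows_to_skip(35))                    # last block is cut short at row 34
print(rows_to_skip(25, header_lines=1))    # shifted by one for a header line
```

For a file with one header line, rows_to_skip(lengthOfFile, header_lines=1) would play the role of d in the read_csv call.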

2 answers

solution1
1 2019-02-19 11:43:52

solution2
1 ACCEPTED 2019-02-19 12:00:54
