Pandas read_csv skiprows with conditional statements

I have a bunch of txt files that I need to compile into a single master file. I use read_csv to extract the information inside. Some rows need to be dropped, and I was wondering whether the skiprows parameter can be used without specifying the index numbers of the rows to drop, selecting them instead by their content/value. Here is what the data looks like, to illustrate my point.

Index     Column 1          Column 2
0         Rows to drop      Rows to drop
1         Rows to drop      Rows to drop
2         Rows to drop      Rows to drop
3         Rows to keep      Rows to keep
4         Rows to keep      Rows to keep
5         Rows to keep      Rows to keep
6         Rows to keep      Rows to keep
7         Rows to drop      Rows to drop
8         Rows to drop      Rows to drop
9         Rows to keep      Rows to keep
10        Rows to drop      Rows to drop
11        Rows to keep      Rows to keep
12        Rows to keep      Rows to keep
13        Rows to drop      Rows to drop
14        Rows to drop      Rows to drop
15        Rows to drop      Rows to drop

What is the most effective way to do this?

Is this what you want to achieve?

import pandas as pd
df = pd.DataFrame({'A':['row 1','row 2','drop row','row 4','row 5',
                        'drop row','row 6','row 7','drop row','row 9']})

df1 = df[df['A']!='drop row']

print(df)
print(df1)

Original DataFrame:

          A
0     row 1
1     row 2
2  drop row
3     row 4
4     row 5
5  drop row
6     row 6
7     row 7
8  drop row
9     row 9

New DataFrame with rows dropped:

       A
0  row 1
1  row 2
3  row 4
4  row 5
6  row 6
7  row 7
9  row 9
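The same boolean-mask idea extends to several marker values or to partial matches. The following is a minimal sketch; the list `unwanted` and the sample values are made up for illustration and are not from the question's data.

```python
import pandas as pd

df = pd.DataFrame({'A': ['row 1', 'drop row', 'row 3', 'skip me', 'row 5']})

# Hypothetical set of values that mark unwanted rows
unwanted = ['drop row', 'skip me']

# ~ negates the mask, so only rows whose value is NOT in `unwanted` are kept
kept = df[~df['A'].isin(unwanted)]

# Partial match: keep rows whose text starts with "row"
kept_prefix = df[df['A'].str.startswith('row')]

# reset_index(drop=True) renumbers the surviving rows from 0
kept = kept.reset_index(drop=True)
print(kept)
```

Note that without reset_index the filtered frame keeps the original index labels, as in the output above.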

While you cannot skip rows based on content, you can skip rows based on index. Here are some options for you:

skip the first n rows:

df = pd.read_csv('xyz.csv', skiprows=2)
#this will skip 2 rows from the top

skip specific rows:

df = pd.read_csv('xyz.csv', skiprows=[0,2,5])
#this will skip rows 1, 3, and 6 from the top
#remember row 0 is the 1st line

skip every nth row in the file:

#you can also skip rows with a callable that receives the row index.
#In the example below, skip row 0 and every 5th row after it

def check_row(a):
    return a % 5 == 0

df = pd.read_csv('xyz.txt', skiprows=check_row)

More details of this can be found in this link about skip rows

No. skiprows will not allow you to drop based on the row content/value.

According to the pandas documentation:

skiprows : list-like, int or callable, optional
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
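Because the callable only ever sees the row index, one possible workaround is a two-pass read: scan the file once to collect the line numbers whose content marks them for removal, then hand those indices to skiprows. This is only a sketch. The in-memory text `raw` stands in for a real file, and the rule "line starts with drop" is an assumed criterion. It also assumes one record per physical line (no quoted fields containing newlines).

```python
import io
import pandas as pd

# Hypothetical file content; in practice you would open the real txt file
raw = "col1,col2\ndrop,drop\nkeep,1\nkeep,2\ndrop,drop\nkeep,3\n"

# Pass 1: collect the 0-indexed line numbers to skip, based on line content
skip = {i for i, line in enumerate(io.StringIO(raw)) if line.startswith("drop")}

# Pass 2: the callable receives each line index and returns True to skip it
df = pd.read_csv(io.StringIO(raw), skiprows=lambda i: i in skip)
print(df)
```

Line 0 (the header) is not in the set, so it is still used for the column names.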

Since you cannot do that with skiprows, the most efficient approach I can think of is to read the whole file and then filter it:

df = pd.read_csv(filePath)

df = df.loc[df['column1']=="Rows to keep"]
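An alternative, if the unwanted rows would otherwise disturb parsing (for example text in a numeric column), is to filter the raw lines before pandas sees them and then concatenate the per-file results into the master frame. The sketch below is illustrative only: `read_filtered` and `keep` are hypothetical helpers, and the two StringIO objects stand in for real txt files.

```python
import io
import pandas as pd

def read_filtered(source, keep_line):
    """Parse only the lines of `source` for which keep_line(line) is true."""
    lines = [line for line in source if keep_line(line)]
    return pd.read_csv(io.StringIO("".join(lines)))

# Hypothetical in-memory stand-ins for two txt files
file_a = io.StringIO("column1,column2\nRows to drop,x\nRows to keep,1\n")
file_b = io.StringIO("column1,column2\nRows to keep,2\nRows to drop,y\n")

# Assumed rule: a line is unwanted if it starts with the marker text
def keep(line):
    return not line.startswith("Rows to drop")

# Read each source with the filter applied, then stack into one master frame
master = pd.concat(
    [read_filtered(f, keep) for f in (file_a, file_b)],
    ignore_index=True,
)
print(master)
```

With real files, each source would be an open file handle, for instance obtained by looping over glob.glob('*.txt').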
