Pandas: read_csv skipping lines that start with a certain string

I have a Pandas DataFrame like this:

name     timestamp         value1  state         value2
Cs01  1.514483e+09         19.516      0  9.999954e-01
Cs02  1.514483e+09         20.055      0  9.999363e-01
Cs03  1.514483e+09         20.054      0  9.999970e-01
Cs01  1.514483e+09         20.055      0  9.999949e-01
Cs01  1.514483e+09         10.907      0  9.963121e-01
Cs02  1.514483e+09         10.092      0  1.548312e-02

[6 rows x 5 columns]

Is it possible, with the read_csv function, to skip all the rows that do not start with the name "Cs01"?

Thank you

The simplest approach is to read everything and then filter the rows:

import pandas as pd

# the sample data looks whitespace-separated, hence delim_whitespace=True (an assumption about the file)
df = pd.read_csv('file.csv', delim_whitespace=True)
df = df[df['name'].str.startswith('Cs01')]
print(df)
   name     timestamp  value1  state    value2
0  Cs01  1.514483e+09  19.516      0  0.999995
3  Cs01  1.514483e+09  20.055      0  0.999995
4  Cs01  1.514483e+09  10.907      0  0.996312
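
Note that str.startswith keeps any name that merely begins with Cs01 (a hypothetical Cs011 would also pass); if you want exact matches only, plain equality is enough. A minimal sketch under the same assumptions:

import pandas as pd

df = pd.read_csv('file.csv', delim_whitespace=True)
df = df[df['name'] == 'Cs01']   # exact match instead of prefix match
print(df)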

Another solution is to collect, in a preprocessing pass, the positions of all rows that do not start with Cs01, and pass them to read_csv via the skiprows parameter:

# collect the indices of all lines that do not start with 'Cs01'
with open('file.csv') as f:
    exclude = [i for i, line in enumerate(f) if not line.startswith('Cs01')]
print(exclude)
[0, 2, 3, 6]

# exclude[0] is the header line, so slice it off to keep the header
df = pd.read_csv('file.csv', skiprows=exclude[1:], delim_whitespace=True)
print(df)
   name     timestamp  value1  state    value2
0  Cs01  1.514483e+09  19.516      0  0.999995
1  Cs01  1.514483e+09  20.055      0  0.999995
2  Cs01  1.514483e+09  10.907      0  0.996312
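
If you would rather not pass over the file twice (once to collect indices, once in read_csv), here is a sketch of a variant that pre-filters the lines in memory and hands only the kept ones to pandas, assuming the same whitespace-separated file.csv:

import io
import pandas as pd

with open('file.csv') as f:
    header = next(f)                                   # keep the header line
    kept = [line for line in f if line.startswith('Cs01')]

df = pd.read_csv(io.StringIO(header + ''.join(kept)), delim_whitespace=True)
print(df)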

One method would be to read the file in chunks and filter the unwanted lines out of each chunk. This is likely to be faster if you have a large file with a lot of unwanted rows, since reading in the entire DataFrame and then filtering may be non-performant:

In[17]:
import io
import pandas as pd

t = """name     timestamp         value1  state         value2
Cs01  1.514483e+09         19.516      0  9.999954e-01   
Cs02  1.514483e+09         20.055      0  9.999363e-01   
Cs03  1.514483e+09         20.054      0  9.999970e-01   
Cs01  1.514483e+09         20.055      0  9.999949e-01   
Cs01  1.514483e+09         10.907      0  9.963121e-01   
Cs02  1.514483e+09         10.092      0  1.548312e-02"""
d = pd.read_csv(io.StringIO(t), delim_whitespace=True, chunksize=2)
dfs = pd.concat([x[x['name'].str.startswith('Cs01')] for x in d])
dfs

Out[17]: 
   name     timestamp  value1  state    value2
0  Cs01  1.514483e+09  19.516      0  0.999995
3  Cs01  1.514483e+09  20.055      0  0.999995
4  Cs01  1.514483e+09  10.907      0  0.996312

Here the chunksize parameter specifies the number of rows to read per chunk; you can set it to an arbitrary size. You then filter each chunk in a list comprehension and call concat to produce a single DataFrame.
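
The same pattern works for a file on disk; a minimal sketch, assuming the whitespace-separated file.csv from above and an arbitrary chunk size of 10000 rows:

import pandas as pd

chunks = pd.read_csv('file.csv', delim_whitespace=True, chunksize=10000)
df = pd.concat(chunk[chunk['name'].str.startswith('Cs01')] for chunk in chunks)
print(df)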
