
Pandas: skip lines containing a certain string when reading a file

I have a big text file (300,000,000 rows), but it is full of undesired data which I would like to remove. Those data are the ones containing the string "0000e".

I tried:

import pandas as pd

f = pd.read_csv('File.txt', skiprows=139, header=None, index_col=False)
f = f.iloc[:, 0]                                 # keep only the first column
f1 = f[f.str.contains("0000e") == False]         # drop rows containing "0000e"

and

f = pd.read_csv('file.txt', skiprows=139, header=None, index_col=False, chunksize=50)
dfs = pd.concat([x[x[0].str.endswith('000e') == False] for x in f])

but it is rather slow. Is there a faster way to skip lines containing a certain string? Perhaps with na_values?

I prefer your first attempt as it is definitely more readable, on top of the fact that your second attempt has x's and I don't know what they refer to.

That said, using memory_map=True will boost performance as noted in the docs. You can also gain an extra advantage by removing the second line and accessing the column on the same line you create the df. Lastly, replacing the check ...==False with ~... may provide some benefit, as ~ is a logical not, but you need to filter out all the NaN values or you will get an error. Luckily, Series.str.contains accepts an na argument whose value is used as the result for NaN entries.

import pandas as pd

df = pd.read_csv('File.txt', memory_map=True, header=None, index_col=False).iloc[:, 0]
df1 = df[~df.str.contains("0000e", na=False)]
# if you also want to drop NaN rows, use the statement below instead
df1 = df[~df.str.contains("0000e", na=False)].dropna()

Alternatively, doing this using csv is much faster, even if you decide to load it into pandas afterwards. I don't know what your data looks like, but I tested this with a csv file containing 3 columns and 100 rows and got roughly 9x better performance. That probably won't carry over exactly to your results, but this is definitely the method I would choose if I were you.

import pandas as pd
from csv import reader

needle = '0000e'  # bind the string once so we aren't making a new one every iteration
with open('File.txt', 'r') as f:
    df = pd.DataFrame(first for first, *_ in reader(f) if needle not in first)
    # if you also want to skip NaN/empty rows:
    # ...(first for first, *_ in reader(f) if first and needle not in first)
    # note this also skips empty strings; use `if first is not None` to skip only NaN values
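
If you need more than just the first column, a minimal variant of the same idea (assuming the filter string should still only be checked against the first field of each row) could look like this:

import pandas as pd
from csv import reader

needle = '0000e'
with open('File.txt', 'r') as f:
    # keep whole rows whose first field does not contain the filter string
    rows = [row for row in reader(f) if row and needle not in row[0]]
df = pd.DataFrame(rows)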

If you have access to a Linux or macOS system, you can do this in a pre-processing step that is probably much faster with grep -v, which returns all lines that do not match:

grep -v 0000e File.txt > small_file.txt

On Windows (I think) it's findstr /v:

findstr /v 0000e File.txt > small_file.txt
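
Either way, you can then load the pre-filtered file as usual, for example:

import pandas as pd

# small_file.txt no longer contains the unwanted lines, so no filtering is needed here
df = pd.read_csv('small_file.txt', header=None, index_col=False)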

You can call the OS command from inside your Python code, see here.
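
A minimal sketch using the standard-library subprocess module (the file names are just the ones used above):

import subprocess

# run grep -v and write the surviving lines to small_file.txt
with open('small_file.txt', 'w') as out:
    subprocess.run(['grep', '-v', '0000e', 'File.txt'], stdout=out)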

And if you want to make it able to handle multiple OSes, see here.
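
A rough sketch of picking the command per platform (assuming grep on Linux/macOS and findstr on Windows):

import subprocess
import sys

# choose the external filtering command based on the current platform
if sys.platform.startswith('win'):
    cmd = ['findstr', '/v', '0000e', 'File.txt']
else:
    cmd = ['grep', '-v', '0000e', 'File.txt']

with open('small_file.txt', 'w') as out:
    subprocess.run(cmd, stdout=out)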
