简体   繁体   中英

Python how to filter string based on substring

Im new to Python coming from Java world.

  1. I'm trying to write a simple python function that prints out only the data rows of a CSV or "arff" file. The non data rows begin with these 3 patterns @ , [@ , [%, and such rows should not be printed.

  2. Example data file snippet:

     % 1. Title: Iris Plants Database % % 2. Sources: % (a) Creator: RA Fisher % (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov) % (c) Date: July, 1988 @RELATION iris @ATTRIBUTE sepallength REAL @ATTRIBUTE sepalwidth REAL @ATTRIBUTE petallength REAL @ATTRIBUTE petalwidth REAL @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica} @DATA 5.1,3.5,1.4,0.2,Iris-setosa 4.9,3.0,1.4,0.2,Iris-setosa 4.7,3.2,1.3,0.2,Iris-setosa 4.6,3.1,1.5,0.2,Iris-setosa 5.0,3.6,1.4,0.2,Iris-setosa 5.4,3.9,1.7,0.4,Iris-setosa 

Python script:

import csv
def loadCSVfile (path):
    csvData = open(path, 'rb') 
    spamreader = csv.reader(csvData, delimiter=',', quotechar='|')
    for row in spamreader:
        if row.__len__ > 0:
            #search the string from index 0 to 2 and if these substrings(@ ,'[\'%' , '[\'@') are not found, than print the row
            if (str(row).find('@',0,1) & str(row).find('[\'%',0,2) & str(row).find('[\'@',0,2) != 1):
                print str(row)

loadCSVfile('C:/Users/anaim/Desktop/Data Mining/OneR/iris.arff')

actual output:

['% 1. Title: Iris Plants Database']
['% ']
['% 2. Sources:']
['%      (a) Creator: R.A. Fisher']
['%      (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)']
['%      (c) Date: July', ' 1988']
['% ']
[]
['@RELATION iris']
[]
['@ATTRIBUTE sepallength\tREAL']
['@ATTRIBUTE sepalwidth \tREAL']
['@ATTRIBUTE petallength \tREAL']
['@ATTRIBUTE petalwidth\tREAL']
['@ATTRIBUTE class \t{Iris-setosa', 'Iris-versicolor', 'Iris-virginica}']
[]
['@DATA']
['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']
['4.7', '3.2', '1.3', '0.2', 'Iris-setosa']
['4.6', '3.1', '1.5', '0.2', 'Iris-setosa']
['5.0', '3.6', '1.4', '0.2', 'Iris-setosa']
['5.4', '3.9', '1.7', '0.4', 'Iris-setosa']
['4.6', '3.4', '1.4', '0.3', 'Iris-setosa']
['5.0', '3.4', '1.5', '0.2', 'Iris-setosa']

Desired output:

['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']
['4.7', '3.2', '1.3', '0.2', 'Iris-setosa']
['4.6', '3.1', '1.5', '0.2', 'Iris-setosa']
['5.0', '3.6', '1.4', '0.2', 'Iris-setosa']
['5.4', '3.9', '1.7', '0.4', 'Iris-setosa']
['4.6', '3.4', '1.4', '0.3', 'Iris-setosa']
['5.0', '3.4', '1.5', '0.2', 'Iris-setosa']

To test if a row was empty, just use it in a boolean context; empty lists are false.

To test if a string starts with some specific characters, use str.startswith() , which can take either a single string or a tuple of strings:

import csv
def loadCSVfile (path):
    with open(path, 'rb') as csvData:
        spamreader = csv.reader(csvData, delimiter=',', quotechar='|')
        for row in spamreader:
            if row and not row[0].startswith(('%', '@')):
                print row

Because you are really testing for fixed-width character strings, you can also just slice the first column and test with in against a sequence; a set would be most efficient:

def loadCSVfile (path):
    ignore = {'@', '%'}
    with open(path, 'rb') as csvData:
        spamreader = csv.reader(csvData, delimiter=',', quotechar='|')
        for row in spamreader:
            if row and not row[0][:1] in ignore:
                print row

Here the [:1] slice notation returns the first character of the row[0] column (or an empty string if that first column is empty).

I used the open file object as a context manager ( with ... as ... ) so that Python automatically closes the file for us when the code block is done (or an exception is raised).

You should never call double-underscore methods ("dunder" methods, or special methods) directly, the proper API call would be len(row) instead.

Demo:

>>> loadCSVfile('/tmp/iris.arff')
['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']
['4.7', '3.2', '1.3', '0.2', 'Iris-setosa']
['4.6', '3.1', '1.5', '0.2', 'Iris-setosa']
['5.0', '3.6', '1.4', '0.2', 'Iris-setosa']
['5.4', '3.9', '1.7', '0.4', 'Iris-setosa']

I would take advantage of the in operator and of Python list comprehension.

Here is what I mean:

import csv

def loadCSVfile (path):
    exclusions = ['@', '%', '\n', '[@' , '[%']
    csvData = open(path, 'r')
    spamreader = csv.reader(csvData, delimiter=',', quotechar='|')      

    lines = [line for line in spamreader if ( line and line[0][0:1] not in exclusions and line[0][0:2] not in exclusions )]

    for line in lines:
        print(line)


loadCSVfile('C:/Users/anaim/Desktop/Data Mining/OneR/iris.arff')

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM