
Eliminate text after a certain character in a Python pipeline, with a slice?

This is a short script I've written to refine and validate a large dataset that I have.

# The purpose of this script is the refinement of the job data attained from the
# JSI as it is rendered by the `csv generator` contributed by Luis for purposes
# of presentation on the dashboard map. 

import csv

# The number of columns
num_headers = 9

# Remove invalid characters from records
def url_escaper(data):
  for line in data:
    yield line.replace('&','&amp;')

# Be sure to configure input & output files
with open("adzuna_input_THRESHOLD.csv", 'r') as file_in, open("adzuna_output_GO.csv", 'w') as file_out:
    csv_in = csv.reader( url_escaper( file_in ) )
    csv_out = csv.writer(file_out)

    # Get rid of rows that have the wrong number of columns
    # and rows that have only whitespace for a columnar value
    for i, row in enumerate(csv_in, start=1):
        if not [e for e in row if not e.strip()]:
            if len(row) == num_headers:
                csv_out.writerow(row)
        else:
            print("line %d is malformed" % i)

I have one field that is structured like so:

finance|statistics|lisp

I want to keep only the text before the first | (here, finance) and drop the rest. I've seen ways to do this using other utilities like R, but ideally I want to achieve the same effect within this Python code.

Maybe I can iterate over the characters of each column value, perhaps as a list, and when I see a |, discard it and all the text that follows it within that column value.

I'm fairly sure it can be achieved with slices, as they do here, but I don't quite understand how slice indices work, and I can't see how to fit this step cleanly into the current script's pipeline.

With a regex, I guess it would be something like this:

(?:|)(.*) 

Why not use the string's split method?

In[4]: 'finance|statistics|lisp'.split('|')[0]
Out[4]: 'finance'

split also does not raise an exception when the separator character is absent from the string:

In[5]: 'finance/statistics/lisp'.split('|')[0]
Out[5]: 'finance/statistics/lisp'
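
To fit this into the script's pipeline, one option is another generator stage wrapped around the csv reader, in the same way url_escaper wraps the file. The following is only a sketch under assumptions: the question does not say which column holds the pipe-delimited field, so TAG_COLUMN is a hypothetical index, and the helper names (keep_before, keep_before_slice, truncate_column) and the sample rows are invented for illustration. It is written for Python 3. A slice-based version is included for comparison, since the question asks about slices.

```python
import csv
import io

# Hypothetical: zero-based index of the pipe-delimited column.
# The question does not say which column this is.
TAG_COLUMN = 2

def keep_before(value, sep='|'):
    """Return the text before the first `sep`; unchanged if `sep` is absent."""
    return value.split(sep, 1)[0]

def keep_before_slice(value, sep='|'):
    """Same result using find() and a slice."""
    idx = value.find(sep)  # find() returns -1 when sep is not present
    return value if idx == -1 else value[:idx]

def truncate_column(rows, column=TAG_COLUMN, sep='|'):
    """Generator stage: truncate one column of each row, pass the rest through."""
    for row in rows:
        if column < len(row):
            row = list(row)
            row[column] = keep_before(row[column], sep)
        yield row

# Small in-memory demonstration with made-up rows
sample = io.StringIO("1,acme,finance|statistics|lisp\n2,globex,sales\n")
cleaned = list(truncate_column(csv.reader(sample)))
print(cleaned)
```

In the original script this would presumably mean building csv_in as truncate_column(csv.reader(url_escaper(file_in))), leaving the validation loop unchanged. The value[:idx] slice keeps the characters from the start up to, but not including, position idx, which is why the | itself is dropped.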
