
Skipping certain characters in a CSV file

I am writing a script to parse a NASDAQ file that lists every company in the technology category. It's a comma-separated CSV. However, some companies have a name like "XXX, Inc." The extra comma throws off my field positions, so the script picks up the wrong value. I'm parsing for the company's stock symbol, and the ", Inc." shifts every position that comes after it.

I'm fairly new to Python. I've managed to get it to read and write CSVs, but this parsing issue has me stuck. This is what I currently have:

try:
    # py3
    from urllib.request import Request, urlopen
    from urllib.parse import urlencode
except ImportError:
    # py2
    from urllib2 import Request, urlopen
    from urllib import urlencode

import csv
import urllib.request
import string

def _request():
    url = 'http://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&render=download'
    req = Request(url)
    resp = urlopen(req)
    content = resp.read().decode().strip()
    content1 = content.replace('"', '')
    return content1

def symbol_quote():
    counter = 1
    recursive = 9*counter

    values = _request().split(',')
    values2 = values[recursive]
    return values2
    counter += 1


def csvwrite():
    import csv
    path = "symbol_comp.csv"
    data = [symbol_quote()]
    parsing = False

    with open(path, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file, delimiter=' ')
        for line in data:
            writer.writerow(line)

I haven't made it loop over a counter yet because there's no point until the parsing issue is fixed.

Could anyone please help a newbie out?

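Change _request() to wrap the downloaded text in a StringIO object and pass it to csv.reader(), then return the reader so you can iterate over it. csv.reader() treats a comma inside a quoted field as part of that field, so don't strip the double quotes first. cStringIO exists only in Python 2, so import StringIO alongside the other version-specific imports:

try:
    # py3
    from urllib.request import Request, urlopen
    from urllib.parse import urlencode
    from io import StringIO
except ImportError:
    # py2
    from urllib2 import Request, urlopen
    from urllib import urlencode
    from cStringIO import StringIO

import csv

def _request():
    url = 'http://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&render=download'
    req = Request(url)
    resp = urlopen(req)
    # Keep the double quotes: csv.reader needs them to handle commas inside fields.
    sio = StringIO(resp.read().decode().strip())
    reader = csv.reader(sio)
    return reader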

Usage (next() and print() in this form work on both Python 2 and 3):

data = _request()
print('fields:\n{}\n'.format('|'.join(next(data))))
for n, row in enumerate(data):
    print('|'.join(row))
    if n == 5: break

# fields:
# Symbol|Name|LastSale|MarketCap|ADR TSO|IPOyear|Sector|Industry|Summary Quote|
# 
# VNET|21Vianet Group, Inc.|25.87|1137471769.46|43968758|2011|Technology|Computer Software: Programming, Data Processing|http://www.nasdaq.com/symbol/vnet|
# TWOU|2U, Inc.|13.28|534023394.4|n/a|2014|Technology|Computer Software: Prepackaged Software|http://www.nasdaq.com/symbol/twou|
# DDD|3D Systems Corporation|54.4|5630941606.4|n/a|n/a|Technology|Computer Software: Prepackaged Software|http://www.nasdaq.com/symbol/ddd|
# JOBS|51job, Inc.|64.32|746633699.52|11608111|2004|Technology|Diversified Commercial Services|http://www.nasdaq.com/symbol/jobs|
# WUBA|58.com Inc.|37.25|2959078388.5|n/a|2013|Technology|Computer Software: Programming, Data Processing|http://www.nasdaq.com/symbol/wuba|
# ATEN|A10 Networks, Inc.|10.64|638979699.12|n/a|2014|Technology|Computer Communications Equipment|http://www.nasdaq.com/symbol/aten|
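
If you want to see the difference without hitting the network, here is a minimal Python 3 sketch. It assumes the download quotes every field and ends each line with a trailing comma, which is what the output above suggests. SAMPLE, naive_fields, read_symbols and write_symbols are names made up for this illustration; the two data rows are copied from the output above.

```python
import csv
import io

# Hypothetical sample mimicking the download: quoted fields, trailing comma.
SAMPLE = (
    '"Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear",'
    '"Sector","Industry","Summary Quote",\n'
    '"VNET","21Vianet Group, Inc.","25.87","1137471769.46","43968758","2011",'
    '"Technology","Computer Software: Programming, Data Processing",'
    '"http://www.nasdaq.com/symbol/vnet",\n'
    '"DDD","3D Systems Corporation","54.4","5630941606.4","n/a","n/a",'
    '"Technology","Computer Software: Prepackaged Software",'
    '"http://www.nasdaq.com/symbol/ddd",\n'
)


def naive_fields(line):
    """The approach from the question: strip quotes, then split on commas."""
    return line.replace('"', '').split(',')


def read_symbols(text):
    """Parse CSV text and return the values of the 'Symbol' column."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    col = header.index('Symbol')  # look the column up by name, not by position
    return [row[col] for row in reader if row]


def write_symbols(symbols, path):
    """Write one symbol per row to a CSV file."""
    with open(path, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['Symbol'])
        for symbol in symbols:
            # Wrap the string in a list: writerow() on a bare string
            # would write each character as its own field.
            writer.writerow([symbol])


if __name__ == '__main__':
    vnet_line = SAMPLE.splitlines()[1]
    print(len(naive_fields(vnet_line)))           # more fields than columns
    print(len(next(csv.reader([vnet_line]))))     # one field per column
    write_symbols(read_symbols(SAMPLE), 'symbol_comp.csv')
```

Looking the column up by name with header.index('Symbol') also means you no longer need the 9*counter arithmetic from the question, since each row comes back as its own list.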
