
Skipping certain characters in a CSV file

I am writing a script to parse a NASDAQ file of every company listed under the technology category. It's a comma-separated CSV. However, some companies have their name listed as "XXX, Inc.", and that extra comma throws off the field positions in my script, so it reads the wrong value. I'm parsing for the company stock symbol, so the ", Inc." shifts every position after it.

I'm fairly new to Python and don't have much experience with it. I have managed to get it to read and write CSVs, but this parsing issue is difficult for me. This is what I currently have:

try:
    # py3
    from urllib.request import Request, urlopen
    from urllib.parse import urlencode
except ImportError:
    # py2
    from urllib2 import Request, urlopen
    from urllib import urlencode

import csv
import urllib.request
import string

def _request():
    url = 'http://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&render=download'
    req = Request(url)
    resp = urlopen(req)
    content = resp.read().decode().strip()
    content1 = content.replace('"', '')
    return content1

def symbol_quote():
    counter = 1
    recursive = 9*counter

    values = _request().split(',')
    values2 = values[recursive]
    return values2
    counter += 1


def csvwrite():
    import csv
    path = "symbol_comp.csv"
    data = [symbol_quote()]
    parsing = False

    with open(path, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file, delimiter=' ')
        for line in data:
            writer.writerow(line)

I haven't made it loop according to a counter yet because there's no point right now. This parsing issue is more pressing.

Could anyone please help a newbie out?

Change _request() to wrap the response text in a StringIO object (cStringIO.StringIO() on Python 2, io.StringIO() on Python 3) and pass it to csv.reader(), then return the csv.reader object, which you can iterate over:

try:
    # py3
    from urllib.request import Request, urlopen
    from io import StringIO
except ImportError:
    # py2
    from urllib2 import Request, urlopen
    from cStringIO import StringIO

import csv

def _request():
    url = 'http://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&render=download'
    req = Request(url)
    resp = urlopen(req)
    # Wrap the decoded text in a file-like object so csv.reader can parse it.
    # csv.reader keeps quoted fields such as "21Vianet Group, Inc." in one piece.
    sio = StringIO(resp.read().decode().strip())
    reader = csv.reader(sio)
    return reader
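
To see why this helps without touching the network, here is a minimal sketch for Python 3. The sample rows are taken from the output further down in this answer; the double-quoted layout is an assumption about how the download is formatted (the question's code strips double quotes, which suggests the fields are quoted). The names SAMPLE, naive and rows are made up for illustration.

```python
import csv
import io

# Assumed sample: rows in the quoted format the download appears to use.
SAMPLE = (
    '"Symbol","Name","LastSale"\n'
    '"VNET","21Vianet Group, Inc.","25.87"\n'
    '"DDD","3D Systems Corporation","54.4"\n'
)

# Approach from the question: strip the quotes, then split on commas.
# The comma inside the company name produces an extra field.
naive = SAMPLE.splitlines()[1].replace('"', '').split(',')
print(naive)

# csv.reader honours the quotes, so the embedded comma stays inside one field.
rows = list(csv.reader(io.StringIO(SAMPLE)))
print(rows[1])
```

With the naive split the VNET row has four fields instead of three, which is the shift described in the question; csv.reader returns three.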

Usage:

data = _request()
# next(data) and print(...) work on both Python 2.6+ and Python 3
print('fields:\n{}\n'.format('|'.join(next(data))))
for n, row in enumerate(data):
    print('|'.join(row))
    if n == 5: break

# fields:
# Symbol|Name|LastSale|MarketCap|ADR TSO|IPOyear|Sector|Industry|Summary Quote|
# 
# VNET|21Vianet Group, Inc.|25.87|1137471769.46|43968758|2011|Technology|Computer Software: Programming, Data Processing|http://www.nasdaq.com/symbol/vnet|
# TWOU|2U, Inc.|13.28|534023394.4|n/a|2014|Technology|Computer Software: Prepackaged Software|http://www.nasdaq.com/symbol/twou|
# DDD|3D Systems Corporation|54.4|5630941606.4|n/a|n/a|Technology|Computer Software: Prepackaged Software|http://www.nasdaq.com/symbol/ddd|
# JOBS|51job, Inc.|64.32|746633699.52|11608111|2004|Technology|Diversified Commercial Services|http://www.nasdaq.com/symbol/jobs|
# WUBA|58.com Inc.|37.25|2959078388.5|n/a|2013|Technology|Computer Software: Programming, Data Processing|http://www.nasdaq.com/symbol/wuba|
# ATEN|A10 Networks, Inc.|10.64|638979699.12|n/a|2014|Technology|Computer Communications Equipment|http://www.nasdaq.com/symbol/aten|
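
Since the goal in the question is the stock symbol, the reader can then be reduced to its first column and written out. The following is only a sketch for Python 3: write_symbols is a hypothetical helper, not part of the code above, and the in-memory sample stands in for the reader returned by _request() so that it runs offline. It assumes the symbol is the first column, as in the field list shown above.

```python
import csv
import io

def write_symbols(reader, path):
    """Write the first column of each data row to path, one symbol per row."""
    next(reader)  # skip the header row
    symbols = [row[0] for row in reader if row]
    with open(path, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        for symbol in symbols:
            # writerow() expects a sequence of fields; a bare string would be
            # written one character per column, so wrap it in a list.
            writer.writerow([symbol])
    return symbols

# Stand-in for the reader returned by _request() (assumed quoted format).
sample = io.StringIO(
    '"Symbol","Name"\n'
    '"VNET","21Vianet Group, Inc."\n'
    '"TWOU","2U, Inc."\n'
)
symbols = write_symbols(csv.reader(sample), 'symbol_comp.csv')
print(symbols)
```

Passing each symbol as a one-element list is the important detail: the question's csvwrite() hands writerow() a plain string, which csv.writer treats as a sequence of single-character fields.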


 