Skipping certain characters in a CSV file

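I'm writing a script to parse a NASDAQ file that lists every company under the Technology category. The file is a comma-separated CSV. However, a company's name is sometimes listed as "XXX, Inc.", and that comma throws off the list in my script, so it picks up the wrong value. I'm parsing out the company stock symbols, so the ", Inc." shifts every position after it.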

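I'm still very new to Python, so I don't have much experience with it. I've been doing my best and have gotten the script to read and write CSVs, but this parsing problem is a hard one for me. Here is what I have so far: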

try:
    # py3
    from urllib.request import Request, urlopen
    from urllib.parse import urlencode
except ImportError:
    # py2
    from urllib2 import Request, urlopen
    from urllib import urlencode

import csv
import urllib.request
import string

def _request():
    url = 'http://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&render=download'
    req = Request(url)
    resp = urlopen(req)
    content = resp.read().decode().strip()
    content1 = content.replace('"', '')
    return content1

def symbol_quote():
    counter = 1
    recursive = 9*counter

    values = _request().split(',')
    values2 = values[recursive]
    return values2
    counter += 1


def csvwrite():
    import csv
    path = "symbol_comp.csv"
    data = [symbol_quote()]
    parsing = False

    with open(path, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file, delimiter=' ')
        for line in data:
            writer.writerow(line)

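I haven't yet made it loop and step through the data with a counter, because there's no point until the parsing works. The parsing problem is more pressing.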

Can anyone help a newbie?

Change _request() to use csv.reader() with cStringIO.StringIO() (io.StringIO on Python 3) and return a csv.reader object that you can iterate over:

try:
    # py3
    from urllib.request import Request, urlopen
    from io import StringIO
except ImportError:
    # py2
    from urllib2 import Request, urlopen
    from cStringIO import StringIO

import csv

def _request():
    url = 'http://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&render=download'
    req = Request(url)
    resp = urlopen(req)
    # Wrap the downloaded text in a file-like object so csv.reader can parse it.
    # csv.reader keeps a quoted field such as "XXX, Inc." together as one column.
    sio = StringIO(resp.read().decode().strip())
    reader = csv.reader(sio)
    return reader

Usage:

data = _request()
# next(data) and single-argument print() work on both Python 2 and 3
print('fields:\n{}\n'.format('|'.join(next(data))))
for n, row in enumerate(data):
    print('|'.join(row))
    if n == 5: break

# fields:
# Symbol|Name|LastSale|MarketCap|ADR TSO|IPOyear|Sector|Industry|Summary Quote|
# 
# VNET|21Vianet Group, Inc.|25.87|1137471769.46|43968758|2011|Technology|Computer Software: Programming, Data Processing|http://www.nasdaq.com/symbol/vnet|
# TWOU|2U, Inc.|13.28|534023394.4|n/a|2014|Technology|Computer Software: Prepackaged Software|http://www.nasdaq.com/symbol/twou|
# DDD|3D Systems Corporation|54.4|5630941606.4|n/a|n/a|Technology|Computer Software: Prepackaged Software|http://www.nasdaq.com/symbol/ddd|
# JOBS|51job, Inc.|64.32|746633699.52|11608111|2004|Technology|Diversified Commercial Services|http://www.nasdaq.com/symbol/jobs|
# WUBA|58.com Inc.|37.25|2959078388.5|n/a|2013|Technology|Computer Software: Programming, Data Processing|http://www.nasdaq.com/symbol/wuba|
# ATEN|A10 Networks, Inc.|10.64|638979699.12|n/a|2014|Technology|Computer Communications Equipment|http://www.nasdaq.com/symbol/aten|
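
To make the difference concrete, here is a minimal Python 3 sketch that runs without the download. The SAMPLE text is a made-up excerpt that assumes the file quotes its fields (the question's code strips double quotes, which suggests it does), and naive_fields, read_symbols and write_symbols are illustrative names, not part of the original code:

```python
import csv
import io

# Hypothetical excerpt, assuming the download quotes every field
SAMPLE = (
    '"Symbol","Name","LastSale"\n'
    '"VNET","21Vianet Group, Inc.","25.87"\n'
    '"DDD","3D Systems Corporation","54.4"\n'
)

def naive_fields(line):
    """Strip the quotes and split on every comma, as the question's code does."""
    return line.replace('"', '').split(',')

def read_symbols(text):
    """Return the Symbol column, letting csv.reader handle quoted commas."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    col = header.index('Symbol')
    return [row[col] for row in reader if row]

def write_symbols(symbols, path):
    """Write one symbol per row to the file at path."""
    with open(path, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        for symbol in symbols:
            # Wrap the symbol in a list so it is written as a single field
            writer.writerow([symbol])

first_row = SAMPLE.splitlines()[1]
print(naive_fields(first_row))  # the name is split in two at ", Inc."
print(read_symbols(SAMPLE))
```

Looking the column up by its header name avoids counting positions by hand. Note also that writerow() expects a sequence of fields: passing it a bare string, as csvwrite() in the question does, writes each character as a separate field.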
