Parsing two different types of CSV format in Python

I have several CSV files, each with a different format. Here is a sample from two different CSV files. Please look at the format, not the values.

 csv_2  "xxxx-0147-xxxx-194443,""Jan 1, 2017"",7:43:43 AM PST,,Google fee,,Smart Plan (Calling & Texting),com.yuilop,1,unlimited_usca_tariff_and,mimir,US,TX,76501,USD,-3.00,0.950210,EUR,-2.85"
 csv_2  "1305-xxxx-0118-54476..1,""Jan 1, 2017"",7:17:31 AM PST,,Google fee,,Smart Plan (Calling & Texting),com.yuilop,1,unlimited_usca_tariff_and,htc_a13wlpp,US,TX,79079,USD,-3.00,0.950210,EUR,-2.85"
 csv_1  GPA.xxxx-2612-xxxx-44448..0,2017-02-01,1485950845,Charged,m1,Freedom Plan (alling & Texting),com.yuilop,subscription,basic_usca_tariff_and,USD,2.99,0.00,2.99,,,07605,US
 csv_1  GPA.xxxx-6099-9725-56125,2017-02-01,1485952917,Charged,athene_f,Buy 100 credits (Calling & Texting),com.yuilop,inapp,100_credits,INR,138.41,0.00,138.41,Kolkata,West Bengal,700007,IN

As you can see, each csv_2 row is wrapped in " and sometimes contains "", whereas csv_1 is a simple format. I receive all the CSVs on demand, and they are numerous and huge. I tried to use csv.Sniffer to recognise the dialect automatically, but this is not enough: I don't get a reasonable result for the file that contains "". Can anybody guide me on how to solve this problem?

Python 2.7 code:

import csv

with open(file, 'rU') as csvfile:
    # Guess the dialect from the first 2024 characters, then rewind
    dialect = csv.Sniffer().sniff(csvfile.read(2024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    for line in reader:
        print line

Parameter values:

 dialect.escapechar     None
 dialect.quotechar      "
 dialect.quoting        0
 dialect.delimiter      ,
 dialect.doublequote    False

Result:

csv_1 ['GPA.13xx-xxxx-9725-5xxx', '2017-02-01', '1485952917', 'Charged', 'athene_f', 'Buy 100 credits (Calling & Texting)', 'com.yuilop', 'inapp', '100_credits', 'INR', '138.41', '0.00', '138.41', 'Kolkata', 'West Bengal', '700007', 'IN']
csv_2  ['1330-xxxx-5560-xxxx,"Jan 1', ' 2017""', '12:35:13 AM PST', '', 'Google fee', '', 'Smart Plan (Calling & Texting)', 'com.yuilop', '1', 'unlimited_usca_tariff_and', 'astar-y3', 'US', 'NC', '27288', 'USD', '-3.00', '0.950210', 'EUR', '-2.85"']
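
For context, the broken csv_2 row above is consistent with the sniffed doublequote value of False: with that setting the reader treats the first " of a "" pair as the end of the quoted field instead of as an escaped quote. The snippet below is a minimal sketch, written for Python 3 and using a shortened copy of the sample line (not the full row), that compares the sniffed setting with the csv module's default.

```python
import csv

# Shortened copy of a csv_2 line: the whole row is wrapped in quotes,
# and the quotes around the date are doubled.
line = '"xxxx-0147-xxxx-194443,""Jan 1, 2017"",7:43:43 AM PST,,Google fee"'

# doublequote=False, as reported by the sniffer in the question
as_sniffed = next(csv.reader([line], doublequote=False))
# doublequote=True, the csv module's default
as_default = next(csv.reader([line], doublequote=True))

print(len(as_sniffed), as_sniffed)
print(len(as_default), as_default)
```

With the default setting the whole line comes back as a single field, which suggests each csv_2 row is one quoted string that itself contains a CSV row.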

In csv_2 you can see a mess: the date field is split at its comma, and the whole row is treated as one quoted string. How can I change my code to get the same kind of result as for csv_1?

Why not preprocess the CSV to clean and normalize it, and then load the data like any other CSV?
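
One possible way to do this kind of normalization, sketched here for Python 3, is to parse each line twice when it turns out to be a single quoted field. This is not the approach from the question or the answer below. The helper name parse_line is hypothetical, the sample lines are shortened copies of the ones in the question, and the sketch assumes no field contains an embedded newline and that no real file has only one column.

```python
import csv

def parse_line(line):
    """Parse one raw line; unwrap it if the whole row was quoted as one field."""
    # First pass with the default dialect (comma delimiter, doublequote=True).
    row = next(csv.reader([line]))
    if len(row) == 1:
        # The entire row was exported as a single quoted field:
        # parse that field's content as a CSV row in its own right.
        row = next(csv.reader([row[0]]))
    return row

wrapped = '"xxxx-0147-xxxx-194443,""Jan 1, 2017"",7:43:43 AM PST,,Google fee"'
plain = 'GPA.xxxx-6099-9725-56125,2017-02-01,1485952917,Charged'

print(parse_line(wrapped))
print(parse_line(plain))
```

Because the second pass honours the inner quotes, the date stays together as one field ("Jan 1, 2017") and no regex merge is needed afterwards. For large files, the same function could be applied to each line while iterating over the open file rather than reading it all at once.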

You're one step from working code. All you've got to do is first remove the " characters from the contents of csvfile; then your current approach will work just fine.

EDIT: However, if you're interested in merging the date strings that were split apart when the CSV file was read, your best bet is a regex match. I've added some code to my original answer. I've copied most of the regex code (with edits) from this older answer.

import re
import csv

# A "Mon D" fragment such as "Jan 1": the first half of a date split at its comma
pattern = r'(%s) +(%s)$'
thirties = pattern % ("Sep|Apr|Jun|Nov", r'[1-9]|[12]\d|30')
thirtyones = pattern % ("Jan|Mar|May|Jul|Aug|Oct|Dec", r'[1-9]|[12]\d|3[01]')
feb = pattern % ("Feb", r'[1-9]|1\d|2\d')  # 1-29 any year (including potential leap years)

result = '|'.join('(?:%s)' % x for x in (thirties, thirtyones, feb))
r = re.compile(result, re.IGNORECASE)

# `file` is the path to the CSV, as in the question
with open(file, 'rU') as csvfile:
    # Sniff the dialect on a sample with the pesky double-quotes removed
    sample = csvfile.read(2024).replace('"', '')
    dialect = csv.Sniffer().sniff(sample)
    csvfile.seek(0)

    # Strip the quotes line by line so a huge file is never held in memory
    no_quotes_lines = (line.replace('"', '') for line in csvfile)

    for row in csv.reader(no_quotes_lines, dialect):
        for ind in range(len(row) - 1):
            if r.match(row[ind]):
                # If you've found a date string, a year string will follow
                row[ind:ind + 2] = ["%s, %s" % (row[ind], row[ind + 1].strip())]
                break
        print row
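
To try the merge step without a file, the following is a small self-contained sketch for Python 3. The helper name merge_split_date and the simplified pattern (any day from 1 to 31 for every month) are assumptions made for illustration, and the sample is a shortened copy of a csv_2 line from the question.

```python
import csv
import re

# Matches a "Mon D" fragment such as "Jan 1" (simplified: any day 1-31).
DATE_HEAD = re.compile(
    r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +([1-9]|[12]\d|3[01])$',
    re.IGNORECASE)

def merge_split_date(row):
    """Return a copy of row with a 'Mon D' field merged with the year after it."""
    merged = list(row)
    for ind in range(len(merged) - 1):
        if DATE_HEAD.match(merged[ind]):
            # Rejoin the two halves of the date with the comma that was lost
            merged[ind:ind + 2] = ["%s, %s" % (merged[ind], merged[ind + 1].strip())]
            break
    return merged

line = '"xxxx-0147-xxxx-194443,""Jan 1, 2017"",7:43:43 AM PST,,Google fee"'
# Remove every double quote first, as the answer suggests, then parse
row = next(csv.reader([line.replace('"', '')]))
print(merge_split_date(row))
```

The merged row carries the date as the single field "Jan 1, 2017", so it has one field per column like the csv_1 rows.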
