How can I read *.csv files that have numbers with commas using pandas?

I want to read a *.csv file that has numbers with commas in them.

For example,

File.csv

Date, Time, Open, High, Low, Close, Volume
2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201 # The last value is 1201, not 201
2016/11/09,12:09:00,'4361,'4362,'4353,'4355,1,117 # The last value is 1117, not 117
2016/11/09,12:08:00,'4364,'4374,'4359,'4360,10,175 # The last value is 10175, not 175
2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590
2016/11/09,12:06:00,'4359,'4372,'4358,'4369,420
2016/11/09,12:05:00,'4365,'4367,'4356,'4359,542
2016/11/09,12:04:00,'4379,'1380,'4360,'4365,1,697 # The last value is 1697, not 697
2016/11/09,12:03:00,'4394,'4396,'4376,'4381,1,272 # The last value is 1272, not 272
2016/11/09,12:02:00,'4391,'4399,'4390,'4393,524
...
2014/07/10,12:05:00,'10195,'10300,'10155,'10290,219,271 # The last value is 219271, not 271
2014/07/09,12:04:00,'10345,'10360,'10185,'10194,235,711 # The last value is 235711, not 711
2014/07/08,12:03:00,'10339,'10420,'10301,'10348,232,050 # The last value is 232050, not 050

It actually has 7 columns, but the values in the last column sometimes contain commas, and pandas takes them as extra columns.

My question is whether there is a way to make pandas split only on the first 6 commas and ignore the rest when it reads the columns, or a way to delete every comma after the 6th one. (I'm sorry, but I can't think of any function that does that.)

Thank you for reading this :)

One more way to solve your problem.

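import re
import pandas as pd

rows = []
with open('/home/yusuf/Desktop/c1') as f:
    # First line holds the column names; strip the space after each comma
    headers = [x.strip() for x in f.readline().strip('\n').split(',')]
    for line in f:
        # Two plain fields, four fields prefixed with ', then the rest as Volume
        match = re.findall(r"(.*?),(.*?),'(.*?),'(.*?),'(.*?),'(.*?),(.*)", line)
        if match:  # skip lines that do not fit the pattern
            rows.append(list(match[0]))
df = pd.DataFrame(data=rows, columns=headers)
# Drop the thousands separators from the last column
df['Volume'] = df['Volume'].apply(lambda x: x.replace(",", ""))
df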

Regex Demo:
https://regex101.com/r/o1zxtO/2
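To see what that pattern captures, here is a small sketch that applies it to a single row copied from the question. The variable names `pattern`, `line`, and `fields` are only illustrative.

```python
import re

# Pattern from the answer above: two plain fields, four fields prefixed
# with an apostrophe, then everything that is left as the Volume field.
pattern = r"(.*?),(.*?),'(.*?),'(.*?),'(.*?),'(.*?),(.*)"

# One sample row from the question, including its trailing newline
line = "2016/11/09,12:08:00,'4364,'4374,'4359,'4360,10,175\n"

fields = list(re.findall(pattern, line)[0])
fields[-1] = fields[-1].replace(",", "")  # "10,175" becomes "10175"
print(fields)
```

The last group is greedy, so it swallows every remaining comma on the line, which is what keeps the Volume value in one piece.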

I'm pretty sure pandas can't handle that on its own, but you can easily fix the final column first. One approach in Python:

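    COLUMNAMOUNT = 7  # number of real columns in the file

    with open('yourfile.csv') as csv, open('newcsv.csv', 'w') as result:
        for line in csv:
            columns = line.split(',')
            if len(columns) > COLUMNAMOUNT:
                # Glue the overflow pieces back onto the last real column
                columns[COLUMNAMOUNT-1] += ''.join(columns[COLUMNAMOUNT:])
            # Write only the real columns; the last one keeps the newline
            result.write(','.join(columns[:COLUMNAMOUNT]))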

Now you can load the new CSV into pandas. Other options are AWK or even a shell script.
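As a rough sketch of that loading step, the same merge-the-overflow idea can also be run in memory and handed straight to `pd.read_csv`. The sample text, the `raw` and `fixed_lines` names, the removal of the leading apostrophes, and the use of `skipinitialspace=True` are assumptions made for this example, not part of the answer above.

```python
import io
import pandas as pd

COLUMNAMOUNT = 7  # assumed number of real columns

# Two rows copied from the question, standing in for the real file
raw = """Date, Time, Open, High, Low, Close, Volume
2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201
2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590
"""

fixed_lines = []
for line in raw.splitlines():
    # Drop the leading apostrophes so the prices parse as numbers
    columns = line.replace("'", "").split(',')
    if len(columns) > COLUMNAMOUNT:
        # Merge the overflow pieces back into the last real column
        columns[COLUMNAMOUNT - 1] += ''.join(columns[COLUMNAMOUNT:])
    fixed_lines.append(','.join(columns[:COLUMNAMOUNT]))

# skipinitialspace=True removes the space after each comma in the header
df = pd.read_csv(io.StringIO('\n'.join(fixed_lines)), skipinitialspace=True)
print(df)
```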

You can do all of it in Python without having to save the data to a new file. The idea is to clean the data and put it in a dictionary-like format that pandas can turn into a dataframe. The following should be a decent starting point:

from collections import defaultdict
from collections import OrderedDict
import pandas as pd

# Import the data
with open('prices.csv') as f:
    data = f.readlines()

# Split on the first 6 commas
data = [x.strip().replace("'","").split(",",6) for x in data]

# Get the headers
headers = [x.strip() for x in data[0]]

# Get the remaining rows and remove the commas left inside each field
remainings = [list(map(lambda y: y.replace(",",""), x)) for x in data[1:]]

# Create a dictionary-like container
output = defaultdict(list)

# Loop through the data and save the rows accordingly
for n, header in enumerate(headers):
    for row in remainings:
        output[header].append(row[n])

# Save it in an ordered dictionary to maintain the order of columns
output = OrderedDict((k,output.get(k)) for k in headers)
# Convert your raw data into a pandas dataframe
df = pd.DataFrame(output)

# Print it
print(df)

This yields:

         Date      Time  Open  High   Low Close Volume
0  2016/11/09  12:10:00  4355  4358  4346  4351   1201
1  2016/11/09  12:09:00  4361  4362  4353  4355   1117
2  2016/11/09  12:08:00  4364  4374  4359  4360  10175
3  2016/11/09  12:07:00  4371  4376  4360  4365    590
4  2016/11/09  12:06:00  4359  4372  4358  4369    420
5  2016/11/09  12:05:00  4365  4367  4356  4359    542
6  2016/11/09  12:04:00  4379  1380  4360  4365   1697
7  2016/11/09  12:03:00  4394  4396  4376  4381   1272
8  2016/11/09  12:02:00  4391  4399  4390  4393    524
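Because every value in that frame comes from string splitting, the columns hold strings rather than numbers. The following is a minimal sketch of one way to convert them, using a two-row stand-in for the frame above. The `Timestamp` column name is hypothetical, chosen only for this example.

```python
import pandas as pd

# A two-row stand-in for the frame built above; every value is a string
df = pd.DataFrame({
    "Date": ["2016/11/09", "2016/11/09"],
    "Time": ["12:10:00", "12:09:00"],
    "Open": ["4355", "4361"],
    "High": ["4358", "4362"],
    "Low": ["4346", "4353"],
    "Close": ["4351", "4355"],
    "Volume": ["1201", "1117"],
})

# Convert the price and volume columns from strings to numbers
numeric_cols = ["Open", "High", "Low", "Close", "Volume"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

# Combine date and time into a single timestamp column (hypothetical name)
df["Timestamp"] = pd.to_datetime(
    df["Date"] + " " + df["Time"], format="%Y/%m/%d %H:%M:%S"
)
print(df.dtypes)
```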

The starting file (prices.csv) is the following:

Date, Time, Open, High, Low, Close, Volume
2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201
2016/11/09,12:09:00,'4361,'4362,'4353,'4355,1,117
2016/11/09,12:08:00,'4364,'4374,'4359,'4360,10,175
2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590
2016/11/09,12:06:00,'4359,'4372,'4358,'4369,420
2016/11/09,12:05:00,'4365,'4367,'4356,'4359,542
2016/11/09,12:04:00,'4379,'1380,'4360,'4365,1,697
2016/11/09,12:03:00,'4394,'4396,'4376,'4381,1,272
2016/11/09,12:02:00,'4391,'4399,'4390,'4393,524

I hope this helps.

I guess pandas can't handle it, so I would pre-process the file with Perl to generate a new CSV and work on that.

Using Perl split can help you in this situation

perl -pne '$_ = join("|", split(/,/, $_, 7) )' < input.csv > output.csv

Then you can use the usual read_csv on the output file with | as the separator.
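A sketch of that read step in Python is shown below. The in-memory `piped` text stands in for output.csv as the one-liner would write it. Passing `thousands=","` to turn values like `1,201` into numbers and stripping the leading apostrophes afterwards are additions assumed for this example, not something the answer specifies.

```python
import io
import pandas as pd

# Stand-in for output.csv: only the first six commas became pipes
piped = """Date| Time| Open| High| Low| Close| Volume
2016/11/09|12:10:00|'4355|'4358|'4346|'4351|1,201
2016/11/09|12:07:00|'4371|'4376|'4360|'4365|590
"""

df = pd.read_csv(
    io.StringIO(piped),
    sep="|",
    thousands=",",          # parse "1,201" as 1201
    skipinitialspace=True,  # drop the space after each separator in the header
)

# The price columns still carry a leading apostrophe; strip it and convert
for col in ["Open", "High", "Low", "Close"]:
    df[col] = pd.to_numeric(df[col].str.lstrip("'"))

print(df)
```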
