How can I read *.csv files that have numbers with commas using pandas?

Question

I want to read a *.csv file that has numbers with commas.

For example,

File.csv

Date, Time, Open, High, Low, Close, Volume
2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201 # The last value is 1201, not 201
2016/11/09,12:09:00,'4361,'4362,'4353,'4355,1,117 # The last value is 1117, not 117
2016/11/09,12:08:00,'4364,'4374,'4359,'4360,10,175 # The last value is 10175, not 175
2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590
2016/11/09,12:06:00,'4359,'4372,'4358,'4369,420
2016/11/09,12:05:00,'4365,'4367,'4356,'4359,542
2016/11/09,12:04:00,'4379,'1380,'4360,'4365,1,697 # The last value is 1697, not 697
2016/11/09,12:03:00,'4394,'4396,'4376,'4381,1,272 # The last value is 1272, not 272
2016/11/09,12:02:00,'4391,'4399,'4390,'4393,524
...
2014/07/10,12:05:00,'10195,'10300,'10155,'10290,219,271 # The last value is 219271, not 271
2014/07/09,12:04:00,'10345,'10360,'10185,'10194,235,711 # The last value is 235711, not 711
2014/07/08,12:03:00,'10339,'10420,'10301,'10348,232,050 # The last value is 232050, not 050

It actually has 7 columns, but the values in the last column sometimes contain commas, and pandas takes them as extra columns.
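A small check illustrates the problem. It is only a sketch: it uses Python's standard csv module rather than pandas, and three lines copied from the sample above. Any comma-based parser sees the thousands separator as one more field separator, so the row lengths disagree:

```python
import csv
import io

# Three lines copied from the sample file above.
sample = (
    "Date, Time, Open, High, Low, Close, Volume\n"
    "2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201\n"
    "2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590\n"
)

# A plain comma split cannot tell a thousands separator from a field separator.
field_counts = [len(row) for row in csv.reader(io.StringIO(sample))]
print(field_counts)  # [7, 8, 7]
```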

My question is whether there is a way to make pandas treat only the first 6 commas as separators and ignore the rest when it reads the columns, or a way to delete any commas after the 6th comma. (I'm sorry, but I can't think of any functions to do that.)

Thank you for reading this :)

Answer 1

One more way to solve your problem.

import re
import pandas as pd

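l1 = []
with open('/home/yusuf/Desktop/c1') as f:
    # The first line holds the headers; a list comprehension gives a real list on Python 2 and 3
    headers = [x.strip() for x in f.readline().strip('\n').split(',')]
    for a in f.readlines():
        # Capture the first six fields; the last group takes the rest of the line
        b = re.findall("(.*?),(.*?),'(.*?),'(.*?),'(.*?),'(.*?),(.*)", a)
        l1.append(list(b[0]))
df = pd.DataFrame(data=l1, columns=headers)
# Drop the thousands separators from the Volume column
df['Volume'] = df['Volume'].apply(lambda x: x.replace(",", ""))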
df

Regex Demo:
https://regex101.com/r/o1zxtO/2
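To see what the pattern captures, here is a minimal sketch that applies the same regex to a single row taken from the question. The variable names are only for illustration:

```python
import re

# Same pattern as in the answer: two plain fields, four fields that start
# with an apostrophe, then whatever is left of the line as the last field.
pattern = re.compile(r"(.*?),(.*?),'(.*?),'(.*?),'(.*?),'(.*?),(.*)")

line = "2016/11/09,12:08:00,'4364,'4374,'4359,'4360,10,175"
fields = list(pattern.match(line).groups())

# The last group still contains the thousands separator; remove it.
fields[-1] = fields[-1].replace(",", "")
print(fields)
# ['2016/11/09', '12:08:00', '4364', '4374', '4359', '4360', '10175']
```

Note that the pattern relies on the four price fields always starting with an apostrophe, as they do in the sample file.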

Answer 2

I'm pretty sure pandas can't handle that, but you can easily fix the final column. An approach in Python:

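    COLUMNAMOUNT = 7  # number of real columns in the file

    with open('yourfile.csv') as source, open('newcsv.csv', 'w') as result:
        for line in source:
            columns = line.split(',')
            if len(columns) > COLUMNAMOUNT:
                # Glue the surplus pieces back onto the last real column
                columns[COLUMNAMOUNT-1] += ''.join(columns[COLUMNAMOUNT:])
            # Write only the real columns; the last one still ends with the newline
            result.write(','.join(columns[:COLUMNAMOUNT]))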

Now you can load the new CSV into pandas. Other options include AWK or even shell scripting.
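To check the merging step on individual rows without touching any files, the same idea can be wrapped in a small function. This is a sketch: the function name and the default of 7 columns are assumptions based on the sample file in the question.

```python
EXPECTED_COLUMNS = 7  # assumption: Date, Time, Open, High, Low, Close, Volume


def merge_extra_fields(line, n_columns=EXPECTED_COLUMNS):
    """Glue any fields beyond n_columns back onto the last real column."""
    columns = line.rstrip("\n").split(",")
    if len(columns) > n_columns:
        columns[n_columns - 1] += "".join(columns[n_columns:])
    return ",".join(columns[:n_columns])


with_comma = merge_extra_fields(
    "2014/07/10,12:05:00,'10195,'10300,'10155,'10290,219,271"
)
without_comma = merge_extra_fields(
    "2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590"
)
print(with_comma)     # 2014/07/10,12:05:00,'10195,'10300,'10155,'10290,219271
print(without_comma)  # 2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590
```

Rows that already have 7 fields pass through unchanged, so the function can be applied to every line of the file.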

Answer 3

You can do all of it in Python without having to save the data into a new file. The idea is to clean the data and put it in a dictionary-like format that pandas can turn into a dataframe. The following should be a decent starting point:

from collections import defaultdict
from collections import OrderedDict
import pandas as pd

# Import the data (the with block closes the file once it has been read)
with open('prices.csv') as f:
    data = f.readlines()

# Split on the first 6 commas
data = [x.strip().replace("'","").split(",",6) for x in data]

# Get the headers
headers = [x.strip() for x in data[0]]

# Get the rest of the data and strip the commas left in the last column
remainings = [list(map(lambda y: y.replace(",",""), x)) for x in data[1:]]

# Create a dictionary-like container
output = defaultdict(list)

# Loop through the data and save the rows accordingly
for n, header in enumerate(headers):
    for row in remainings:
        output[header].append(row[n])

# Save it in an ordered dictionary to maintain the order of columns
output = OrderedDict((k,output.get(k)) for k in headers)
# Convert your raw data into a pandas dataframe
df = pd.DataFrame(output)

# Print it
print(df)

This yields:

         Date      Time  Open  High   Low Close Volume
0  2016/11/09  12:10:00  4355  4358  4346  4351   1201
1  2016/11/09  12:09:00  4361  4362  4353  4355   1117
2  2016/11/09  12:08:00  4364  4374  4359  4360  10175
3  2016/11/09  12:07:00  4371  4376  4360  4365    590
4  2016/11/09  12:06:00  4359  4372  4358  4369    420
5  2016/11/09  12:05:00  4365  4367  4356  4359    542
6  2016/11/09  12:04:00  4379  1380  4360  4365   1697
7  2016/11/09  12:03:00  4394  4396  4376  4381   1272
8  2016/11/09  12:02:00  4391  4399  4390  4393    524

The starting file (prices.csv) is the following:

Date, Time, Open, High, Low, Close, Volume
2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201
2016/11/09,12:09:00,'4361,'4362,'4353,'4355,1,117
2016/11/09,12:08:00,'4364,'4374,'4359,'4360,10,175
2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590
2016/11/09,12:06:00,'4359,'4372,'4358,'4369,420
2016/11/09,12:05:00,'4365,'4367,'4356,'4359,542
2016/11/09,12:04:00,'4379,'1380,'4360,'4365,1,697
2016/11/09,12:03:00,'4394,'4396,'4376,'4381,1,272
2016/11/09,12:02:00,'4391,'4399,'4390,'4393,524

I hope this helps.
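The code above leaves every column as a string. If numeric values are wanted, one possible variation is to convert each row while splitting it. This is a sketch under the assumption that the four price columns and Volume are always whole numbers, as in the sample; parse_row is a hypothetical helper name:

```python
def parse_row(line):
    """Split on the first 6 commas and convert the numeric fields to int."""
    # Star-unpacking: two text fields, four prices, and the rest as volume.
    date, time, *prices, volume = line.strip().replace("'", "").split(",", 6)
    return [date, time, *map(int, prices), int(volume.replace(",", ""))]


row = parse_row("2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201\n")
print(row)
# ['2016/11/09', '12:10:00', 4355, 4358, 4346, 4351, 1201]
```

A list of such rows, together with the headers list from above, could then be passed to pd.DataFrame(rows, columns=headers).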

Answer 4

I guess pandas can't handle it, so I would pre-process the file with Perl to generate a new CSV and work on that.

Perl's split can help you in this situation:

perl -pne '$_ = join("|", split(/,/, $_, 7) )' < input.csv > output.csv

Then you can use the usual read_csv on the output file with | as the separator.
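For readers without Perl, a rough Python rendition of the same idea is sketched below: split each line on the first 6 commas only and rejoin the 7 pieces with |. The function name is hypothetical, and the rows are copied from the question.

```python
def to_pipe_separated(line, n_fields=7):
    """Split on the first n_fields - 1 commas and rejoin the pieces with '|'."""
    return "|".join(line.rstrip("\n").split(",", n_fields - 1))


rows = [
    "Date, Time, Open, High, Low, Close, Volume",
    "2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201",
    "2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590",
]
converted = [to_pipe_separated(row) for row in rows]
for row in converted:
    print(row)
# Date| Time| Open| High| Low| Close| Volume
# 2016/11/09|12:10:00|'4355|'4358|'4346|'4351|1,201
# 2016/11/09|12:07:00|'4371|'4376|'4360|'4365|590
```

The Volume values still contain their commas and the header names keep their leading spaces. When loading the result, read_csv's sep="|", thousands="," and skipinitialspace=True parameters should take care of both. The apostrophes in front of the prices are left in place and would need to be stripped separately.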

solution1
1 2016-12-10 14:36:56

solution2
1 2016-12-10 15:10:22

solution3
1 ACCEPTED 2016-12-10 15:45:21

solution4
0 2016-12-10 15:22:34
