I want to read a *.csv file that have numbers with commas.
For example,
File.csv
Date, Time, Open, High, Low, Close, Volume
2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201 # The last value is 1201, not 201
2016/11/09,12:09:00,'4361,'4362,'4353,'4355,1,117 # The last value is 1117, not 117
2016/11/09,12:08:00,'4364,'4374,'4359,'4360,10,175 # The last value is 10175, not 175
2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590
2016/11/09,12:06:00,'4359,'4372,'4358,'4369,420
2016/11/09,12:05:00,'4365,'4367,'4356,'4359,542
2016/11/09,12:04:00,'4379,'1380,'4360,'4365,1,697 # The last value is 1697, not 697
2016/11/09,12:03:00,'4394,'4396,'4376,'4381,1,272 # The last value is 1272, not 272
2016/11/09,12:02:00,'4391,'4399,'4390,'4393,524
...
2014/07/10,12:05:00,'10195,'10300,'10155,'10290,219,271 # The last value is 219271, not 271
2014/07/09,12:04:00,'10345,'10360,'10185,'10194,235,711 # The last value is 235711, not 711
2014/07/08,12:03:00,'10339,'10420,'10301,'10348,232,050 # The last value is 242050, not 050
It actually has 7 columns, but the values of the last column sometimes have commas and pandas take them as extra columns.
My questions is, if there are any methods with which I can make pandas takes only the first 6 commas and ignore the rest commas when it reads columns, or if there are any methods to delete commas after the 6th commas(I'm sorry, but I can't think of any functions to do that.)
Thank you for reading this :)
One more way to solve your problem.
import re
import pandas as pd
l1 =[]
with open('/home/yusuf/Desktop/c1') as f:
headers = map(lambda x: x.strip(), f.readline().strip('\n').split(','))
for a in f.readlines():
b = re.findall("(.*?),(.*?),'(.*?),'(.*?),'(.*?),'(.*?),(.*)",a)
l1.append(list(b[0]))
df = pd.DataFrame(data=l1, columns=headers)
df['Volume'] = df['Volume'].apply(lambda x: x.replace(",",""))
df
Output:
Regex Demo:
https://regex101.com/r/o1zxtO/2
I'm pretty sure pandas
can't handle that, but you can easily fix the final column. An approach in Python
with open('yourfile.csv') as csv, open('newcsv.csv','w') as result:
for line in csv:
columns = line.split(',')
if len(columns) > COLUMNAMOUNT:
columns[COLUMNAMOUNT-1] += ''.join(columns[COLUMNAMOUNT:])
result.write(','.join(columns[COLUMNAMOUNT-1]))
Now you can load the new csv in to pandas. Other solutions can be AWK or even shell scripting.
You can do all of it in Python without having to save the data into a new file. The idea is to clean the data and put in a dictionary-like format for pandas to grab it and turn it into a dataframe. The following should constitute a decent starting point:
from collections import defaultdict
from collections import OrderedDict
import pandas as pd
# Import the data
data = open('prices.csv').readlines()
# Split on the first 6 commas
data = [x.strip().replace("'","").split(",",6) for x in data]
# Get the headers
headers = [x.strip() for x in data[0]]
# Get the remaining of the data
remainings = [list(map(lambda y: y.replace(",",""), x)) for x in data[1:]]
# Create a dictionary-like container
output = defaultdict(list)
# Loop through the data and save the rows accordingly
for n, header in enumerate(headers):
for row in remainings:
output[header].append(row[n])
# Save it in an ordered dictionary to maintain the order of columns
output = OrderedDict((k,output.get(k)) for k in headers)
# Convert your raw data into a pandas dataframe
df = pd.DataFrame(output)
# Print it
print(df)
This yields:
Date Time Open High Low Close Volume
0 2016/11/09 12:10:00 4355 4358 4346 4351 1201
1 2016/11/09 12:09:00 4361 4362 4353 4355 1117
2 2016/11/09 12:08:00 4364 4374 4359 4360 10175
3 2016/11/09 12:07:00 4371 4376 4360 4365 590
4 2016/11/09 12:06:00 4359 4372 4358 4369 420
5 2016/11/09 12:05:00 4365 4367 4356 4359 542
6 2016/11/09 12:04:00 4379 1380 4360 4365 1697
7 2016/11/09 12:03:00 4394 4396 4376 4381 1272
8 2016/11/09 12:02:00 4391 4399 4390 4393 524
The starting file ( prices.csv
) is the following:
Date, Time, Open, High, Low, Close, Volume
2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201
2016/11/09,12:09:00,'4361,'4362,'4353,'4355,1,117
2016/11/09,12:08:00,'4364,'4374,'4359,'4360,10,175
2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590
2016/11/09,12:06:00,'4359,'4372,'4358,'4369,420
2016/11/09,12:05:00,'4365,'4367,'4356,'4359,542
2016/11/09,12:04:00,'4379,'1380,'4360,'4365,1,697
2016/11/09,12:03:00,'4394,'4396,'4376,'4381,1,272
2016/11/09,12:02:00,'4391,'4399,'4390,'4393,524
I hope this helps.
I guess pandas cant handle it so I would do a pre-processing with Perl to generate a new cvs and work on it.
Using Perl split can help you in this situation
perl -pne '$_ = join("|", split(/,/, $_, 7) )' < input.csv > output.csv
Then you can use the usual read_cvs on the output file with the seperator as |
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.