I have a question about reading bits and pieces of a CSV file. When I just read the file using

pd.read_csv(path, sep=';', na_values=[''], thousands='.', decimal=',', parse_dates=[0])
I get:
EUR 1Y 2Y 3Y
0 2013-09-25 0,198 0,307 0,485
1 2013-09-26 0,204 0,318 0,497
2 2013-09-27 0,204 0,306 0,487
3 2013-09-28 0,204 0,306 0,487
4 USD 1Y 2Y 3Y
5 2013-09-25 0,462 0,571 0,749
6 2013-09-26 0,468 0,582 0,761
7 2013-09-27 0,468 0,57 0,751
8 2013-09-28 0,468 0,57 0,751
As you can see, the data is arranged date-wise, and the data sets sit in chunks one after another (in this case the USD data comes straight after the EUR data). The currency labels trip things up a bit, and everything ends up in a single DataFrame.
What I would like to have are two separate data frames, as
EUR 1Y 2Y 3Y
0 2013-09-25 0,198 0,307 0,485
1 2013-09-26 0,204 0,318 0,497
2 2013-09-27 0,204 0,306 0,487
3 2013-09-28 0,204 0,306 0,487
USD 1Y 2Y 3Y
0 2013-09-25 0,462 0,571 0,749
1 2013-09-26 0,468 0,582 0,761
2 2013-09-27 0,468 0,57 0,751
3 2013-09-28 0,468 0,57 0,751
That is, I would like to separate the currency data sets from one another.
Any suggestions?
Here is an alternative approach to the problem. It reads the CSV into a single DataFrame and then uses a bit of data wrangling to create a currency column:
currency 1Y 2Y 3Y
date
2013-09-25 EUR 0,198 0,307 0,485
2013-09-26 EUR 0,204 0,318 0,497
2013-09-27 EUR 0,204 0,306 0,487
2013-09-28 EUR 0,204 0,306 0,487
2013-09-25 USD 0,462 0,571 0,749
2013-09-26 USD 0,468 0,582 0,761
2013-09-27 USD 0,468 0,57 0,751
2013-09-28 USD 0,468 0,57 0,751
You could then "split" the DataFrame into smaller DataFrames according to the currency using groupby:

groups = df.groupby(['currency'])
for key, grp in groups:
    print(grp)
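If you want the two frames as named variables rather than printing them in a loop, one option is to collect the groups into a dict keyed by currency. A minimal sketch (the toy frame below stands in for the wrangled data; its values are invented for illustration):

```python
import pandas as pd

# Toy frame standing in for the wrangled data: a currency column plus values.
df = pd.DataFrame({
    'currency': ['EUR', 'EUR', 'USD', 'USD'],
    '1Y': [0.198, 0.204, 0.462, 0.468],
})

# groupby yields (key, sub-DataFrame) pairs; a dict makes each group addressable.
frames = {key: grp for key, grp in df.groupby('currency')}

eur = frames['EUR']   # only the EUR rows
usd = frames['USD']   # only the USD rows
print(eur)
print(usd)
```

Each sub-DataFrame keeps the original row indices, so you can still line the groups up against the full frame afterwards.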
import numpy as np
import pandas as pd

df = pd.read_csv('data', sep=';', na_values=[''], thousands='.', decimal=',',
                 names=['date', '1Y', '2Y', '3Y'])
mask = df['date'].str.contains(r'^\s*\D')   # 1
df['currency'] = (df['date']
                  .where(mask, np.nan)      # 2
                  .ffill())                 # 3
df = df.loc[~mask]                          # 4
print(df)

groups = df.groupby(['currency'])
for key, grp in groups:
    print(grp)
Use str.contains to find the values in df['date'] which begin with a non-digit. These values are presumed to be currencies. The mask is True on those rows:

In [120]: mask
Out[120]:
0     True
1    False
2    False
3    False
4    False
5     True
6    False
7    False
8    False
9    False
Name: date, dtype: bool
df['date'].where(mask, np.nan) returns a Series equal to df['date'] where the mask is True, and np.nan otherwise. Forward-fill the NaNs with the currency values:

In [123]: df['date'].where(mask, np.nan).ffill()
Out[123]:
0    EUR
1    EUR
2    EUR
3    EUR
4    EUR
5    USD
6    USD
7    USD
8    USD
9    USD
Name: date, dtype: object
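Taken together, the four steps can be run end to end without the data file, by feeding the same CSV text through io.StringIO (the inline data here is a shortened, invented stand-in for the real file):

```python
import io

import numpy as np
import pandas as pd

# Two currency chunks, each preceded by a header row, as in the original file.
csv = """EUR;1Y;2Y
2013-09-25;0,198;0,307
2013-09-26;0,204;0,318
USD;1Y;2Y
2013-09-25;0,462;0,571
2013-09-26;0,468;0,582
"""

df = pd.read_csv(io.StringIO(csv), sep=';', na_values=[''], thousands='.',
                 decimal=',', names=['date', '1Y', '2Y'])
mask = df['date'].str.contains(r'^\s*\D')            # header rows start with a non-digit
df['currency'] = df['date'].where(mask, np.nan).ffill()
df = df.loc[~mask]                                   # drop the header rows
print(df)
```

Note that because the header rows ('1Y', '2Y') sit inside the numeric columns, those columns stay as strings after parsing; convert them afterwards if you need floats.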
df.loc[~mask] selects the rows where the mask is False, thus removing the header rows.

Use the nrows and skiprows parameters of read_csv.
So for the first DataFrame, read just the first 4 rows:

eur = pd.read_csv(path, sep=';', na_values=[''], thousands='.', decimal=',', parse_dates=[0], nrows=4)

and for the subsequent DataFrame, skip the first 5 rows:

usd = pd.read_csv(path, sep=';', na_values=[''], thousands='.', decimal=',', parse_dates=[0], skiprows=5)

should work.
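Hard-coding nrows and skiprows breaks as soon as a chunk gains or loses a row. A more general sketch scans the lines for the header positions first and reads each chunk separately (the inline data and the "header starts with a non-digit" heuristic are assumptions for illustration):

```python
import io

import pandas as pd

# Invented stand-in for the file contents: header row, data rows, repeat.
text = """EUR;1Y;2Y
2013-09-25;0,198;0,307
2013-09-26;0,204;0,318
USD;1Y;2Y
2013-09-25;0,462;0,571
2013-09-26;0,468;0,582
"""

lines = text.splitlines()
# A header line is one whose first character is not a digit (dates start with one).
starts = [i for i, line in enumerate(lines) if not line[0].isdigit()]
starts.append(len(lines))  # sentinel marking the end of the last chunk

frames = {}
for begin, end in zip(starts, starts[1:]):
    name = lines[begin].split(';')[0]          # e.g. 'EUR'
    chunk = '\n'.join(lines[begin:end])        # header row + its data rows
    frames[name] = pd.read_csv(io.StringIO(chunk), sep=';', decimal=',',
                               parse_dates=[0])

print(frames['EUR'])
print(frames['USD'])
```

Because each chunk is parsed on its own, the currency header no longer pollutes the numeric columns, so decimal=',' converts the values to floats directly.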