I have a dataframe that looks like the following:
Date A B Number
2017-01-01 a b 0.9240
2017-01-01 b c 0.9101
2017-01-01 d e 0.8761
2017-01-01 c g 0.9762
2017-01-02 b c 0.5637
2017-01-02 c d 0.9643
For each day, I want a dataframe in which every value is unique across columns A and B, keeping the row with the highest value in the Number column. I think the logic would be in the following order:
As an example, from the dataframe above, because 'b' appears in both column A and column B on Jan 1st, 2017, I want to compare 0.9240 and 0.9101 and return the row with 0.9240 because it's higher than 0.9101.
The end product should look as follows:
Date A B Number
2017-01-01 a b 0.9240
2017-01-01 d e 0.8761
2017-01-01 c g 0.9762
2017-01-02 c d 0.9643
It's complex, but absolutely possible.
First let's ensure that the data is in the correct format:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
Date      6 non-null datetime64[ns]
A         6 non-null object
B         6 non-null object
Number    6 non-null float64
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 272.0+ bytes
Note that the Date column is of type datetime64. This is necessary because having those values as timestamps allows you to use the pandas resample method to group the data on a daily basis.
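If the Date column arrives as plain strings, it has to be converted before resample can bin it by day. The following is a minimal sketch of that step on a small made-up frame (the three rows are illustrative, not the full dataset), assuming pd.to_datetime can parse the date strings:

```python
import pandas as pd

# Small illustrative frame; the dates start out as plain strings.
df = pd.DataFrame({
    "Date": ["2017-01-01", "2017-01-01", "2017-01-02"],
    "A": ["a", "b", "b"],
    "B": ["b", "c", "c"],
    "Number": [0.9240, 0.9101, 0.5637],
})

# Parse the strings into datetime64 values so they can be resampled.
df["Date"] = pd.to_datetime(df["Date"])

# resample needs a datetime-like index; "D" creates one bin per calendar day.
rows_per_day = df.set_index("Date").resample("D").size()
print(rows_per_day)
```

Here size() only counts the rows in each daily bin, which is a quick way to check that the grouping behaves as expected before applying any custom logic.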
After resampling the data, a custom function extract can be applied. It receives one group (one day) as a data frame and applies the logic. Using the pandas pivot_table method makes it easier to find the intersection between columns A and B. I'm not sure this is the most efficient approach, but it should be fast enough if the dataset is not too large.
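To make the intersection step concrete, here is a small sketch for a single day. It uses plain Python sets rather than pivot_table, so it illustrates the idea and is not the exact code of this answer; the frame name day and the dictionary best_rows are made up for the example:

```python
import pandas as pd

# The rows of 2017-01-01 from the question.
day = pd.DataFrame({
    "A": ["a", "b", "d", "c"],
    "B": ["b", "c", "e", "g"],
    "Number": [0.9240, 0.9101, 0.8761, 0.9762],
})

# Values that occur in column A as well as in column B.
shared = set(day["A"]) & set(day["B"])

# For every shared value, look at the rows that contain it in either
# column and remember the one with the largest Number.
best_rows = {}
for value in sorted(shared):
    touching = day[(day["A"] == value) | (day["B"] == value)]
    best_rows[value] = touching.loc[touching["Number"].idxmax()]

print(shared)
print(best_rows["b"])
```

For this day the shared values are 'b' and 'c', and the winning rows are (a, b) and (c, g) respectively.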
The full code looks like this:
def extract(df):
    dfs = []
    pt = df.reset_index().pivot_table('Number', columns=['A', 'B'], index='Date')
    # find any intersection of values between col A and B
    intersection = set(pt.columns.levels[0].values)\
        .intersection(set(pt.columns.levels[1].values))
    # iterate over all intersections to compare their values
    # and choose the largest one
    for value in intersection:
        mask = (df['A'] == value) | (df['B'] == value)
        df_intersection = df[mask]\
            .sort_values('Number', ascending=False)
        # .iloc replaces the removed .ix indexer; [[0]] keeps the top row as a frame
        dfs.append(df_intersection.iloc[[0]])
    # find all rows that do not contain any intersections
    df_rest = df[(~df['A'].isin(list(intersection))) &\
                 (~df['B'].isin(list(intersection)))]
    if (len(df_rest) > 0):
        dfs.append(df_rest)
    return pd.concat(dfs)

df.set_index('Date')\
    .resample('D')\
    .apply(extract)\
    .reset_index(level=1, drop=True)
This code results in:
            A  B  Number
Date
2017-01-01  a  b  0.9240
2017-01-01  c  g  0.9762
2017-01-01  d  e  0.8761
2017-01-02  c  d  0.9643
The code above is based on the given dataset:
import pandas as pd
from io import StringIO
data = StringIO("""\
Date A B Number
2017-01-01 a b 0.9240
2017-01-01 b c 0.9101
2017-01-01 d e 0.8761
2017-01-01 c g 0.9762
2017-01-02 b c 0.5637
2017-01-02 c d 0.9643
""")
df = pd.read_csv(data, sep=r'\s+', parse_dates=[0])
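As a possible variation, the same per-day logic can be sketched with groupby instead of resample, which only visits days that actually have rows. This is an alternative sketch, not the code of the answer: the helper name extract_day is made up, and it assumes the frame has a unique index (such as the default RangeIndex) so that rows can be selected by label.

```python
import pandas as pd

def extract_day(group):
    """Keep the best row per shared value plus all rows without shared values."""
    shared = set(group["A"]) & set(group["B"])
    keep = []
    for value in sorted(shared):
        mask = (group["A"] == value) | (group["B"] == value)
        # Index label of the row with the largest Number among the matches.
        keep.append(group.loc[mask, "Number"].idxmax())
    # Rows in which neither column holds a shared value are kept unchanged.
    rest = group[~group["A"].isin(list(shared)) & ~group["B"].isin(list(shared))]
    keep.extend(rest.index)
    # Remove duplicate labels and restore the original row order.
    return group.loc[sorted(set(keep))]

df = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-01"] * 4 + ["2017-01-02"] * 2),
    "A": ["a", "b", "d", "c", "b", "c"],
    "B": ["b", "c", "e", "g", "c", "d"],
    "Number": [0.9240, 0.9101, 0.8761, 0.9762, 0.5637, 0.9643],
})

result = pd.concat([extract_day(group) for _, group in df.groupby("Date")])
print(result)
```

Because the rows are selected by their original index labels, the result keeps the row order of the input, which matches the order of the desired output in the question.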