
Distributing the random data records across the day using Python

I'm designing a data simulator that generates a given number of records. That number, limit, can be anything from 100 to 10000:

limit = 100

The records should be distributed across the whole day, e.g. 15% of the records in hour 0, 20% in hour 1, 5% in hour 2, and so on.

How can I simulate this kind of distribution in Python, and which library would help with the logic?

Right now I can generate records like these:

t_id    t_amount    gateway    transaction_date
101     30          Master     11/05/2016
102     10          Amex       11/05/2016

Note that the transaction date has no time component. I want timestamps like the ones below, with all 100 records distributed across the whole day. How can I achieve this?

t_id    t_amount    gateway    transaction_date
101     30          Master     11/05/2016 00:21:42
102     10          Amex       11/05/2016 01:22:42

Here's one way to generate something along the lines of what you describe. Note that limit can be made random, and so can the weights per hour.

In [78]: df.tail()
Out[78]:
                    gateway  t_amount  t_id
transaction_date
2016-11-05 03:00:00    Amex        68   195
2016-11-05 03:00:00    Amex        41   196
2016-11-05 03:00:00  Master        66   197
2016-11-05 03:00:00    Amex        59   198
2016-11-05 03:00:00    Amex        45   199

The code below pre-generates the hours from the desired number of observations (limit) and the weights per hour. It then uses NumPy's random module to generate the sample data. Check the NumPy documentation for other distributions.

import numpy as np
import pandas as pd

# Total number of observations.
limit = 10**2

# Share of transactions in each hour, converted to record counts.
# The shares must sum to 1 so that the counts add up to limit.
# Rounding before the cast keeps a float product such as 28.999... from being truncated.
weights_per_hour = np.round(np.array([.35, .25, .25, .15]) * limit).astype(int)

# Generate one timestamp per hour using pandas datetime functions.
# A Timedelta frequency is used instead of the 'H' alias, which newer pandas versions deprecate.
time_range = pd.date_range(start='20161105', freq=pd.Timedelta(hours=1), periods=4)

# Repeat each hour as many times as it has records.
time_indx = time_range.repeat(weights_per_hour)

# Create a temporary data frame to hold the ids and timestamps.
dat_dict = {"t_id": [x + 100 for x in range(limit)], "transaction_date": time_indx}
temp_df = pd.DataFrame(dat_dict)

# Choices for the gateway column.
gateway_choice = np.array(['Master', 'Amex'])

# Generate the random data.
rnd_df = pd.DataFrame({"t_amount": np.random.randint(low=1, high=100, size=limit),
                       "gateway": np.random.choice(gateway_choice, limit)})

# Attach the random data to temp_df and index by timestamp.
df = pd.concat([rnd_df, temp_df], axis=1)
df.set_index('transaction_date', inplace=True)

In the code above, the index holds pandas timestamps. Depending on how you print the frame, the time part may not be displayed, but it is stored. To turn the index back into a regular column, call df.reset_index(); if the values were ever stored as strings, pd.to_datetime(df.index) converts them to timestamps first.
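The approach above pins every record to the start of its hour, while the question asks for timestamps such as 00:21:42. Below is a minimal sketch of one way to spread records inside each hour. It is a variation, not the original answer's code: it swaps in a multinomial draw for the per-hour counts and adds a uniform random offset in seconds. The weights, the seed, and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed is an assumption, used only for reproducibility

limit = 100
# Hypothetical hourly weights, one entry per hour, summing to 1.
weights = np.array([0.35, 0.25, 0.25, 0.15])

# Draw how many records fall in each hour; the counts always sum to limit.
counts = rng.multinomial(limit, weights)

# Hour of each record, plus a uniform offset of 0..3599 seconds inside that hour.
hours = np.repeat(np.arange(len(weights)), counts)
seconds = rng.integers(0, 3600, size=limit)

start = pd.Timestamp("2016-11-05")
stamps = (start + pd.to_timedelta(hours * 3600 + seconds, unit="s")).sort_values()

df = pd.DataFrame({
    "t_id": np.arange(101, 101 + limit),
    "t_amount": rng.integers(1, 100, size=limit),
    "gateway": rng.choice(["Master", "Amex"], size=limit),
    "transaction_date": stamps,
})

# Render the timestamps in the style shown in the question (month/day/year assumed).
formatted = df["transaction_date"].dt.strftime("%m/%d/%Y %H:%M:%S")
print(formatted.head())
```

Because the counts are drawn at random, each hour's share only approximates its weight; to hit the percentages exactly, fixed counts like those in the answer's code can be passed to np.repeat instead.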

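I looked at the documentation for the random package, which is part of the standard library, and it does support generating numbers with a normal (Gaussian) distribution.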
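As a rough illustration of that suggestion, the sketch below uses random.gauss to draw times of day clustered around a peak hour. The peak hour, the spread, the seed, and the helper name gaussian_timestamp are all hypothetical choices, and draws that fall outside the day are simply redrawn.

```python
import random
from datetime import datetime, timedelta

random.seed(0)  # seed is an assumption, used only for reproducibility

DAY_START = datetime(2016, 11, 5)
SECONDS_PER_DAY = 24 * 60 * 60


def gaussian_timestamp(peak_hour=12.0, spread_hours=3.0):
    """Draw one timestamp whose time of day is normally distributed.

    peak_hour and spread_hours are hypothetical parameters (mean and
    standard deviation, in hours). Values outside the day are redrawn.
    """
    while True:
        offset = random.gauss(peak_hour * 3600, spread_hours * 3600)
        if 0 <= offset < SECONDS_PER_DAY:
            return DAY_START + timedelta(seconds=int(offset))


limit = 100
timestamps = sorted(gaussian_timestamp() for _ in range(limit))

for ts in timestamps[:3]:
    print(ts.strftime("%m/%d/%Y %H:%M:%S"))
```

A normal curve gives one smooth peak rather than arbitrary per-hour percentages. For explicit hourly weights, random.choices with its weights argument is the standard-library counterpart.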

