In the coming months, I’ll prepare some tutorials over an excellent data analysis package called pandas !
To show you the power of pandas, just take a look at this old tutorial, where I exploited the power of itertools to group sparse data into 5 seconds bins.
The magic of pandas is that, when you two arrays, “times” and “values”, then you can create a DataFrame of values, indexed by times. Once this is done, you can resample the index array and choose a grouping function (“mean”, “sum” or pass a function: np.sum, np.median, etc…).
Once the times array was defined, the core of the code was :
SECOND = 1 MINUTE = SECOND * 60 HOUR = MINUTE * 60 DAY = HOUR * 24 binning = 5*SECOND def group(di): return int(calendar.timegm(di.timetuple()))/binning list_of_dates = np.array(times,dtype=datetime.datetime) grouped_dates = [[datetime.datetime(*time.gmtime(d*binning)[:6]), len(list(g))] for d,g in itertools.groupby(list_of_dates, group)] grouped_dates = zip(*grouped_dates) plt.bar(grouped_dates[0],grouped_dates[1],width=float(binning)/DAY)
Now, doing the same with pandas is … only 4 lines of code…
binning=5 dt = pd.DataFrame(np.ones(N),index=times) rs = dt.resample("%iS"%binning,how=np.sum) rs.plot(kind='bar')
And the whole code could even be shorter by using pandas’ built-in functions to create date/time spans etc…
The full code is after the break:
import numpy as np import matplotlib.pyplot as plt import datetime, time, calendar from matplotlib.dates import num2date, DateFormatter import pandas as pd N = 100000 binning = 5 starttime = time.time() basetimes = sorted(np.random.random(N)*np.random.random(N)*1.0e3+starttime) times = [datetime.datetime(*time.gmtime(a)[:7]) for a in basetimes] for i, atime in enumerate(times): times[i] = atime + datetime.timedelta(microseconds=(basetimes[i]-int(basetimes[i])) * 1e6) dt = pd.DataFrame(np.ones(N),index=times) rs = dt.resample("%iS"%binning,how=np.sum) rs.plot(kind='bar') plt.grid(True) plt.title('Number of random datetimes per %i seconds' % binning) plt.show()