The next logical step would be to investigate and try to simulate the distributions of transaction dates and transaction amounts. To this end, statistical distributions can be fitted to the dates and amounts and, within a given timeframe, the probability of the occurrence of certain transaction types can be deduced.
From this, one can design a script that determines probability weights for transaction dates and transaction amounts based on their occurrences in the real labelled dataset that is to be augmented and enlarged. This method has the shortcoming that it will only ever be as good as the number of ‘parameters’ one can introduce by hand. Datasets generated by such hand-written algorithms can seem realistic to the naked eye, but neural networks trained on them will pick out patterns with relative ease due to the lack of complexity, as previously mentioned, and will most likely not fare well when applied to real data.
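As an illustration of how such probability weights can be derived, the sketch below computes empirical day-of-month weights for one transaction category and samples a synthetic date from them. It is a minimal example under assumed conditions: the file name, the column names (`category`, `date`) and the category label are placeholders, not the actual schema of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical schema: one row per transaction with 'category' and 'date' columns.
df = pd.read_csv("labelled_transactions.csv", parse_dates=["date"])

def day_of_month_weights(transactions: pd.DataFrame, category: str) -> pd.Series:
    """Empirical probability of each day of the month for one category."""
    days = transactions.loc[transactions["category"] == category, "date"].dt.day
    return days.value_counts(normalize=True).sort_index()

def sample_day(weights: pd.Series, rng: np.random.Generator) -> int:
    """Draw a synthetic transaction day according to the empirical weights."""
    return int(rng.choice(weights.index.to_numpy(), p=weights.to_numpy()))

rng = np.random.default_rng(0)
weights = day_of_month_weights(df, "Rent")  # 'Rent' is an assumed category label
synthetic_day = sample_day(weights, rng)
```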
An interesting tool that was found during this phase of work was the Python fitter package, which can fit the best-matching scipy statistical distribution to any list of numbers and return that distribution's name and parameters, from which synthetic data can then be sampled.
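A minimal sketch of how the fitter package can be used is shown below; the candidate distribution list and the stand-in input data are assumptions made for illustration, and the exact structure of the dictionary returned by `get_best` can vary slightly between fitter versions.

```python
import numpy as np
from fitter import Fitter

# Stand-in for a real list of transaction amounts.
amounts = np.random.gamma(shape=2.0, scale=50.0, size=1000)

# Restricting the search to a few candidate scipy distributions keeps the fit
# fast; by default fitter tries a much larger set.
f = Fitter(amounts, distributions=["gamma", "lognorm", "expon", "norm"])
f.fit()

# get_best returns a dict mapping the best distribution name to its fitted
# parameters, e.g. {"gamma": {"a": ..., "loc": ..., "scale": ...}}.
best = f.get_best(method="sumsquare_error")
dist_name, params = next(iter(best.items()))
```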
Transaction amount data is also very positively skewed, which makes it challenging to fit statistical distributions to it. To overcome this skewness, a Box-Cox transform was applied to all transaction amounts in the original dataset. To generate a synthetic transaction amount, we first perform a Box-Cox transform on the original transaction amount data and store the Box-Cox transform parameter. We then apply the fitter method to the transformed data to obtain the best-fitting distribution and its parameters, sample a random amount from that distribution, and finally apply an inverse Box-Cox transform to the sampled amount using the saved Box-Cox parameter.
This may seem like a tedious procedure, but it has a few benefits: the distributions and Box-Cox parameters only have to be calculated once for all transaction categories; the generated data very closely resembles the original transaction amount distributions; and the Box-Cox transform ensures that generated amounts are always positive, which avoids the model accidentally generating a negative transaction amount.
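A sketch of this amount-generation pipeline is given below. It assumes a recent fitter version in which `get_best` returns named scipy parameters, and the candidate distribution list is again only illustrative.

```python
import numpy as np
import scipy.stats as st
from scipy.special import inv_boxcox
from fitter import Fitter

def fit_amount_model(amounts: np.ndarray):
    """Box-Cox transform the (strictly positive) amounts, then fit a
    distribution to the transformed data.
    Returns (distribution name, parameters, Box-Cox lambda)."""
    transformed, lmbda = st.boxcox(amounts)
    f = Fitter(transformed, distributions=["norm", "gamma", "logistic"])
    f.fit()
    dist_name, params = next(iter(f.get_best(method="sumsquare_error").items()))
    return dist_name, params, lmbda

def sample_amounts(dist_name: str, params: dict, lmbda: float, n: int) -> np.ndarray:
    """Sample n amounts from the fitted distribution and map them back to the
    original scale with the inverse Box-Cox transform."""
    dist = getattr(st, dist_name)(**params)
    samples = dist.rvs(size=n)
    return inv_boxcox(samples, lmbda)
```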
The greatest challenge of this method was generating realistic transaction references for each transaction. This problem was addressed in a fairly crude way: a list of roughly 30 real transaction references was created for each transaction category. The model was then designed so that it builds a vocabulary of all words in the reference list for each category and, with a weighted probability, either picks one of the original 30 references as the reference for the transaction currently being generated or synthesises a new reference by randomly sampling a certain number of words from the vocabulary. This number of words is determined by sampling an integer from a normal distribution whose mean and standard deviation are calculated from the number of words per reference in the original dataset.
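The sketch below illustrates this reference-generation scheme for a single category. The reuse probability `p_reuse` is a hypothetical value, as the actual weighting used is not stated here.

```python
import random

def build_reference_generator(references: list[str], p_reuse: float = 0.7):
    """Create a reference generator for one transaction category.

    `references` is the list of ~30 real references for that category;
    `p_reuse` (an assumed value) is the probability of returning one of
    them verbatim instead of synthesising a new reference."""
    vocabulary = [word for ref in references for word in ref.split()]
    lengths = [len(ref.split()) for ref in references]
    mean_len = sum(lengths) / len(lengths)
    std_len = (sum((l - mean_len) ** 2 for l in lengths) / len(lengths)) ** 0.5

    def generate() -> str:
        if random.random() < p_reuse:
            return random.choice(references)
        # Otherwise build a new reference from the category vocabulary, with a
        # length drawn from a normal distribution fitted to the real references.
        n_words = max(1, round(random.gauss(mean_len, std_len)))
        return " ".join(random.choices(vocabulary, k=n_words))

    return generate
```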
A major benefit of this hand-coded method, in general, is that it is very quick to get up and running (assuming access to real data is available) and that it forces one to gain an understanding of how the distributions in the data are connected. For example, one learns that different types of transactions tend to happen around the same time of the month, or what size of transaction amounts can be expected for certain transaction categories. This is also one of the method’s drawbacks, though: one not only needs a clear understanding of the nuances in the data before starting to perform statistical fits, one also needs to know exactly what the output data should look like. A challenge that was encountered initially in this case, for example, was that data needs to be generated on a per-client basis and not as one large batch to be sifted through later. Every unique transaction category also has its own date and amount distributions, which quickly adds up to a large number of statistical fits. Our dataset had only roughly 40 transaction categories, and for a dataset with many more categories this would become extremely tedious. Another weakness, as mentioned, is that the synthetic data will only ever be as ‘good’ for training as the number of statistical parameters that one is able to come up with.
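To make the per-client, per-category structure concrete, the sketch below ties the previous pieces together. The `models` mapping and the helpers it relies on (`sample_day`, `sample_amounts` and the reference generators) refer back to the illustrative sketches above and are not the actual implementation.

```python
import numpy as np

def generate_client(models: dict, n_transactions: int,
                    rng: np.random.Generator) -> list[dict]:
    """Generate synthetic transactions for a single client.

    `models` maps each transaction category to a tuple of
    (date weights, amount model, reference generator), i.e. one set of
    fitted statistics per category as described above."""
    categories = list(models.keys())
    rows = []
    for _ in range(n_transactions):
        category = str(rng.choice(categories))
        date_weights, amount_model, make_reference = models[category]
        rows.append({
            "category": category,
            "day": sample_day(date_weights, rng),
            "amount": float(sample_amounts(*amount_model, n=1)[0]),
            "reference": make_reference(),
        })
    return rows
```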
Please see an example output for a given client below:
Figure 1: Synthetic client data generated using the original hand-coded model