Categorize Banking Transactions with Machine Learning

In a previous post I introduced my workflow for managing my spending using Python’s Pandas module. To extract meaningful insights from the banking transactions, it is crucial to categorize them. So far, I had to categorize every new transaction manually. However, since by now I have a fairly large dataset of already categorized transactions, I wanted to use Machine Learning to help me with that.

I knew in principle how Machine Learning works, but I had never actually applied it before. Since I wanted to derive the categories from the transaction subject texts, I specifically looked for applications in natural language processing. After some research I found an article about spam detection, which I could adapt to my problem.

The simplest method to derive meaning from language is the bag-of-words approach, where the frequencies of individual words are used to classify texts. But first, the transaction subject texts have to be cleaned up. This is what a typical subject looks like (all spaces exactly as shown):

Auftraggeber: PayPal (Europe) S.a r.l. et Cie, S.C.A. Buchungstext: 1 PP.9.PP . Urban Sp orts GmbH, Ihr Einkauf bei Urban Sp orts GmbH Ref. IB2/5

I use a series of commands to clean up the text:

import re

text = text.lower()                    # convert everything to lower case
text = re.sub('[^a-z ]', ' ', text)    # replace non-alphabetic characters with spaces
words = text.split()                   # split into words, discarding excess spaces
words = {word for word in words if len(word) >= 4}  # drop short words and duplicates
text = ' '.join(words)

First, all letters are converted to lower case. Then a regular expression replaces all non-alphabetic characters with spaces. Short “words” with fewer than four letters, excess spaces, and duplicate words are also removed. This is the result:

urban paypal buchungstext auftraggeber gmbh orts europe einkauf

There are still some errors, such as “orts” instead of “Sports” due to a misplaced space in the original subject text. But since Machine Learning is inherently statistical, this is good enough for now.
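For reference, the cleanup steps above can be collected into a small helper function (a sketch; the function name is my own), applied here to the example subject from above:

```python
import re

def clean_subject(text):
    """Reduce a raw subject text to a bag of lower-case words."""
    text = text.lower()
    text = re.sub('[^a-z ]', ' ', text)               # strip non-alphabetic characters
    words = {w for w in text.split() if len(w) >= 4}  # drop short words and duplicates
    return ' '.join(words)

subject = ("Auftraggeber: PayPal (Europe) S.a r.l. et Cie, S.C.A. "
           "Buchungstext: 1 PP.9.PP . Urban Sp orts GmbH, Ihr Einkauf "
           "bei Urban Sp orts GmbH Ref. IB2/5")
print(clean_subject(subject))
```

Note that the set comprehension makes the word order arbitrary, which does not matter for a bag-of-words model.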

The next step is to transform the subject texts into numbers that can be fed to a Machine Learning algorithm. This process is called vectorization. First, all distinct words in the data are determined and form a vocabulary. I only considered the 500 most frequent words across all the subject texts. Then, for every individual text, a vector is created whose elements are the frequencies of each vocabulary word in this particular text. So a typical vector for one subject text has a few entries that are 1 and zeros everywhere else.

I used the Python Machine Learning library scikit-learn, which conveniently offers this functionality:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=500)  # keep only the 500 most frequent words
X = vectorizer.fit_transform(texts)             # texts: list of cleaned subject strings

X is a matrix in which every row is the word frequency vector for one subject text. The Machine Learning algorithm also needs corresponding y-data, namely the matching categories for those subject texts. Then the classifier can be trained with this data:

from sklearn.ensemble import RandomForestClassifier

y = categories                         # one category label per subject text
classifier = RandomForestClassifier()
classifier.fit(X, y)

Finally, the classifier can be used to categorize new transactions. Their subject texts have to be cleaned up and vectorized in the same way as the training data:

X_new = vectorizer.transform([text])     # transform expects a list of documents
category = classifier.predict(X_new)[0]

Now, the transform method of the vectorizer is used instead of fit_transform. This is because the vocabulary from the training data should be used to vectorize the new texts rather than creating a new set of words.
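A quick illustration of this behavior (toy texts; words that are not in the training vocabulary are simply ignored):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit_transform(["urban sports gmbh", "rewe markt"])

# 'netflix' was never seen during fitting, so only 'gmbh' is counted;
# the vector still has one entry per training vocabulary word
X_new = vectorizer.transform(["netflix gmbh"])
print(X_new.toarray())
```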

In order to test the trained classifier, a part of the already categorized data is held out before the classifier is trained. This held-out data is treated as new and run through the classifier, and the predicted categories can then be compared to the real ones.
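With scikit-learn, this hold-out evaluation can be sketched as follows (the toy texts and categories here are made up; the real inputs are the cleaned subject texts and their manually assigned categories):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# hypothetical stand-ins for the real, already categorized transactions
texts = [
    "urban sports gmbh einkauf", "paypal urban sports gmbh",
    "rewe markt einkauf danke", "rewe supermarkt einkauf",
    "deutsche bahn fernverkehr ticket", "bahn ticket fernverkehr",
] * 5
categories = ["sport", "sport", "groceries", "groceries", "travel", "travel"] * 5

vectorizer = CountVectorizer(max_features=500)
X = vectorizer.fit_transform(texts)

# hold out a quarter of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, categories, test_size=0.25, random_state=0)

classifier = RandomForestClassifier(random_state=0)
classifier.fit(X_train, y_train)
accuracy = accuracy_score(y_test, classifier.predict(X_test))
print(accuracy)
```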

With this setup I achieved an accuracy of 70%. I could probably improve it with better pre-processing of the data, more sophisticated methods, or parameter tuning. But I am already quite happy with this result. Instead of manually defining all categories, I only have to correct the wrong ones. I created a command line interface to speed up the manual correction.

Since training the classifier takes less than a second, I retrain it every time I import new transactions. This way, I expect the accuracy to increase as I acquire more and more correctly labeled data.

In this post I made the code examples short and concise to make the process easier to follow. A full code example can be found in this notebook.
