How and why we built Kadabra: a deep learning algorithm for detection of form fields
Table of Contents
When it comes to customer-facing processes, enterprises know that they need to go digital. But for many traditional businesses, including insurance and banking, it is often easier said than done.
At EasySend, our mission is to transform the way financial service enterprises communicate with their customers. And we do this by transforming paperwork (which is still often handled with actual paper or PDF forms) into fully digital, responsive and user-friendly digital journeys.
Can we use AI to help our users convert any PDF into a digital process without technical knowledge, and most importantly, in a matter of minutes? It turns out we can.
Here is how we harnessed the power of AI to help us with our mission.
Building an AI with the focus on user-experience: the reasoning behind Kadabra
Why we decided against the black-box approach
When we set out to create our AI, our first approach was to build a completely stand-alone, “black-box” algorithm that simply solves the problem with no user interaction. Usually, when designing algorithms, this is considered the best possible option, right?
Initially, we felt like going down this road as well, but we quickly changed our minds because we felt that something fundamental was lacking. Namely – user interaction. In the end, we decided to do things a little differently and we built an algorithm that combines the power of AI with an interactive UI to keep the user engaged with the process.
User interaction as the core design principle behind Kadabra
I would argue that excluding the user entirely from the process has severe disadvantages when it comes to user experience. Especially when it comes to tinkering with core business processes. The user’s involvement with the algorithm is essential – this mantra became our core design principle.
Let’s be honest, AI algorithms always make a few mistakes here and there. The question is how the users are experiencing these mistakes. Our approach got us to build an engaging tool that gets the job done quickly, and at the same time doesn’t frustrate the user, and keeps the user in control. We believe that this approach creates better results than a completely humanless algorithm.
In the end, humans need to be a part of the process. We don’t need algorithms to fully replace humans, we build our algorithms to cooperate with human beings to enhance their experiences. This is the whole idea behind Kadabra.
The technical overview of EasySend Kadabra
Building the dataset and pre-processing
Our task was to develop a system that can quickly convert PDF documents into interactive user journeys with minimal effort on the part of the user. To tackle the challenge in front of us, we decided to leverage computer vision, machine learning, and deep-learning technologies.
The first thing we had to figure out was how to build a training set – a labeled data to feed our algorithm. We couldn’t use an existing dataset because our detection scenario isn’t a common one (so no cats and dogs here).
Luckily for us, field labeling is already an integral part of the PDF specification. We scraped the web for PDF files, and parsed them to search for tagged fields. If we found such fields, we added them to our dataset, which now included positions and types of form fields, already tagged and verified by humans. Excellent work, humans!
Next, we ran some basic image cleanups, which mostly included converting the images to grayscale and inverting the colors. The goal was to free the algorithm from generalizing certain aspects of the data, which we could just normalize beforehand. We tried other image-processing methods, but they didn’t seem to make a big difference.
Training the model: the good, the bad and the ugly
Our first approach was to naively feed all the images and labels to a deep neural network and train it to detect the positions and classify the types of fields it finds. Our input was an image of a PDF page, and the output would be all the fields that were detected, as well as their position, type and size.
We used a sliding-window approach to slice the large page into smaller patches, checked whether those patches had a field in them or not, and then we fed that data into the network. Of course, we had to balance the amount of field vs. non-field patches. The learning problem was: given a small image, does it contain a field, and if so, where?
We wanted our network to learn: (1) if a field exists in a given image, and (2) if it exists, what are the boundaries of the field.
What is our algorithm learning? The green box is the target; the redbox is the AI making detections and improving upon itself.
This approach seemed promising at first, but we ran into several issues:
- Fields come in many different sizes, so we needed multiple patch sizes. As a result, the training and the running of the algorithm was pretty slow and we could never really find the sweet spot. We didn’t want our users to wait too long for the processing of each page.
- The algorithm was learning very fast initially, but we couldn’t reach the accuracy we wanted, resulting in detections that, even if correct, were always just “slightly off”. In most object detection scenarios this would not be an issue, but our use-case requires pixel-perfect accuracy. Every error is very clearly noticed. We were not alone in facing this issue, Amazon Textract seems to be running into similar problems.
When we thought about our next step, we realized that we are spending a lot of expensive computation time on detecting the exact position of the field, and we were forcing our algorithm to learn both position and classification. We knew that there are deep-learning algorithms that are designed for boundary detection, but we wanted a simpler, less black-boxed solution. We ended up taking a classic approach to search for candidates, followed by simple yes-field/no-field classification.
Search & classify
Our problem space is unique because we roughly know what a field looks like – it is a box, or a line, or a dotted line, or something like that. It has some geometric properties that we can always find in a field, and the great computer vision algorithms were already invented to find those shapes.
So we used OpenCV to find lines and rectangles and tuned our code until we detected every shape that existed in the original dataset. Of course, we also got many more lines and rectangles which were not fields, but this was intended – now we wanted our algorithm to learn how to distinguish between these two groups. Our original learning problem was changed to: given an image of a line or a rectangle, is it a form field? This is a much simpler problem to train for, and it will never fail us in terms of pixel-accuracy.
The importance of the user-experience
We now wanted to tune the algorithm to create the best possible user experience. We know that it is much easier for a user to remove a falsely-detected field than it is to mark a non-detected field using drag & drop functionality. So, we tweaked the algorithm to generate more predictions, even if some of them were false because this still improved the ease of use for the end-user. These kinds of thoughts got us to our next task. We were still just at the beginning of project Kadabra.
In the next article, I will go into more details about our progress and the lessons we learned along the way.