DATA EXTRACTION

A lot of information is still exchanged on paper, in the form of structured or unstructured documents. Either way, extracting information from such documents is a tedious and costly task for humans. For example, companies receive both digital and scanned invoices, which employees process by picking out a few important fields (e.g. the invoice number, total sum, due date) and typing them into a computer system. Some studies report that more than 80% of documents exist as hard copies or scans. There is therefore a strong push to automate these processes as far as possible.

One of the largest challenges is building a large, clean dataset. Documents and invoices contain personal and business data, which raises data privacy issues. As a result, few open-source datasets and models are available, and an in-house dataset is essential.

The goal of this project was to build an AI-based algorithm that detects and extracts contact information from unstructured documents. The project covered typed documents only; handwritten documents were out of scope. The algorithm was also meant to assign each detected contact one of the defined roles (e.g. sender, receiver, mediator). One of the largest efforts was the creation of the dataset (collection and labeling).
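
To make the task concrete, here is a minimal Python sketch of the input and output such a system deals with. It is not the project's model: the source does not describe the architecture, so a simple regular expression stands in for the detection step, and the role is left empty where a trained classifier would fill it in. The names Role, Contact, and find_email_candidates are hypothetical and used only for illustration.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Role(Enum):
    # Example roles named in the text; the real label set is an assumption.
    SENDER = "sender"
    RECEIVER = "receiver"
    MEDIATOR = "mediator"


@dataclass
class Contact:
    text: str                    # the extracted string
    start: int                   # character offset where it begins
    end: int                     # character offset where it ends (exclusive)
    role: Optional[Role] = None  # would be set by a role classifier


# Naive stand-in for the detection step: matches e-mail-like strings only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_email_candidates(text: str) -> List[Contact]:
    """Return e-mail-like spans found in plain document text."""
    return [Contact(m.group(), m.start(), m.end()) for m in EMAIL_RE.finditer(text)]
```

Storing character offsets alongside the text keeps each extracted contact traceable to its position in the document, which is also the form a labeled training example for this kind of task would typically take.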

Tools used: Python, TensorFlow