Business analytics with CouchDB and NoFlo
The purpose of business analytics is to find data from the company's information systems that can be used to support decision making. What customers buy most? What do they do before a buying decision? What are the signs that a customer may be leaving?
For the last month we've been working in Salzburg to build such a system, the Intelligent Project Controlling Tool needed for running large collaborative research projects like IKS. Since the design we went with can be reused for other business analytics needs, I wanted to write a bit about it.
Where does the data come from?
There are many ways to gather business data. Often the information systems already contain the data needed. But it may also be hidden in a jungle of spreadsheets. Or maybe some data is simply not available, and has to be filled in manually.
Handling all these cases in one system is a tricky question. To solve it, we went with a two-layered strategy:
- All data used for analytics is stored as Linked Data in a CouchDB system
- NoFlo workflows are used for gathering data from the diverse sources and convert it to the format needed
In IKS's case, much of the data was available in a series of spreadsheets. With these, we built the necessary workflows for first converting the spreadsheets into XML with Apache Tika, and then extracting the information from them in a sensible subset of JSON-LD.
Because IKS is a collaborative project, information needs to be gathered from a diverse group of partner organizations. Some of them have systems that provide the needed APIs (like Basecamp, which we use), and we can just periodically import the data. But with many we decided on a simple data interchange approach: spreadsheets handled over email.
In this approach, user files a data request into the system. This gets picked up by NoFlo, which sends an email with the appropriate spreadsheet template to the partner. Then it starts waiting for a reply. When a reply arrives, it extracts the data from the attached spreadsheet and imports it to the system.
Our NoFlo processes are mostly initiated by the CouchDB change notification API. We keep them running persistently using forever Node, so whenever some operation needs to be run it happens nearly immediately.
Ensuring data consistency
With any automation, and especially with the email-based data interchange, things can go wrong. Because of this we tag all data that we receive with its origin, whether it was some automated operation or an imported spreadsheet. These origins are called execution documents. Users can browse all completed workflow executions and see what data came in from them. These can then be either accepted or rejected.
This way if some partner accidentally sends faulty data, or something else breaks, the incorrect information received can be easily removed. CouchDB's versioning capabilities help here.
Analyzing the data
If you figure out a new way to look at the data you have, simply write the needed map and reduce functions and save them into the database. CouchDB will then run them against existing data and produce nu
Truncated by Planet PHP, read more at the original (another 2738 bytes)