

A Data Science Architecture

Data Science Environment

This case study describes a data science environment that we established for a client wishing to get the most from their data.  They had both internal and external data, and they wanted to marry insights from many different sources.  The corporate market research team led the project, which needed to support both established business models and ad-hoc experimentation with the data.

We started by considering the three main pillars by which we do our work:

  • We start with Data Platforms: How do we load and store our data?
  • Next we consider Data Analysis: How do we run algorithms and search for insights?
  • Finally, Data Reporting: How do we visualize the results from our work?

The task was demanding and resulted in a data science architecture with many tools. We developed these tools specifically for our client’s needs.



Data Platform

First Part of the Data Science Environment: The Data Platform

In this case we opted for a reasonably standard data storage solution.  We selected Azure cloud technologies, which gave us both flexibility and the ability to grow without worrying about hardware procurement and hosting.  Incoming data was held in BLOB storage.  BLOB stands for Binary Large Object, and it is a very simple and cheap way to store unstructured data of any type; you can read more in the Wikipedia article on Binary Large Objects.
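
To illustrate how a load script might land a file in BLOB storage from R, here is a minimal sketch using the AzureStor package; the account URL, access key and container name are placeholders rather than the client’s actual configuration.

```r
# Minimal sketch: landing a raw extract in Azure BLOB storage from R.
# The account URL, access key and container name are placeholders.
library(AzureStor)

endp <- storage_endpoint("https://exampleaccount.blob.core.windows.net",
                         key = "<access-key>")
cont <- storage_container(endp, "raw-data")

# Upload an extract received from an external provider
storage_upload(cont, src = "sales_extract.csv", dest = "external/sales_extract.csv")

# A downstream load script can later pull the blob back down for processing
storage_download(cont, src = "external/sales_extract.csv", dest = "sales_extract.csv")
```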

For our ETL processes we decided to write load scripts in R and store them in a source code repository.  We chose R because it is a flexible language that can handle everything from simple manipulation to highly complex data preparation, which means a great many tasks can be performed at the time of loading data.  Here are some examples of tasks performed at load time:

  • Outlier detection and replacement
  • Noise reduction
  • Polynomial smoothing
  • Detection and extraction of the underlying seasonality

The beauty of a language such as R is that there are a great many libraries available with examples of algorithms. Basing our work on previous research, we were able to customize the data cleaning tools specifically for our client.
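
To make the load-time cleaning concrete, here is a minimal sketch in base R of the kind of routine a load script might call.  The IQR outlier rule, the loess span and the monthly seasonality are illustrative assumptions, not the client’s actual parameters.

```r
# Illustrative sketch of load-time cleaning steps; thresholds and the
# monthly frequency are assumptions, not the client's actual rules.
clean_series <- function(x, frequency = 12) {
  # 1. Outlier detection and replacement (simple IQR rule)
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * diff(q)
  is_outlier <- x < q[1] - fence | x > q[2] + fence
  x[is_outlier] <- median(x, na.rm = TRUE)

  # 2. Noise reduction / polynomial smoothing with loess
  t <- seq_along(x)
  smoothed <- predict(loess(x ~ t, degree = 2, span = 0.3))

  # 3. Extract the underlying seasonality with an STL decomposition
  decomp <- stl(ts(x, frequency = frequency), s.window = "periodic")
  seasonal <- as.numeric(decomp$time.series[, "seasonal"])

  data.frame(original = x, smoothed = smoothed, seasonal = seasonal)
}
```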

Master Data Services

Master Data

We then considered how we might manage and control Master Data.  This is the data that defines how the business functions.  It is not usually transactional data; instead it is hierarchies of metadata covering things such as customers, suppliers and reporting geographies.

Managing master data is a science in itself.  You can read more in the Wikipedia article: Master Data Management.

For this particular project we installed the Microsoft product Master Data Services (“MDS”).  The system is hosted in a SQL database and offers both a structure for storing a business’s master data and a series of control processes.  MDS was vital here, as it allowed us to map many different metadata descriptions to a single value.  Having many different sources of data had compounded the problem, but MDS allowed us to map everything to one consistent set of business rules.
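
The idea of resolving many source descriptions to a single master value can be sketched in R as below.  In the real solution the mapping is maintained in MDS and read from the database; every name in this snippet is a hypothetical stand-in.

```r
# Illustration only: mapping source-specific descriptions onto one master value.
# In practice the mapping table is maintained in MDS; the names are hypothetical.
library(dplyr)
library(tibble)

master_map <- tribble(
  ~source_system, ~source_value,  ~master_customer,
  "CRM",          "ACME Ltd",     "ACME",
  "Panel",        "Acme Limited", "ACME",
  "Survey",       "ACME (UK)",    "ACME"
)

loaded <- tribble(
  ~source_system, ~source_value, ~sales,
  "CRM",          "ACME Ltd",    120,
  "Survey",       "ACME (UK)",   45
)

# Every source description resolves to the same consistent master value
mapped <- left_join(loaded, master_map, by = c("source_system", "source_value"))
```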

Control Framework

In order to strengthen our handling of master data we also developed a series of exception reports to help with data loading.  In particular, we report unknown metadata so that we can investigate and produce a new business rule to cater for the new data.
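
A minimal sketch of such an exception report is shown below; the table and column names are again hypothetical stand-ins for the real load tables.

```r
# Sketch of an exception report for unknown metadata; names are hypothetical.
library(dplyr)
library(tibble)

loaded <- tibble(source_system = c("CRM", "Survey"),
                 source_value  = c("ACME Ltd", "New Brand X"))

master_map <- tibble(source_system = "CRM",
                     source_value  = "ACME Ltd",
                     master_value  = "ACME")

# Rows with no master mapping yet are flagged for investigation, so a new
# business rule can be added before the data is accepted into the models.
unknown_metadata <- anti_join(loaded, master_map,
                              by = c("source_system", "source_value"))
```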


Data Analysis

Second Part of the Data Science Environment: Data Analysis Engine

For analyzing the data we implemented a variety of technologies.  The core work was again done using R, with the R programs playing a number of different roles:

  1. Data Orchestration: R scripts control the movement of data to support established business models and reports.
  2. Ad-Hoc Analysis: R also helps researchers to perform experiments on the data.  To help with this we developed a library of routines that simplify certain tasks and algorithms.
  3. Data Models: The source data fed a set of established market models.  The mathematics behind each model was captured in R, so that a model refresh became a simple matter of running a parameter-driven script (see the sketch after this list).
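
The parameter-driven refresh in item 3 might look roughly like the sketch below.  The helper functions load_model_inputs() and publish_results(), the parameters and the simple lm() fit are illustrative stand-ins, not the client’s actual model library.

```r
# Hypothetical shape of a parameter-driven model refresh.
# load_model_inputs() and publish_results() are assumed helpers; the real
# models were considerably more involved than the simple lm() shown here.
refresh_model <- function(params) {
  data <- load_model_inputs(market = params$market, period = params$period)
  fit  <- lm(params$formula, data = data)
  publish_results(fit, target = params$output_table)
  invisible(fit)
}

# A scheduled refresh then reduces to calling the script with new parameters
refresh_model(list(
  market       = "UK",
  period       = "2019-Q4",
  formula      = sales ~ price + promotion,
  output_table = "model_results_uk"
))
```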

OLAP Cubes

In addition to R, we established OLAP cubes to serve the loaded and structured data.  The cubes were also programmed with security mechanisms to control who could see the data.  Joining these cubes to reports in PowerBI gave us a mechanism for distributing the results of the business models and other metrics to a very wide audience.


Data Reporting

Third Part of the Data Science Environment: Data Reporting

The third part of the architecture was built for data visualization.

PowerBI

We used the cloud-based PowerBI platform for the majority of the client’s reporting needs.  The platform was easy to link to our OLAP cubes, and also to link directly to the data lake, where we could pick up processed data that was not so structured as to demand a cube.

The PowerBI platform has an excellent programming interface which allowed us to build custom widgets for specific visualization needs.  For example, our client used probability distributions in their work, and we were able to develop custom charts to show observed data against the fitted probability curve.
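
As a rough illustration of the calculation behind such a chart, the sketch below fits a distribution to simulated data and overlays the fitted curve in base R; the custom PowerBI visual rendered the equivalent view interactively, and the gamma distribution here is only an example.

```r
# Sketch of the calculation behind an "observed vs fitted" chart.
# Simulated gamma data stands in for the client's observations.
library(MASS)

set.seed(42)
observed <- rgamma(500, shape = 2, rate = 0.5)
fit <- fitdistr(observed, densfun = "gamma")

hist(observed, breaks = 30, freq = FALSE,
     main = "Observed data vs fitted probability curve", xlab = "Value")
curve(dgamma(x, shape = fit$estimate["shape"], rate = fit$estimate["rate"]),
      add = TRUE, lwd = 2)
```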

Azure ML

We were also able to connect the data stored in the lake to the cloud-based Azure ML platform.  This was not part of the core platform, but it was an excellent addition that gave the researchers access to additional tools in a visual environment.
