This case study describes a data science environment we established for a client who wanted to get the most from their data. They held both internal and external data, and they wanted to marry insights from many different sources. The project was led by the corporate market research team and needed to support both established business models and ad-hoc experimentation with the data.
We started by considering the three main pillars by which we do our work: data storage, data analysis and data visualization.
The task was complex and resulted in a data science architecture with many tools, which we developed specifically for our client's needs.
In this case we opted for a reasonably standard data storage solution. We selected Azure cloud technologies, which gave us both flexibility and the ability to grow without worrying about hardware procurement and hosting. New data landed in BLOB (Binary Large Object) storage, a very simple and cheap way to store unstructured data of any type. You can read more in the Wikipedia article on Binary Large Objects.
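As a rough sketch of how raw files might land in BLOB storage (shown in Python for brevity, using the `azure-storage-blob` SDK; the container name, connection string and date-partitioned path convention are our own illustrative choices, not the client's actual layout):

```python
# Sketch: dropping a raw source file into Azure BLOB storage under a
# date-partitioned path. Path convention and names are hypothetical.
from datetime import date

def blob_path(source: str, filename: str, day: date) -> str:
    """Build a date-partitioned blob name so raw drops stay organised."""
    return f"raw/{source}/{day:%Y/%m/%d}/{filename}"

def upload_raw_file(conn_str: str, container: str, source: str,
                    local_path: str, day: date) -> str:
    """Upload one local file to blob storage and return its blob name."""
    # pip install azure-storage-blob
    from azure.storage.blob import BlobServiceClient
    service = BlobServiceClient.from_connection_string(conn_str)
    name = blob_path(source, local_path.rsplit("/", 1)[-1], day)
    with open(local_path, "rb") as fh:
        service.get_blob_client(container=container, blob=name).upload_blob(fh)
    return name
```

Partitioning blob names by source and date keeps the "anything goes" nature of blob storage manageable when many feeds arrive daily.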
For our ETL processes we decided to write load scripts in R and store them in a source code repository. We chose R because it is a flexible language that can handle everything from simple manipulation to highly complex data preparation, which meant we could perform a great many complex tasks at the time of loading the data.
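A minimal sketch of the kind of cleaning a load script can do (illustrated in Python rather than the client's R; the column names and validation rules are hypothetical):

```python
# Sketch: parse a raw CSV extract at load time, normalising values and
# setting bad rows aside rather than letting them poison the load.
import csv
import io

def load_sales_rows(raw_csv: str):
    """Return (cleaned rows, rejected rows) for a raw CSV extract."""
    good, rejected = [], []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        try:
            cleaned = {
                "region": row["region"].strip().upper(),   # normalise text
                "units": int(row["units"]),                # enforce types
                "revenue": round(float(row["revenue"]), 2),
            }
        except (KeyError, ValueError):
            rejected.append(row)   # kept for later exception reporting
            continue
        good.append(cleaned)
    return good, rejected

raw = "region,units,revenue\n north ,10,125.50\nsouth,oops,99.9\n"
good, rejected = load_sales_rows(raw)
```

Keeping the rejects instead of silently dropping them is what makes the later exception reporting possible.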
The beauty of a language such as R is that a great many libraries are available implementing published algorithms. Building on this previous research, we were able to customize the data cleaning tools specifically for our client.
We then considered how we might manage and control master data. This is the data that defines how the business functions. It is not usually transactional data but rather hierarchies of metadata describing entities such as customers, suppliers and reporting geographies.
Managing master data is a science in itself. You can read more in the Wikipedia article: Master Data Management.
For this particular project we installed the Microsoft product Master Data Services ("MDS"). The system is hosted in a SQL Server database and offers both a structure for storing a business's master data and a series of control processes. For this project it was vital, as it allowed us to map many different metadata descriptions to a single canonical value. Having many different sources of data had compounded the mapping problem, but MDS allowed us to map everything to one consistent set of business rules.
To strengthen our handling of master data, we also developed a series of exception reports to support data loading. In particular, we report unknown metadata so that we can investigate it and produce a new business rule to cater for the new data.
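Such an exception report can be sketched as follows (Python for illustration; the field names and values are hypothetical):

```python
# Sketch: count metadata values that have no master-data mapping yet,
# most frequent first, so an analyst can write new business rules.
from collections import Counter

def unknown_metadata_report(rows, known, key="customer"):
    """Return (value, count) pairs for unmapped values of `key`."""
    counts = Counter(r[key] for r in rows if r[key] not in known)
    return counts.most_common()

rows = [{"customer": "ACME"}, {"customer": "NewCo"},
        {"customer": "NewCo"}, {"customer": "Globex"}]
report = unknown_metadata_report(rows, known={"ACME", "Globex"})
```

Sorting by frequency means the rules that unblock the most rows get written first.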
For analyzing the data we implemented a variety of technologies. The core work was again done in R, with the R programs filling a variety of different roles.
In addition to R, we established OLAP cubes to serve the loaded and structured data. The cubes were also programmed with security mechanisms to control who could see the data, and connecting them to reports in PowerBI gave us a mechanism for distributing the results of the business models and other metrics to a very wide audience.
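The two ideas the cubes combine, pre-aggregation by dimension and per-user security, can be sketched as follows (Python for illustration; the roles, dimensions and figures are hypothetical, and the real cubes enforce this inside the OLAP engine):

```python
# Sketch: aggregate a fact table by one dimension, restricted to the
# slices a given role is allowed to see. All names are hypothetical.
from collections import defaultdict

ROLE_REGIONS = {"emea_analyst": {"EMEA"}, "global_lead": {"EMEA", "AMER"}}

def aggregate_for_user(facts, role):
    """Sum revenue per region, limited to the regions the role may see."""
    allowed = ROLE_REGIONS.get(role, set())
    totals = defaultdict(float)
    for f in facts:
        if f["region"] in allowed:
            totals[f["region"]] += f["revenue"]
    return dict(totals)

facts = [{"region": "EMEA", "revenue": 100.0},
         {"region": "AMER", "revenue": 250.0},
         {"region": "EMEA", "revenue": 50.0}]
```

An unknown role sees nothing, which is the safe default for widely distributed reports.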
The third part of the architecture was built for data visualization.
We used the cloud-based PowerBI platform for the majority of the client's reporting needs. The platform was easy to link to our OLAP cubes and also directly to the data lake, where we could pick up processed data that was not so structured as to demand a cube.
The PowerBI platform has an excellent programming interface which allowed us to build custom widgets for specific visualization needs. For example, our client used probability distributions in their work, and we were able to develop custom charts showing observed data against the fitted probability curve.
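The calculation behind such a chart can be sketched as follows (Python for illustration; the client's actual fitting and charting lived in their R scripts and PowerBI custom visuals, and the normal distribution here is just an example):

```python
# Sketch: fit a normal distribution to observed data by maximum
# likelihood, giving a fitted curve to overlay on a histogram.
import math
import random
import statistics

random.seed(42)
# Stand-in for observed data; the real data came from the lake.
observed = [random.gauss(5.0, 2.0) for _ in range(10_000)]

mu = statistics.fmean(observed)      # MLE for the mean
sigma = statistics.pstdev(observed)  # MLE for the standard deviation

def fitted_pdf(x):
    """Density of the fitted normal, evaluated at x for plotting."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
```

The chart then simply draws `fitted_pdf` over the observed histogram, so researchers can judge the fit by eye.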
We were also able to connect the data stored in the lake to the cloud-based Azure ML platform. This was not part of the core platform but was an excellent addition, giving the researchers access to additional tools in a visual environment.