
Modern Techniques to Store Your Data

Great data science starts with great data. Unfortunately, you cannot simply jump in and look for useful insights: first you need a robust architecture to prepare and store your data. The best insights come from joining different sources of data, and your architecture must allow you to do this with ease.

Good data deserves good preparation

Before we can perform a great analysis we must prepare the data. There are many techniques we can apply, but some tasks recur in almost every project: data cleaning, mapping to business rules, removing outliers, and smoothing trends.
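
The two most common of these tasks, outlier removal and smoothing, can be sketched in a few lines. The example below is an illustrative Python sketch (the thresholds and sample figures are invented; the same steps map directly to R):

```python
from statistics import mean, stdev

def remove_outliers(values, z=2.0):
    """Drop points more than z standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= z * s]

def smooth(values, window=3):
    """Simple moving average to smooth out a noisy trend."""
    return [mean(values[max(0, i - window + 1):i + 1])
            for i in range(len(values))]

daily_sales = [10, 12, 11, 13, 500, 12, 14]   # 500 is a data-entry error
clean = remove_outliers(daily_sales)          # the 500 is filtered out
trend = smooth(clean)                         # remaining noise is damped
```

In practice the z-threshold and window size are business decisions, which is exactly why these routines belong in a reviewed, version-controlled library rather than ad hoc scripts.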

Consider the size of your data and have room for future growth

There is a lot of confusion about what constitutes big data. Even a modest computer can process huge amounts of information, yet there is a tendency to assume that a project will need big data techniques. This can be very inefficient and can cost a lot of money.

Master Data Is Important

Master Data is a key that will unlock a lot of business problems. For example, it helps you to join internally sourced data to data that you purchase. Furthermore, it can make the results far easier to understand because they will represent the business's view of the world. Controlling Master Data will also warn you about unexpected changes in your information before they cause an issue.
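
The join described above works because master data maps every source's local code onto one agreed identifier. A minimal Python sketch, with invented codes and figures, shows the idea:

```python
# Hypothetical master data table: each source's local product code
# maps to one agreed master identifier.
master = {
    "INT-001": "SKU-1001",   # code used by the internal sales system
    "VND-77":  "SKU-1001",   # code the data vendor uses for the same product
}

def join_on_master(internal_rows, external_rows, code_map):
    """Join two sources that use different local codes for the same product."""
    by_sku = {code_map[r["code"]]: dict(r) for r in internal_rows}
    for r in external_rows:
        sku = code_map.get(r["code"])
        if sku in by_sku:                 # join only products we recognise
            by_sku[sku]["market_units"] = r["market_units"]
    return by_sku

joined = join_on_master([{"code": "INT-001", "units": 120}],
                        [{"code": "VND-77", "market_units": 4800}],
                        master)
```

Without the master table the two sources share no common key; with it, internal sales and the purchased market panel line up row for row.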

The Data Platform

Choosing and developing a great data platform is not an easy task. We need to consider many variables. We can only make a decision after careful consideration. Unfortunately, making the wrong choice can be a very expensive mistake, and can be very difficult to put right. Here are just some of the factors which will influence your final decision:

  1. How does the business store data at present?
  2. Can the current mechanisms support the business’s data science strategy?
  3. Should we separate the data science repository from the legacy systems?
  4. Will we resort to buying data from an outside supplier and merging this data with our internal information?
  5. Where is our business on the data science evolutionary scale?
  6. Do we have data scientists who can work with tricky data? If not, we may need to design simple-to-use tools for our teams.

Our Recommended Approach

JTA understands that selecting or developing a data platform specifically for data science is not easy. For this reason, we employ specialists in data platforms who understand the engineering requirements behind the decisions. Incidentally, you can read more about how our teams are structured on our How We Do It page.

We recommend developing all the data loading and data preparation tasks using an appropriate tool; specifically, one that is flexible and easy to review. We often choose R for this type of work: it handles all the basic tasks of data loading, and R scripts are easy to read and to keep under source control. When a business needs to do something more complex with its data, R also rises to the task, giving access to many excellent open source libraries and the freely available knowledge behind them. Indeed, we recommend tailoring this open source knowledge to your business and adding it to an internal library of data preparation code.

Having a code library allows a business to apply much more sophisticated data preparation with ease: weighting your data to be more representative of the market, recognizing and separating the effects of seasonality, and filtering outliers and other noise from the information, among other statistical processes.
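
Weighting, the first example above, is simply rescaling each segment so the sample matches known market proportions. A minimal sketch in Python (the segment names and shares are invented; an R equivalent is a one-liner):

```python
def reweight(sample_shares, market_shares):
    """Weight each segment so the sample matches known market proportions."""
    return {seg: market_shares[seg] / sample_shares[seg]
            for seg in sample_shares}

# A survey that over-represents the north relative to the real market.
sample = {"north": 0.5, "south": 0.5}
market = {"north": 0.3, "south": 0.7}
weights = reweight(sample, market)
# Each northern response now counts 0.6 times and each southern one
# 1.4 times, so weighted totals reflect the market, not the sample.
```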

Control Frameworks

When data moves or changes there is risk. Consequently, a business must not only understand this risk but be able to control it. Broadly speaking, there are two types of control that we consider. Firstly, there are prevent controls, which stop errors before they occur. We prefer these, but on their own they are often not sufficient. Secondly, there are detect controls, which work in conjunction with the prevent controls: they don't stop an error from happening, but they can detect when something has gone wrong and warn us.
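
The two classes of control can be sketched around a simple load step. This is an illustrative Python sketch, not a production framework; the rules and names are invented:

```python
def load_rows(rows, sink):
    """Prevent control: reject a bad row before it ever reaches the store."""
    for row in rows:
        if row.get("units") is None or row["units"] < 0:
            raise ValueError(f"row rejected before load: {row}")
        sink.append(row)

def reconcile(expected_total, sink):
    """Detect control: after the load, check that the totals still agree."""
    loaded = sum(row["units"] for row in sink)
    if loaded != expected_total:
        raise RuntimeError(f"loaded {loaded} units, expected {expected_total}")

store = []
load_rows([{"units": 1}, {"units": 2}], store)   # prevent control passes
reconcile(3, store)                              # detect control passes
```

The validation stops bad data at the door; the reconciliation catches anything that slipped past it, such as rows lost in transit.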


The Secret Recipe to Success

  1. Control Framework. Build in controls that prevent errors from happening, or that detect when errors or other irregularities have occurred.
  2. Master Data. Being able to store and control your master data is essential.
  3. Use New Technology. Don't assume a database is always the best solution; investigate modern options such as data lakes.

Frequently Asked Questions

What is ETL?

ETL is the process we use to load data into a store: a series of steps that collect data and transform it according to business rules. These are the three steps:

  1. Extraction. In other words, taking data from the source systems and importing it into a staging area. Each data source has its own set of characteristics that need to be managed.
  2. Transformation. In other words, cleaning and other procedures applied to the data to obtain accurate, complete, and unambiguous data.
  3. Loading. In other words, data is written from the staging area into the databases or warehouses.
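
The three steps above can be sketched end to end in a few lines. This is an illustrative Python sketch with invented sample data, using SQLite as a stand-in for the warehouse (at JTA the same pipeline would typically be written in R):

```python
import csv
import io
import sqlite3

# Extract: read raw rows exported from a hypothetical source system.
raw = io.StringIO("product,units\nWidget , 12\nwidget,3\n")
rows = list(csv.DictReader(raw))

# Transform: apply business rules in the staging area - trim whitespace,
# normalise product names, and parse the numbers.
staged = [{"product": r["product"].strip().lower(),
           "units": int(r["units"].strip())} for r in rows]

# Load: write the cleaned rows from staging into the store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (product TEXT, units INTEGER)")
db.executemany("INSERT INTO sales VALUES (:product, :units)", staged)
total = db.execute("SELECT SUM(units) FROM sales").fetchone()[0]
```

Note how the two spellings of the product name are reconciled during the transform step, so the loaded table can be aggregated cleanly.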

Why use ETL?

ETL is the most effective approach to provide fast access to information. It allows organizations to analyze data that resides in multiple locations in a variety of formats. It increases efficiency and drives better business decisions.

What is the Best Tool for ETL?

There are several tools available. However, at JTA we believe that using R programming instead of classic ETL tools provides significantly better data manipulation and is more efficient.

You might be interested to read the Wikipedia article on ETL.

If you would like to know some more then read about How JTA The Data Scientists does its work or have a look at some other FAQs.

You could also explore our case studies or whitepapers.

Big Data is mentioned a lot. What exactly is it?

Big Data is more than just a large volume of data. It is a technology that allows you to capture, store, process, analyze and discern value. For example, Big Data allows one to acquire new knowledge at high speed.

The main characteristics inherent in Big Data are volume, variety and velocity. We call these three characteristics the three Vs:

  • Volume refers to the quantity of generated and stored data
  • Variety refers to the type and nature of the data, and
  • Velocity refers to the high speed at which the data is processed

However, some researchers claim that the three Vs are too simplistic a view of the concept. Possible additional Vs are:

  • Veracity, which refers to the quality and reliability of the data, and
  • Value, which refers to the economic value of the data

All industries have applications for big data.


Can we use databases for data science?

Data science uses databases, but there are other, more modern options, such as data lakes and data warehouses. It can be confusing to know which to choose.

The main difference between a warehouse, a lake and a database is easy to explain. A relational database stores and organizes structured data from a single source, such as a transactional system. By comparison, data warehouses hold structured data from multiple sources. Data lakes differ from both in that they store unstructured, semi-structured and structured data.

Additionally, databases are strictly controlled.  They have to be like this to guarantee that they don’t make mistakes in processing transactions.  For example, a database must always be able to reverse a transaction and, in the event of a power failure, recover perfectly.  These are great features but they add complexity to the system.  When we experiment with data we don’t want this complexity as it can slow down the work.  Lakes are much less controlled.

Relational databases are easy to build. However, relational databases don’t support unstructured data, or the vast amount of data being generated today.   Hence the emergence of the data warehouse and data lake options.

We still need databases for data science, however. For example, in JTA we use databases to store Master Data and to help us with data cleaning.  We also store the nicely structured output in a database before we generate reports.




Our Latest Testimonials

The online solutions provided by JTA are very powerful and dynamic. Working with JTA is always a pleasure and a trouble-free experience.  Consequently, our relationship with JTA is very collaborative.  JTA always handles any challenges promptly and without stress.

Raymond Piombino, Founder, Bordeaux Consultants International

JTA can be trusted to get the job done right, on budget, and on time. I literally NEVER worry about errors when working with them – they are very thorough, have an excellent eye for detail, and are pleasant to work with. I look forward to working with them again in the future.

Jim McDonell, Analyst

Which industries use our innovative approach to Data Platforms?

We work across several sectors providing data solutions that inform and enhance your business. Whether it’s confidential financial information, life-saving medical records or software to improve the gaming experience of your users, our dedicated team are agile enough to deal with all needs.

Enquiry

See how we can make your data speak
