

What is ETL?

ETL is the process we use to load data into a store. It is a series of steps that collects data from source systems and transforms it according to business rules. These are the three steps:

  1. Extraction. Data is taken from the source systems and imported into a staging area. Each data source has its own set of characteristics that need to be managed.
  2. Transformation. Cleaning and other procedures are applied to the data to make it accurate, complete, and unambiguous.
  3. Loading. The data is written from the staging area into the target databases or warehouses.
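The three steps can be sketched in a few lines of Python. The source rows and business rules below are invented for illustration, and an in-memory SQLite database stands in for the target store:

```python
import sqlite3

# Extract: pull raw rows from a (hypothetical) source system into a staging list.
raw_rows = [
    {"id": 1, "name": " Alice ", "amount": "100.50"},
    {"id": 2, "name": "Bob",     "amount": "not-a-number"},
    {"id": 3, "name": "Carol",   "amount": "42.00"},
]

# Transform: apply business rules -- trim names, coerce amounts, reject bad records.
def transform(rows):
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop records that fail validation
        clean.append((row["id"], row["name"].strip(), amount))
    return clean

# Load: write the staged, cleaned data into the target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, name TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", transform(raw_rows))
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # 2 valid rows loaded
```

In a real pipeline each step would of course be far richer, but the shape is the same: extract into staging, transform by rule, load into the store.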

Why use ETL?

ETL is the most effective approach to providing fast access to information. It allows organizations to analyze data that resides in multiple locations and in a variety of formats. It increases efficiency and drives better business decisions.

What is the Best Tool for ETL?

There are several tools available. However, at JTA, we believe that using R programming instead of classic ETL tools provides significantly better data manipulation and is more efficient.

You might also be interested to read the Wikipedia article on ELT, a related approach in which transformation happens after the data is loaded into the target system.

If you would like to know some more then read about How JTA The Data Scientists does its work or have a look at some other FAQs.

You could also explore our case studies or whitepapers.

What is a Data Repository?

A data repository is a collection of databases that manage and store varying data sets for analysis, sharing and reporting.

There are many different ways to store data that could all be described as a data repository, including databases, data warehouses, and data lakes.


What is a Data Warehouse?

A data warehouse is a repository of information.  It can hold logs, internal data or external data.  The records represent events or facts of a current or past period.

The Benefits of Using a Warehouse

Although it takes considerable time to design and implement a Data Warehouse, there are several benefits:

  • Enhanced Business Intelligence. Data from multiple sources in a single database enables a central view across the organization.
  • Time efficient. Since users can quickly access data from several sources, they can rapidly make informed decisions. Besides that, executives are empowered and can query the data themselves with little or no support from IT, saving further time and money.
  • Enhanced Data Quality and Consistency. Providing consistent descriptions and standards, and fixing incoherent or missing data, will improve data quality and consistency.
  • Historical Intelligence. A data warehouse stores large amounts of historical data, allowing users to analyze trends and make future predictions.
  • High Return on Investment. Organizations that have implemented data warehouses and complementary BI systems can generate more revenue and save more money when compared to organizations that haven’t invested.
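Historical intelligence in particular is easy to picture. With several years of facts in one table, a trend falls out of a single aggregate query. The figures and table below are invented, with an in-memory SQLite database standing in for the warehouse:

```python
import sqlite3

# A toy "warehouse" fact table holding several years of history (invented figures).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (year INTEGER, region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(2021, "EU", 120.0), (2021, "US", 200.0),
     (2022, "EU", 150.0), (2022, "US", 210.0),
     (2023, "EU", 180.0), (2023, "US", 260.0)],
)

# Historical intelligence: aggregate past periods to expose the trend.
trend = conn.execute(
    "SELECT year, SUM(revenue) FROM orders GROUP BY year ORDER BY year"
).fetchall()
print(trend)  # [(2021, 320.0), (2022, 360.0), (2023, 440.0)]
```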


What is OLAP?

OLAP, or online analytical processing, is a technology for organizing data. Its aim is to support understanding and the creation of reports.

It is simply a way of making a new copy of the business data and storing it in a cube. The cube stores the data in a form optimized for reporting. Creating a new copy of the data means that reporting work won’t impact transactional systems.

Online Analytical Processing is the technology behind many Business Intelligence applications. It allows users to analyze data in multiple dimensions, providing the insight and understanding they need for better decision making.


OLAP technology is one part of a larger ecosystem.  Data comes from a warehouse into the OLAP system. Subsequently, data flows from the OLAP system to mining and visualization tools.
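The idea of a cube can be sketched with plain dictionaries: each cell holds a pre-aggregated total for a combination of dimensions, plus roll-ups along each dimension. The fact rows below are invented sample data:

```python
from collections import defaultdict

# Fact rows: (region, quarter, sales) -- invented sample data.
facts = [
    ("North", "Q1", 10), ("North", "Q2", 15),
    ("South", "Q1", 20), ("South", "Q2", 5),
]

# Build a tiny "cube": a total for every (region, quarter) cell,
# plus roll-ups along each dimension and a grand total.
cube = defaultdict(int)
for region, quarter, sales in facts:
    cube[(region, quarter)] += sales   # individual cell
    cube[(region, "ALL")] += sales     # roll-up over quarters
    cube[("ALL", quarter)] += sales    # roll-up over regions
    cube[("ALL", "ALL")] += sales      # grand total

print(cube[("North", "ALL")])  # 25
print(cube[("ALL", "Q1")])     # 30
```

Because every aggregate is computed up front, a report answers questions like "total sales for North" with a single lookup rather than a scan of the transactional data, which is what makes cubes fast for reporting.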

You can read the Wikipedia article on online analytical processing for more detail.



Can we use databases for data science?

Data science uses databases, but there are other, more modern options, such as data lakes and data warehouses. It can be confusing to know which to use.

The main difference between a warehouse, a lake and a database is easy to explain. A relational database stores and organizes structured data from a single source, for example a transactional system. By comparison, data warehouses hold structured data from multiple sources. Data lakes differ from both in that they store unstructured, semi-structured and structured data.

Additionally, databases are strictly controlled. They have to be, to guarantee that they don’t make mistakes when processing transactions: a database must always be able to reverse a transaction and, in the event of a power failure, recover perfectly. These are great features, but they add complexity to the system. When we experiment with data we don’t want this complexity, as it can slow down the work. Lakes are much less controlled.
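The "reverse a transaction" guarantee, part of what are commonly called the ACID properties, is easy to demonstrate with SQLite. The account table and the simulated failure below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100.0)")
conn.commit()

# Attempt an update that fails halfway; the database reverses it completely.
try:
    with conn:  # opens a transaction; rolls back if an exception escapes
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        raise RuntimeError("simulated power failure mid-transaction")
except RuntimeError:
    pass

balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'"
).fetchone()[0]
print(balance)  # 100.0 -- the partial update was rolled back
```

This bookkeeping is exactly the machinery a lake skips, which is why lakes are cheaper and more flexible for exploration but unsuitable for transaction processing.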

Relational databases are easy to build. However, they don’t support unstructured data or the vast amounts of data being generated today. Hence the emergence of the data warehouse and data lake.

We still need databases for data science, however. For example, at JTA we use databases to store Master Data and to help us with data cleaning. We also store the nicely structured output in a database before we generate reports.


What is a Data Lake?

A Data Lake (“DL”) is a repository that can hold large amounts of data. It stores every type of data in its native format, with no fixed limits on size or number of files.

A data lake can hold structured data such as rows and columns from a relational database. It can also hold semi-structured data, for example CSV, logs, XML and JSON. Finally, it can store unstructured data, for example emails, documents, PDFs and binary data like images, audio and video. Current DL solutions include Azure Data Lake, Amazon S3 cloud storage and the Apache Hadoop distributed file system.

The Benefits of Using a Lake

  • Data richness. Ability to store many sources and types, for example text, audio, images and video.
  • Data Democratization. A lake makes data available to the whole organization.
  • Storage in native format. A lake doesn’t need modeling when data is loaded. Instead the data is molded when being explored for analytics. Consequently, lakes offer flexibility to ask business questions and to gain insight.
  • Scalability. Lakes offer scalability at a modest price when compared to a traditional data warehouse.
  • Advanced Analytics. A lake links large amounts of data to deep learning algorithms, helping with real-time decisions.
  • Complementary to existing data warehouse. Warehouses and lakes can work together, resulting in an integrated data strategy.
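Storage in native format, often called "schema-on-read", can be sketched in a few lines. The raw records below are invented JSON event strings; no schema is imposed until a question is actually asked of the data:

```python
import json

# Raw records land in the lake in their native format (here, JSON strings),
# with no upfront schema -- invented sample events.
lake = [
    '{"user": "ana", "clicks": 3, "device": "mobile"}',
    '{"user": "ben", "clicks": 7}',                      # a missing field is fine
    '{"user": "cal", "clicks": 2, "device": "desktop"}',
]

# Schema-on-read: structure is imposed only at analysis time.
def clicks_by_device(raw_records):
    totals = {}
    for line in raw_records:
        event = json.loads(line)
        device = event.get("device", "unknown")  # tolerate absent fields
        totals[device] = totals.get(device, 0) + event["clicks"]
    return totals

print(clicks_by_device(lake))  # {'mobile': 3, 'unknown': 7, 'desktop': 2}
```

Note how the record without a `device` field is still usable: nothing was rejected at load time, and the analysis decides how to handle the gap.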

How do Warehouses Compare to Lakes?

Depending on the requirements, an organization may require a data warehouse or a data lake or both.  They serve different needs.

Characteristics of a traditional data warehouse versus a modern data lake:

  • Type of Data. Warehouse: relational data from transactional systems, databases, and business applications. Lake: non-relational and relational data from many sources, for example IoT devices, web sites, mobile apps and social media.
  • Schema. Warehouse: designed prior to the warehouse implementation. Lake: written at the time of analysis.
  • Price/Performance. Warehouse: medium-speed query results using high-cost storage. Lake: faster query results using low-cost storage.
  • Data Quality. Warehouse: highly curated data that serves as the one version of the truth. Lake: any data, which may or may not be curated.
  • Users. Warehouse: business analysts. Lake: data scientists, data developers and business analysts.
  • Analytics. Warehouse: batch reporting, BI and visualizations. Lake: machine learning, predictive analytics, data discovery and profiling.

You might also like to read Wikipedia’s article on data lakes.

Big Data is mentioned a lot. What exactly is it?

Big Data is more than just a large volume of data. It is a set of technologies that allow you to capture, store, process and analyze data, and to discern value from it. For example, Big Data allows one to acquire new knowledge at high speed.

The main characteristics inherent in Big Data are volume, variety and velocity. We call these three characteristics the three Vs:

  • Volume refers to the quantity of generated and stored data
  • Variety refers to the type and nature of the data, and
  • Velocity refers to the high speed at which the data is processed

However, some researchers claim that the three Vs are too simplistic a view of the concept. Possible additional Vs are:

  • Veracity, which refers to the quality and trustworthiness of the data, and
  • Value, which refers to the economic value of the data
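Velocity in particular changes how code is written: data is processed as it arrives rather than collected first. The generator below stands in for a live stream of hypothetical sensor readings, with running statistics updated one record at a time:

```python
# Velocity: process each record on arrival instead of batching everything first.
# A generator stands in for a live stream of (hypothetical) sensor readings.
def sensor_stream():
    for reading in [21.5, 22.0, 98.6, 21.8, 22.1]:
        yield reading

# Running statistics updated per record, without storing the whole stream.
count, total, alerts = 0, 0.0, 0
for value in sensor_stream():
    count += 1
    total += value
    if value > 50:          # react immediately to anomalies
        alerts += 1

print(count, round(total / count, 2), alerts)  # 5 37.2 1
```

The same streaming shape scales from a five-element list to a feed that never ends, which is the point: velocity forces constant-memory, per-record processing.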

All industries have applications for big data.


R Versus Python: Which is better for data analysis?

The choice of R versus Python is largely academic. At JTA we prefer R, although both languages are perfectly acceptable. There are a few differences between the two, which we can summarize here:

  • R has a much more extensive library of statistical packages and specialized techniques.
  • You can find R packages for a wide variety of disciplines, from Finance to Medicine to Meteorology.
  • Python is a general-purpose programming language, which can be used to write websites and applications whereas R is a Data Science tool.
  • R builds in data analysis functionality by default, whereas Python relies on packages.
  • Python currently has more packages for deep learning although this is changing.
  • R is better for data visualization with plotting being more customizable.
  • R is being integrated into mainstream products such as SQL Server and Power BI.

We also recommend using Microsoft R Open because of its multithreaded math libraries.

