We have presented here some answers to common questions that we hear from our clients, staff and partners. You are welcome to ask us any other questions that you may have at enquiries@thedatascientists.com.
There’s no dress code. However, we’re not expecting an employee to show up in beach clothes and flip-flops!
The general rule is that the closer you are to the client, the nicer you dress. It is always good to look smart and well dressed with clients; however, that does not mean formal dress such as suits.
We load data into a store using ETL (extract, transform, load). It is a series of steps that collect data and transform it according to business rules. The three steps are extract, transform and load.
ETL is the most effective approach to provide fast access to information. It allows organizations to analyze data that resides in multiple locations in a variety of formats. It increases efficiency and drives better business decisions.
There are several tools available. However, at JTA, we believe that using R programming instead of classic ETL tools provides significantly better data manipulation and is more efficient.
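The three ETL steps can be sketched in a few lines. The example below is a minimal illustration in Python (not the R tooling mentioned above). The sample data, the `orders` table and the three cleaning rules are all hypothetical, chosen only to show the shape of an extract, transform, load pipeline.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """order_id,customer,amount,currency
1, alice ,10.50,EUR
2,Bob,,EUR
3,carol,7.25,eur
"""


def extract(text):
    """Extract: read the raw rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: apply illustrative (assumed) business rules."""
    cleaned = []
    for row in rows:
        if not row["amount"]:  # rule 1: skip rows with no amount
            continue
        cleaned.append((
            int(row["order_id"]),
            row["customer"].strip().title(),  # rule 2: tidy customer names
            float(row["amount"]),
            row["currency"].upper(),  # rule 3: upper-case currency codes
        ))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows to the target store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_etl(text=RAW_CSV):
    """Run the three steps against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
```

Calling `run_etl()` returns the two cleaned rows, `(1, 'Alice', 10.5, 'EUR')` and `(3, 'Carol', 7.25, 'EUR')`; the row with no amount is dropped during the transform step.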
You might be interested to read the Wikipedia article on ELT which you can find here.
If you would like to know some more then read about How JTA The Data Scientists does its work or have a look at some other FAQs.
You could also explore our case studies or whitepapers.
A data repository is a collection of databases that manage and store varying data sets for analysis, sharing and reporting.
There are many different ways to store data that could all be described as a data repository:
Data visualization is the graphical and pictorial representation of information and data. Using tables, graphs and maps, data visualization tools provide a comprehensive method to understand trends, correlations or patterns in data.
The proliferation of data has made it difficult to manage and to benefit from. Data visualization is essential to portray this massive amount of information and to make data-driven decisions. Of course, data is only as good as your ability to understand and communicate it, which is why choosing the right visualization matters.
There are several benefits to using visualization, including:
A data warehouse is a repository of information. It can hold logs, internal data or external data. The records represent events or facts of a current or past period.
Although it takes considerable time to design and implement a Data Warehouse, there are several benefits:
OLAP, or online analytical processing, is a technology for organizing data. Its aim is to support understanding and to create reports.
It is simply a way of making a new copy of the business data and storing it in a cube. The cube stores the data differently. It is stored in a way that is optimized for reporting. Creating a new copy of the data means that reporting work won’t impact transactional systems.
Online Analytical Processing is the technology behind many Business Intelligence applications. It allows users to analyze data in multiple dimensions, and provides the insight and understanding they need for better decision making.
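The idea of a cube that is optimized for reporting can be illustrated with a small sketch. The Python example below is a toy, not how any particular OLAP product works: it pre-computes revenue totals for every combination of three dimensions so that a report becomes a simple lookup. The sales facts, the dimension names and the `build_cube` and `query` helpers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales facts: (region, product, quarter, revenue)
FACTS = [
    ("North", "Widget", "Q1", 100),
    ("North", "Gadget", "Q1", 50),
    ("South", "Widget", "Q1", 70),
    ("North", "Widget", "Q2", 120),
]
DIMENSIONS = ("region", "product", "quarter")


def build_cube(facts, dimensions=DIMENSIONS):
    """Pre-aggregate revenue for every combination of dimensions."""
    cube = defaultdict(int)
    n = len(dimensions)
    for *values, revenue in facts:
        for size in range(n + 1):
            for kept in combinations(range(n), size):
                # "*" marks a dimension that has been rolled up (summed over).
                key = tuple(values[i] if i in kept else "*" for i in range(n))
                cube[key] += revenue
    return dict(cube)


def query(cube, region="*", product="*", quarter="*"):
    """Look up a pre-computed total; "*" means 'all values'."""
    return cube.get((region, product, quarter), 0)
```

With `cube = build_cube(FACTS)`, `query(cube)` gives the grand total of 340, `query(cube, region="North")` gives 270, and `query(cube, product="Widget", quarter="Q1")` gives 170. The trade-off is the one described above: the cube is a second copy of the data, built ahead of time, so reports do not touch the transactional system.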
OLAP technology is one part of a larger ecosystem. Data comes from a warehouse into the OLAP system. Subsequently, data flows from the OLAP system to mining and visualization tools.
You can read the Wikipedia article on online analytical processing here.
Data science uses databases, but there are other, more modern options, for example data lakes and data warehouses. It can be confusing to know which to use.
The main difference between a warehouse, a lake and a database is easy to explain. A relational database stores and organizes structured data from a single source, for example a transactional system. By comparison, data warehouses hold structured data from multiple sources. Data lakes differ from both in that they store unstructured, semi-structured and structured data.
Additionally, databases are strictly controlled. They have to be like this to guarantee that they don’t make mistakes in processing transactions. For example, a database must always be able to reverse a transaction and, in the event of a power failure, recover perfectly. These are great features but they add complexity to the system. When we experiment with data we don’t want this complexity as it can slow down the work. Lakes are much less controlled.
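The ability to reverse a transaction can be seen in a short sketch. The Python example below uses the standard-library `sqlite3` module as a stand-in for a transactional database; the `accounts` table, the `transfer` helper and the "no negative balance" rule are assumptions made for illustration.

```python
import sqlite3


def transfer(conn, source, target, amount):
    """Move money between accounts: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (source,)
            ).fetchone()
            if balance < 0:  # assumed business rule: no overdrafts
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def demo():
    """Run one valid and one invalid transfer, then report the balances."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
    conn.commit()
    ok = transfer(conn, "a", "b", 30)
    failed = transfer(conn, "a", "b", 500)
    balances = dict(conn.execute("SELECT name, balance FROM accounts"))
    return ok, failed, balances
```

`demo()` returns `(True, False, {'a': 70, 'b': 30})`: the second transfer is reversed in full, so neither account is left half-updated. This bookkeeping is the kind of control that exploratory work on a lake usually does without.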
Relational databases are easy to build. However, relational databases don’t support unstructured data, or the vast amount of data being generated today. Hence the emergence of the data warehouse and data lake options.
We still need databases for data science, however. For example, in JTA we use databases to store Master Data and to help us with data cleaning. We also store the nicely structured output in a database before we generate reports.
A Data Lake (“DL”) is storage that can store large amounts of data. It stores every type of data in its native format with no fixed limits on size or number of files.
A data lake can hold structured data such as rows and columns from a relational database. It can also hold semi-structured data, for example, CSV, logs, XML and JSON. Finally, it can store unstructured data, for example, emails, documents, PDFs and binary data like images, audio and video. Current DL solutions include Azure Data Lake, Amazon S3’s cloud storage services or Apache Hadoop’s distributed file system.
Depending on the requirements, an organization may require a data warehouse or a data lake or both. They serve different needs.
Characteristics | Traditional Data Warehouse | Modern Data Lake |
Type of Data | Relational data from transactional systems, databases, and business applications. | Non-relational and relational data from many sources. For example, IoT devices, web sites, mobile apps, social media, and others. |
Schema | Designed prior to the warehouse implementation (schema-on-write). | Written at the time of analysis (schema-on-read). |
Price Performance | Medium speed query results using high cost storage. | Query results are getting faster, using low-cost storage. |
Data Quality | Highly curated data that serves as the one version of the truth. | Any data that may or may not be curated. |
Users | Business analysts. | Data scientists, Data developers, and Business analysts. |
Analytics | Batch reporting, BI and visualizations. | Machine learning, predictive analytics, data discovery and profiling. |
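The "Schema" row in the table is perhaps the least intuitive, so here is a minimal sketch of a schema applied at the time of analysis. The raw events, their field names and the `read_temperatures` helper are hypothetical; the point is only that the lake stores records as they arrive and each analysis decides which fields it needs and how to interpret them.

```python
import json

# Hypothetical raw events stored as-is in the lake; the shapes differ per source.
RAW_EVENTS = [
    '{"device": "sensor-1", "temp_c": 21.5, "ts": "2024-01-01T10:00:00"}',
    '{"user": "alice", "page": "/home", "ts": "2024-01-01T10:00:05"}',
    '{"device": "sensor-2", "temp_c": "19.0", "ts": "2024-01-01T10:01:00"}',
]


def read_temperatures(lines):
    """Apply a schema at read time: keep only records that fit it."""
    readings = []
    for line in lines:
        record = json.loads(line)
        if "device" not in record or "temp_c" not in record:
            continue  # not a temperature reading; ignored for this analysis
        readings.append({
            "device": str(record["device"]),
            "temp_c": float(record["temp_c"]),  # coerce strings to numbers
        })
    return readings
```

`read_temperatures(RAW_EVENTS)` returns two readings, for `sensor-1` (21.5) and `sensor-2` (19.0), and skips the web-page event. A warehouse would instead reject or reshape such records when they are loaded, because its schema is fixed in advance.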
You might also like to read Wikipedia’s article on data lakes.
The benefits of big data can differ by industry. There are, however, common benefits from using big data. For example, lower cost, reduced time and better competitive advantage. Other benefits which may be possible include:
Unfortunately, there are also challenges with big data:
Big data comes from many different places: applications, social media, email, employee-created documents and others. It is very difficult to combine all that data effectively. Unfortunately, most machine analysis algorithms expect homogeneous data to work properly.
Big Data usually has information from many sources. Furthermore, the sources may be of varying reliability. Much of that data is unstructured, meaning that it doesn’t come from a database. Documents, photos, audio, videos and other unstructured data can be difficult to analyze.
As data grows in volume we need real-time techniques to decide what should be stored. It is often not economically viable to store all the raw data. Companies must be good at curating their data.
Many organizations are still new to big data. The skill set is not the same as that for business intelligence and data warehousing, for which most organizations have developed their skills.
Managing privacy effectively is both a technical and a sociological problem. Also, the value of the data owned by an organization becomes important. Organizations are concerned with how to leverage this data, while keeping their data advantage. Questions such as how to sell data without losing control are becoming important.
Big Data is more than just a large volume of data. It is a technology that allows you to capture, store, process, analyze and discern value. For example, Big Data allows one to acquire new knowledge at high speed.
The main characteristics inherent in Big Data are volume, variety and velocity. We call these three characteristics the three Vs:
However, some researchers claim that the three Vs are too simplistic a view of the concept. Possible new Vs are:
All industries have applications for big data.