Learning objectives

With the explosion of data, numerous challenges have arisen around how to store, access and secure it without compromising performance and productivity. Driven by those requirements and constraints, database technologies have made great strides in recent years, with a strong focus on analytics workloads.

This module will introduce you to the fundamentals of SQL and how to use it with databases. By the end of this module you will:

  1. Understand why we need databases.
  2. Learn the fundamental SQL commands to get and subset data.
  3. Learn how to use SQL from within other tools.

Definitions and motivating factors

What is a database?

We’ll be paraphrasing the definition used by Oracle, one of the oldest companies in the field, which has been involved in the design and development of databases for decades:

A database is an organized collection of structured information, or data, typically stored electronically in a computer system.

What is Structured Query Language (SQL)?

SQL is a programming language used by nearly all relational databases to query, manipulate, and define data, and to provide access control.

The motivating factors for using a database and SQL are:

  • Data storage. We don’t need to talk in detail about the huge volumes of data becoming available for analysis every day. All this information has to be stored in a proper way, to ensure it is not lost, and can grow properly. Databases allow efficient storage of such data.
  • Data access. It is not enough to just have the data stored in the right way - it has to be available for downstream work. Databases serve as a single source of truth, and provide interfaces (especially through SQL) to be queried, and written to, by other tools and frameworks. Workers from different teams are able to efficiently work with their data.
  • Security. Modern databases have all sorts of mechanisms that protect the data from accidental (or malevolent) over-writing, copying and deletion. Those tools can help organizations make sure their data is secure.
  • Scalability and performance. Working with databases allows data scientists and engineers to avoid loading the complete dataset into memory. They can select just the data they need for the analysis.

Database Tutorial for Beginners (5 minutes)

The definitions of database and SQL given above specified that we have structured data in a relational database.

Structured data can most easily be understood as data organized according to a set of rules. For example, data in a tabular format, with a column header for each variable and a row for each observation, is structured data. Unstructured data contains information but does not follow a predefined organization: pictures, video, audio and free-text documents are typical examples. Sometimes you’ll see other file formats that contain data, such as JSON or XML (popular on the web), described as structured data. Although these formats are defined by rules, they are better described as semi-structured data.

A relational database is just one of several database types, each with its own pros and cons:

  • Hierarchical. There’s a ranking system, and the data is split into parent-child relationships.
  • Network. There is very loose structure, and there are bi-directional relationships between the entities.
  • Object-oriented. Every piece of information is stored as an object.
  • Non-relational. All data points can be stored in different formats that are loosely coupled, without an enforced database schema (we’ll cover what schema is later in the module).
  • Relational. The data is stored in tables. Observations across the tables are linked by a series of unique identifiers. The diagram below is from Wikipedia. Can you make out what information the tables contain and how they are linked?

Relational databases are the most common and accessible type, thanks to an SQL interface. Knowledge of SQL will help you start working with different systems later on.
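As a minimal, hypothetical sketch of that idea (the table and column names here are made up, not taken from any database used later), two tables can be linked through a shared unique identifier:

-- A hypothetical pair of related tables: each order points back to one customer
CREATE TABLE customers (
  customer_id INT PRIMARY KEY,
  name        VARCHAR(50)
);

CREATE TABLE orders (
  order_id    INT PRIMARY KEY,
  customer_id INT,                 -- links each order to a row in customers
  amount      DECIMAL(10,2),
  FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
);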

How do you pronounce SQL? There’s no single right answer, but there are two accepted options. The first is to spell out the letters: S-Q-L. The second is to say “sequel”, which is arguably easier on the ear.

Blended learning: NoSQL Databases

There are several fundamentally different ways in which data can be stored. The most common one, and the one we will be using, is the relational model, but another of great interest to data scientists is NoSQL. Contrary to popular belief, NoSQL stands for “Not only SQL”.

The most popular NoSQL database is MongoDB. The Data Wrangling with MongoDB course on Udacity provides a good introduction if you are interested in learning more.

Go through this tutorial to understand the difference between SQL and NoSQL (1 hour).

Blended learning: Designing Data-Intensive Applications

One of the most influential books of recent years, which many data engineers keep on their desks, is Designing Data-Intensive Applications. A lot of the concepts introduced there might be a bit too advanced for beginners, but it works well as a reference guide and a useful entry point for those who want to specialise in this domain.

Getting started with databases locally

The best way to learn to work with databases initially is to set them up locally, instead of using pre-built cloud solutions which are already configured. An excellent tutorial is available here (1 hour).

Install MySQL

The specific database we’ll use is called MySQL. MySQL is an open source Relational Database Management System (RDBMS) based on SQL. The MySQL installation guide is the best source for instructions on how to install it on your specific system.

To test whether your installation worked, type which mysql in your command line afterwards. This command should return the path to the mysql executable without an error.

Database IDEs

MySQL comes packaged with MySQL Workbench, an IDE for writing and executing SQL commands. For our purposes this is perfectly fine, but there are several other tools that you can try out.

DataGrip

DataGrip, shown below, is commercial software from JetBrains. It is powerful, looks sleek, and is recommended if you’re using other JetBrains software, like the Python IDE PyCharm. It has excellent integration with advanced tools, such as Docker.

DBeaver

DBeaver, see below, is an open source alternative to DataGrip with great functionality.

First Data

In order to practice working with databases we need some sample data. Luckily for us, the official MySQL developers provide exactly that: the Employees Sample Database. As a preparatory step, go to the website and follow the instructions to download and validate the data.

If for some reason you can’t manage to set up the employees database, you can use the toy database which is built into DBeaver. Feel free to test the commands from the employees section there!

Basic commands

Now let’s have a look at the employee titles table and check whether the data has been loaded correctly. We will use the most common SQL command, which lets us view the first 20 rows of all columns in a single table (here the titles table).

SELECT * FROM employees.titles LIMIT 20;

One valuable piece of information would be how many employees we have per job title. Such a table of counts per group (a frequency, or one-way contingency, table) can be obtained with the following command:

SELECT title, count(*)
FROM employees.titles
GROUP BY title;

Another useful tool in SQL is filtering by a string. Perhaps we want to find out how many Engineer employees we have. You can see that we can take the previous command, modify it a little bit, and get the results we want. This process is very common in SQL usage, and it is highly recommended that you save[^git] your queries when working on a project (even between projects), since you will often be reusing them.

SELECT COUNT(*)
FROM employees.titles
WHERE title = 'Engineer';

And finally, which engineer titles were assigned from 1996 onwards? We can also filter by dates[^dates]:

SELECT title, from_date
FROM employees.titles
WHERE (title = 'Engineer' AND from_date > '1996-01-01');

Those commands are just the tip of the iceberg of what you can do with SQL. Let’s have a look at some advanced concepts next.

Database schema and data types

You might ask yourself: is there a way to visualise the relationships between all the tables in the database? Looking at them individually is not enough. This structure is called the database “schema,” and it is commonly represented by an Entity Relationship Diagram (ERD). All tools for working with databases, including MySQL Workbench and DBeaver, let you generate such diagrams automatically. Have a look at the one generated for the employees database:

The boxes represent different tables, and the arrows show the relationships between them, based on “keys” such as emp_no and dept_no. Those keys can be used to join tables together. Note how they are marked in bold. We’ll explore this in a bit. Let’s first cover another important topic: data types.
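As a hedged illustration using the employees sample database from above (assuming it is loaded as described), you can inspect a table’s columns and keys, and then use the shared emp_no key to join two tables:

-- Show the columns, types and keys of the titles table
DESCRIBE employees.titles;

-- Join employees to their titles on the shared emp_no key
SELECT e.first_name, e.last_name, t.title
FROM employees.employees e
INNER JOIN employees.titles t ON e.emp_no = t.emp_no
LIMIT 10;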

For the following procedures we will be using a different, somewhat more complex database. This will help us illustrate some of the more advanced database concepts. You should download the data/mysqlsampledatabase.sql file from the module materials. Opening the file should automatically launch MySQL Workbench. The alternative is to use the command line:

sudo mysql -t < data/mysqlsampledatabase.sql

If you look through the code you will start to understand what it does (hint: it performs the import automatically, unlike how we loaded the employees DB).

/*!40101 SET NAMES utf8 */;

/*!40101 SET SQL_MODE=''*/;

/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

Those initial lines do some housekeeping work. They set essential session settings, such as the character set and SQL mode, and temporarily relax some integrity checks to speed up the import. We will not go into further detail on those.

CREATE DATABASE /*!32312 IF NOT EXISTS*/`classicmodels` /*!40100 DEFAULT CHARACTER SET latin1 */;

USE `classicmodels`;

These lines create a new database, provided it does not already exist (to make sure you don’t overwrite data by accident). The USE statement then makes it the default database.

DROP TABLE IF EXISTS `customers`;

CREATE TABLE `customers` (
  `customerNumber` int(11) NOT NULL,
  `customerName` varchar(50) NOT NULL,
  `contactLastName` varchar(50) NOT NULL,
  `contactFirstName` varchar(50) NOT NULL,
  `phone` varchar(50) NOT NULL,
  `addressLine1` varchar(50) NOT NULL,
  `addressLine2` varchar(50) DEFAULT NULL,
  `city` varchar(50) NOT NULL,
  `state` varchar(50) DEFAULT NULL,
  `postalCode` varchar(15) DEFAULT NULL,
  `country` varchar(50) NOT NULL,
  `salesRepEmployeeNumber` int(11) DEFAULT NULL,
  `creditLimit` decimal(10,2) DEFAULT NULL,
  PRIMARY KEY (`customerNumber`),
  KEY `salesRepEmployeeNumber` (`salesRepEmployeeNumber`),
  CONSTRAINT `customers_ibfk_1` FOREIGN KEY (`salesRepEmployeeNumber`) REFERENCES `employees` (`employeeNumber`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The first line will drop (delete) the table if it already exists (note that this is the mirror image of the IF NOT EXISTS guard in the previous chunk). The following lines are extremely important, since they define the data type of each column. Several data types are used here, such as int and varchar; we will go into more detail later.

Another important function of this code chunk is the specification of the table keys, such as the PRIMARY KEY and the FOREIGN KEY; those will also be covered separately.

The remainder of the code, after the line:

/*Data for the table `customers` */

inserts the data into the table. The rest of the tables follow the same loading pattern, so we will not cover them individually.
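For reference, each of those statements follows the familiar INSERT INTO pattern. A hedged sketch of one such statement (this particular row is made up, not taken from the dump) looks like this:

-- Hypothetical example row for the customers table, listing only the NOT NULL columns
INSERT INTO `customers`
  (`customerNumber`, `customerName`, `contactLastName`, `contactFirstName`,
   `phone`, `addressLine1`, `city`, `country`)
VALUES
  (9999, 'Example Trading Ltd', 'Doe', 'Jane',
   '555-0100', '1 Example Street', 'Springfield', 'USA');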

In order to store and use data correctly, you need to understand the different standard data types. Those are largely common across different database systems. So far your main exposure to data types has probably been in your programming language of choice, R or Python. Even there, there are often headaches, since the data-input libraries for those languages (e.g. readr or pandas) try to infer data types automatically to help you, and while that often works, it can also fail and lead to bugs (especially with factors in base R).

Unless you are using tools such as AWS Glue (covered in a later section), you will need to set up the data types manually. An overview of what is available is presented in the diagram below:

Important Data Types in MySQL (5 minutes)
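To make these concrete, here is a minimal, hypothetical table definition (the table and columns are made up) showing several of the most frequently used MySQL types:

CREATE TABLE sample_types (
  id         INT NOT NULL,        -- whole numbers
  price      DECIMAL(10,2),       -- exact fixed-point numbers, e.g. money
  ratio      DOUBLE,              -- approximate floating-point numbers
  name       VARCHAR(50),         -- variable-length strings up to 50 characters
  bio        TEXT,                -- long, unbounded text
  hired_on   DATE,                -- calendar dates
  updated_at DATETIME,            -- date and time
  is_active  BOOLEAN,             -- stored as TINYINT(1) in MySQL
  PRIMARY KEY (id)
);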

Advanced SQL

The first questions we asked were relatively easy to answer and understand, and did not require complex operations across several tables. One of the most typical workflows for data scientists in industry is to do this basic selection and filtering and then work with the data in your language of choice (more on how to use R and Python with relational databases below). Sometimes, however, we need more complex operations.

Blended learning: Advanced SQL

To learn the advanced concepts go through the following tutorials:

And finally, go through this Kaggle Overview tutorial (2 hours).

Let’s illustrate those with the kind of question a business manager might come up with, which it is then the data person’s job to answer:

  • What is the average payment amount in each city?

The first thing we should do is determine what data we need to answer this question and where to find it. We should have a look at the ERD above: we’ll need to connect the payments and customers tables somehow. In SQL those operations are called “joins”. But first, a quick look at the ER diagram (remember, we can generate it with DBeaver):

Now, a quick detour on the possible joins. One might think joining two tables is a straightforward process, but in reality there are several ways to do it, and depending on your purpose you might go for one or another. Those are illustrated in the diagram below:

In our case we’ll be using the inner join.

SELECT p.customerNumber, p.amount, c.city
FROM classicmodels.payments p
INNER JOIN classicmodels.customers c ON p.customerNumber = c.customerNumber;

This should combine both tables into one, with the columns connected on the key. An important first concept to notice here is aliasing. Instead of typing out the full table names, we can shorten them like this: classicmodels.payments p. This way we can replace the verbose classicmodels.payments with p in other parts of the SQL query. It might be confusing for programming folks, since the alias is defined after the place where it is first used.
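For contrast, here is a hedged sketch of the same query written with a LEFT JOIN. It is not needed for our question, but it would keep every customer, including those without any payments (their amount would come back as NULL):

-- LEFT JOIN keeps all customers, even those with no matching payments
SELECT c.customerNumber, c.city, p.amount
FROM classicmodels.customers c
LEFT JOIN classicmodels.payments p ON p.customerNumber = c.customerNumber;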

Back to the analysis. The result from the inner join query above should look like this:

As a next step we need to aggregate per city and compute the average. We can achieve this by using the avg function and the group by clause:

SELECT AVG(p.amount), c.city
FROM classicmodels.payments p
INNER JOIN classicmodels.customers c ON p.customerNumber = c.customerNumber
GROUP BY c.city;

And voila! Here are the results:

We managed to answer our analytical question with SQL.
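As a small optional extension (a hedged sketch, not part of the walkthrough above), you could also sort the cities by their average payment to surface the highest-paying ones first:

-- Same aggregation, ordered by the average payment, largest first
SELECT c.city, AVG(p.amount) AS avg_payment
FROM classicmodels.payments p
INNER JOIN classicmodels.customers c ON p.customerNumber = c.customerNumber
GROUP BY c.city
ORDER BY avg_payment DESC
LIMIT 10;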

Additional information: BI Software

While we as data scientists often prefer to visualise and analyse data programmatically, even in our field it is sometimes useful to use a more specialised GUI. Those tools can be very advanced, and while they are mostly the domain of a data analyst, we can still derive value from them.

Two examples of such tools are Microsoft PowerBI and Tableau.

Using SQL with R

There are several ways to obtain data from a database with R. If you are using RStudio, probably the most intuitive initial option is to use the GUI for database connectors, shown here:

A second option is to connect to the database programmatically. An example of how to do that is shown below. As a first step we need to establish a connection. Here you will need the DBI package (installed separately), and you should also know the database type.

library(dplyr)

# connect to a local, in-memory SQLite database (adapted from the dplyr/dbplyr documentation)
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

For a remote database, we instead create the connection object with our credentials:

con <- DBI::dbConnect(RMySQL::MySQL(),
  host = "database.rstudio.com",
  user = "hadley",
  password = rstudioapi::askForPassword("Database password")
)

And finally, let’s use this connection to create a reference to a table, which we can then work with much like a dataframe:

flights_db <- tbl(con, "flights")

From this step onwards we are ready to process the data.

Blended learning: Using dplyr to work with databases

For this example we took snippets of code from the official documentation. In order to really understand how this works, and also appreciate the flexibility of connecting to a database programmatically, go through the tutorial yourself (1 hour).

Using SQL with Python

For Python there is no GUI option as nice as the one RStudio offers for R, but the process is not much different. As a first step we need to import the sqlalchemy package and specify some parameters.

import pandas as pd
import sqlalchemy as sql
import pymssql  # driver used by the mssql+pymssql connection string below

# example server address and query (placeholders; adjust for your own database)
server = '100.10.10.10'
myQuery = '''SELECT first.Field1, second.Field2
           FROM db1.schema.Table1 AS first
           JOIN db2.schema.Table2 AS second
           ON first.Id = second.FirstId'''

After this we create the engine (pretty much the same as the con object we used in R before).

engine = sql.create_engine('mssql+pymssql://{}'.format(server))

And finally we can use pandas to read directly from the database into a dataframe.

df = pd.read_sql_query(myQuery, engine)

And we’re ready for analysis!

Memory issues

You should keep in mind that any time you read from a remote database into your local machine, the data is held in RAM, so it is very easy to run out of resources while working.

A good idea is to first use an IDE (like DataGrip, which we covered before) to get your bearings, and then fetch only the data you need instead of all of it.
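As a hedged illustration with the classicmodels database from earlier, restricting both the columns and the rows keeps the result small:

-- Instead of pulling an entire table into memory with SELECT *,
-- fetch only the columns and rows you actually need
SELECT customerNumber, amount
FROM classicmodels.payments
WHERE amount > 1000
LIMIT 1000;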

Exercises

Portfolio project

Obtain the NYC Motor Vehicle Collisions data (in .json format) from here. Split the data into three tables for each separate BOROUGH, and have the ability to query them with a join.

Project deliverables:

  • data stored in a relational database, split into the borough tables
  • ER diagram of the database schema
  • bonus points for hosting the database in the cloud

Portfolio Project

Take the Amazon reviews data from the NLP module and store it in a PostgreSQL database using the AWS RDS service. As an additional component, access the data programmatically via either R or Python (depending on the language you chose in the NLP module).

Project deliverables:

  • data stored in PostgreSQL in the cloud
  • data accessed programmatically

Determine the average number of orders per customer (classicmodels dataset).

What are the 5 least profitable cities in the data (classicmodels dataset)?

Which products are ordered in the highest quantity (classicmodels dataset)?

What other domain specific databases can you find?

Quizzes

List three reasons for using a database.

How can you use SQL from within R and Python?

What are the two general types of databases?

If you have taken the Fundamentals of Command-line Tools module, could you mention one issue that both CLI tools and SQL solve?

What is the best data type to store a large chunk of text?

What are the two ways you can store Date data to be machine readable?

What are NoSQL databases suitable for?