With the explosion of data, numerous challenges have arisen in how to store, access and secure it without compromising on performance and productivity. Driven by these requirements and constraints, database technologies have made great strides in recent years, with a strong focus on analytics workloads.
This module will introduce you to the fundamentals of SQL and using it with databases. By the end of this module you’ll:
We’ll paraphrase the definition given by Oracle, one of the oldest companies in the field, which has been designing and developing databases for decades:
A database is an organized collection of structured information, or data, typically stored electronically in a computer system.
SQL is a programming language used by nearly all relational databases to query, manipulate, and define data, and to provide access control.
The motivating factors for using a database and SQL are:
The definitions of database and SQL given above specified that we have structured data in a relational database.
Structured data is most easily understood as data organized according to a fixed set of rules. For example, data in a tabular format, with a column header for each variable and a row for each observation, is structured data. Unstructured data is anything that contains information without such a predefined organization, like pictures, video, audio or free-form text documents. Sometimes you’ll see other file formats that contain data, such as JSON or XML, popular on the web, described as structured data. Although these formats are defined by syntactic rules, they are better described as semi-structured data, since individual records can nest or omit fields.
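A small illustration of the difference, using a couple of made-up records: the CSV rows all share the same flat schema, while the JSON record nests a list that has no single tabular cell.

```python
import csv
import io
import json

# Structured data: every row follows the same flat schema.
table = io.StringIO("emp_no,title\n10001,Engineer\n10002,Staff\n")
rows = list(csv.DictReader(table))

# Semi-structured data: JSON follows syntactic rules, but a record
# may nest or omit fields, so it has no fixed tabular shape.
record = json.loads('{"emp_no": 10001, "titles": ["Engineer", "Senior Engineer"]}')

print(rows[0]["title"])     # Engineer
print(record["titles"][1])  # Senior Engineer -- nested, not a flat cell
```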
A relational database is just one of several types of database, each with its own pros and cons:
Relational databases are the most common and accessible type, thanks to an SQL interface. Knowledge of SQL will help you start working with different systems later on.
How to pronounce SQL? There’s no single right answer to this question, but there are two accepted options. The first is to spell out the letters: S.Q.L. The second is to say sequel
- and that one is arguably easier on the ear.
Blended learning: NoSQL Databases
There are several fundamentally different ways in which data can be stored. The most common one, which we will be using, is the relational model, but another of great interest to data scientists is NoSQL. Contrary to popular belief, NoSQL stands for Not-only SQL.
The most popular NoSQL database is MongoDB. The Data Wrangling with MongoDB course on Udacity provides a good introduction if you are interested in learning more.
Go through this tutorial to understand the difference between SQL and NoSQL (1 hour).
Blended learning: Designing Data-Intensive Applications
One of the most influential books of recent years, which many data engineers keep on their desks, is Designing Data-Intensive Applications. Many of the concepts it introduces may be a bit too advanced for beginners, but it works well as a reference guide and a useful entry point for those who want to specialise in this domain.
Getting started with databases locally
The best way to learn to work with databases initially is to set them up locally, instead of using pre-built cloud solutions which are already configured. An excellent tutorial is available here (1 hour).
The specific database we’ll use is called MySQL. MySQL is an open-source Relational Database Management System (RDBMS) based on the language SQL. The MySQL installation guide is the best source of instructions on how to install it on your specific system.
To test whether your installation worked, type which mysql
in your command line afterwards. This command should return a path to the executable without an error.
MySQL comes packaged with MySQLWorkbench, an IDE for writing and executing SQL commands. For our purposes this is perfectly fine, but there are several other tools that you can try out.
DataGrip, shown below, is commercial software from JetBrains. It is powerful, looks sleek, and is recommended if you’re already using other JetBrains software, like the Python IDE PyCharm. It also has excellent integration with advanced tools, such as Docker.
DBeaver, see below, is an open source alternative to DataGrip with great functionality.
In order to practice working with databases we need some sample data. Luckily for us, the official MySQL developers provide exactly that - the Employees Sample Database. As a preparatory step, go to the website and follow the instructions to download and validate the data.
If for some reason you can’t manage to set up the employees
database, you can use the toy database which is built into DBeaver. Feel free to test the commands from the employees
section there!
Now let’s have a look at the employee titles table and check whether the data loaded correctly. We will use the most common SQL command, which lets us view the first 20 rows of all columns in a single table (here the titles
table).
SELECT * FROM employees.titles LIMIT 20;
One valuable piece of information would be the number of employees we have per job title. This kind of one-way count is called a frequency table, and we can obtain it with the following command:
SELECT title, count(*)
FROM employees.titles
GROUP BY title;
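If you want to experiment without a MySQL server, the same GROUP BY count can be reproduced with Python’s built-in sqlite3 module; the handful of rows below are made up, not the real employees data.

```python
import sqlite3

# In-memory stand-in for employees.titles, with a few made-up rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE titles (emp_no INTEGER, title TEXT)")
con.executemany(
    "INSERT INTO titles VALUES (?, ?)",
    [(1, "Engineer"), (2, "Engineer"), (3, "Staff"), (4, "Senior Engineer")],
)

# The same GROUP BY count as above: one row per distinct title.
counts = dict(con.execute("SELECT title, COUNT(*) FROM titles GROUP BY title"))
print(counts)  # counts["Engineer"] == 2, counts["Staff"] == 1
con.close()
```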
Another useful tool in SQL is filtering by a string. Perhaps we want to find out how many Engineer
employees we have. You can see that we can take the previous command, modify it a little, and get the results we want. This process is very common in SQL usage, and it is highly recommended that you save[^git] your queries when working on a project (even across projects), since you will often be reusing them.
SELECT COUNT(*)
FROM employees.titles
WHERE title = 'Engineer';
And finally, how about the number of engineers we have from 1996
onwards? We can also filter by dates[^dates]:
SELECT title, from_date
FROM employees.titles
WHERE (title = 'Engineer' AND from_date > '1996-01-01');
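One pitfall to watch for: the date literal needs quotes, because an unquoted 1996-01-01 is parsed as integer arithmetic (1996 minus 1 minus 1), not a date. A quick sketch with Python’s built-in sqlite3 module and made-up rows shows this; MySQL treats the bare literal the same way.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Unquoted, 1996-01-01 is integer arithmetic: 1996 - 1 - 1 = 1994.
val = con.execute("SELECT 1996-01-01").fetchone()[0]
print(val)  # 1994

# Quoted, '1996-01-01' is a string literal that compares correctly
# against other ISO-formatted dates.
con.execute("CREATE TABLE titles (title TEXT, from_date TEXT)")
con.executemany("INSERT INTO titles VALUES (?, ?)",
                [("Engineer", "1995-06-01"), ("Engineer", "1997-03-15")])
rows = con.execute(
    "SELECT title, from_date FROM titles "
    "WHERE title = 'Engineer' AND from_date > '1996-01-01'"
).fetchall()
print(rows)  # [('Engineer', '1997-03-15')]
con.close()
```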
Those commands are just the tip of the iceberg of what you can do with SQL. Let’s have a look at some advanced concepts next.
You might ask yourself - is there a way to visualise the relationships between all the tables in the database? Looking at them individually is not enough. This structure is called a database “schema,” and is commonly represented by an Entity Relationship Diagram (ERD). All tools for working with databases, including MySQLWorkbench and DBeaver, can create such diagrams automatically. Have a look at the one generated for the employees
database:
The boxes represent the different tables, and the arrows the relationships between them, based on “keys,” such as emp_no
and dept_no
. These keys can be used to join tables together; note how they are marked in bold. We’ll explore this in a bit. Let’s first cover another important topic - data types.
For the following procedures we will use a different, somewhat more complex database. This will help us illustrate some of the more advanced database concepts. You should download the data/mysqlsampledatabase.sql
file from the module materials. Opening the file should automatically launch MySQLWorkbench. The alternative is to use the command line:
sudo mysql -t < data/mysqlsampledatabase.sql
If you start looking through the code you will begin to understand what it does (hint: it performs the import automatically, unlike how we worked with the employees
DB).
/*!40101 SET NAMES utf8 */;
/*!40101 SET SQL_MODE=''*/;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
Those initial lines do some housekeeping work: they set database metadata and some essential settings, such as the character set and SQL mode. We will not go into further detail on those.
CREATE DATABASE /*!32312 IF NOT EXISTS*/`classicmodels` /*!40100 DEFAULT CHARACTER SET latin1 */;
USE `classicmodels`;
These lines create a new database, provided it did not exist before (so that you don’t overwrite data by accident). Then the USE
statement makes it the default database.
DROP TABLE IF EXISTS `customers`;
CREATE TABLE `customers` (
`customerNumber` int(11) NOT NULL,
`customerName` varchar(50) NOT NULL,
`contactLastName` varchar(50) NOT NULL,
`contactFirstName` varchar(50) NOT NULL,
`phone` varchar(50) NOT NULL,
`addressLine1` varchar(50) NOT NULL,
`addressLine2` varchar(50) DEFAULT NULL,
`city` varchar(50) NOT NULL,
`state` varchar(50) DEFAULT NULL,
`postalCode` varchar(15) DEFAULT NULL,
`country` varchar(50) NOT NULL,
`salesRepEmployeeNumber` int(11) DEFAULT NULL,
`creditLimit` decimal(10,2) DEFAULT NULL,
PRIMARY KEY (`customerNumber`),
KEY `salesRepEmployeeNumber` (`salesRepEmployeeNumber`),
CONSTRAINT `customers_ibfk_1` FOREIGN KEY (`salesRepEmployeeNumber`) REFERENCES `employees` (`employeeNumber`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The first line will drop (delete) the table if it already exists in the database (note that this guard mirrors the IF NOT EXISTS check in the previous code chunk, which protects in the opposite direction). The following lines are extremely important, since they set the data types for the different columns. Several keywords are used for this, such as int
and varchar
; we will go into more detail later.
Another important function of this code chunk is the specification of the table keys, such as PRIMARY KEY
and FOREIGN KEY
; those will also be covered separately.
The remainder of the code, after the line:
/*Data for the table `customers` */
inserts the data into the table. The rest of the tables follow the same loading pattern, so we will not cover them.
In order to store and use data correctly, you need to understand the standard data types, which are largely shared between different database systems. So far your main exposure to data types has probably been in your programming language of choice - R or Python. Even there, there are often headaches: the data-input libraries for those languages (i.e. readr
or pandas
) try to infer data types automatically to help you, and while that often works, it can also fail and lead to bugs (especially with factors in base R).
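The database side has its own version of this problem, which is why choosing column types deliberately matters. A sketch with Python’s built-in sqlite3 module and a made-up zero-padded code: stored in a column declared INTEGER, it silently loses its leading zeros, while a TEXT column preserves them.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The same code value stored under two different declared types.
con.execute("CREATE TABLE t (code_int INTEGER, code_text TEXT)")
con.execute("INSERT INTO t VALUES ('00501', '00501')")

row = con.execute("SELECT code_int, code_text FROM t").fetchone()
print(row)  # (501, '00501') -- the INTEGER column coerced away the leading zeros
con.close()
```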
Unless you are using tools such as AWS Glue (covered in a later section), you will need to set up the data types manually. An overview of what is available is shown in the diagram below:
The first questions we asked were relatively easy to answer and understand, and did not require complex operations across several tables. One of the most typical industry workflows for data scientists is to do such basic selection and filtering in SQL and then work with the data in their language of choice (more on how to use R and Python with relational databases below), but sometimes we need more complex operations.
Blended learning: Advanced SQL
To learn the advanced concepts go through the following tutorials:
And finally this Kaggle Overview tutorial (2 hours).
Let’s illustrate those with a few questions that a business manager might come up with, and that it is the job of the data person to answer:
The first thing we should do is determine what data we need, and where to get it, to answer this question. We should have a look at the ERD available above. We’ll need to connect the payments
and customers
tables somehow. In SQL those operations are called “joins”. But first, a quick look at the ER diagram (remember we can generate it with DBeaver):
Now a quick detour on the possible joins. One might think joining two tables is a straightforward process, but in reality there are several ways to do it, and depending on your purposes you might go for one or another. They are illustrated in the diagram below:
In our case we’ll be using the inner join
.
select p.customerNumber , p.amount, c.city
from classicmodels.payments p
inner join customers c on p.customerNumber=c.customerNumber
This should combine both tables into one, with the columns connected on the key. An important concept to notice here is aliasing. Instead of typing out the full table names, we can shorten them, as in classicmodels.payments p
. This way we can replace the verbose payments
with p
in other parts of the SQL query. It might be confusing for programming folks, since the alias is declared after the place where it’s first used.
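A self-contained sketch of the same inner join and aliasing, using Python’s built-in sqlite3 module with a couple of made-up rows standing in for the MySQL classicmodels tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (customerNumber INTEGER, city TEXT)")
con.execute("CREATE TABLE payments (customerNumber INTEGER, amount REAL)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(103, "Nantes"), (112, "Las Vegas")])
con.executemany("INSERT INTO payments VALUES (?, ?)",
                [(103, 6066.78), (112, 32641.98)])

# The aliases p and c are declared next to the table names in FROM/JOIN,
# yet are already used earlier in the query, in the SELECT list.
rows = con.execute(
    "SELECT p.customerNumber, p.amount, c.city "
    "FROM payments p "
    "INNER JOIN customers c ON p.customerNumber = c.customerNumber "
    "ORDER BY p.customerNumber"
).fetchall()
print(rows)  # [(103, 6066.78, 'Nantes'), (112, 32641.98, 'Las Vegas')]
con.close()
```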
Back to the analysis. The result from the previous chunk should look like this:
Now as a next step we need to aggregate per city and take the average. We can achieve this by using the avg
and group by
clauses:
select avg(p.amount), c.city
from classicmodels.payments p
inner join customers c on p.customerNumber=c.customerNumber
group by city
And voila! Here are the results:
We managed to answer our analytical question with SQL.
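To convince yourself the aggregation behaves as expected, the same join-plus-GROUP BY pattern can be reproduced in a scratch SQLite database with a few made-up rows (the real classicmodels data lives in MySQL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (customerNumber INTEGER, city TEXT)")
con.execute("CREATE TABLE payments (customerNumber INTEGER, amount REAL)")
# Two customers in Nantes, one in Melbourne (made-up data).
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(103, "Nantes"), (112, "Nantes"), (114, "Melbourne")])
con.executemany("INSERT INTO payments VALUES (?, ?)",
                [(103, 100.0), (112, 300.0), (114, 50.0)])

# Average payment per city: join on the key, then aggregate.
rows = con.execute(
    "SELECT AVG(p.amount), c.city "
    "FROM payments p "
    "INNER JOIN customers c ON p.customerNumber = c.customerNumber "
    "GROUP BY c.city ORDER BY c.city"
).fetchall()
print(rows)  # [(50.0, 'Melbourne'), (200.0, 'Nantes')]
con.close()
```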
Additional information: BI Software
While we as data scientists often prefer to visualise and analyse data programmatically, sometimes it is useful even in our field to use a more specialised GUI. These tools can be very advanced, and while they are mostly the domain of data analysts, we can still derive value from them.
Two examples of such tools are Microsoft PowerBI and Tableau.
There are several ways to obtain data from a database with R. If you are using RStudio, probably the most intuitive initial option is to use the GUI interface for database connectors shown here:
A second option is to connect to the database programmatically. An example of how to do that is shown below. As a first step we need to establish a connection. Here you need the DBI
package (installed separately), and you should also know the database type.
library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), path = ":dbname:")
We can then use this connection object to authenticate with our credentials:
con <- DBI::dbConnect(RMySQL::MySQL(),
host = "database.rstudio.com",
user = "hadley",
password = rstudioapi::askForPassword("Database password")
)
And finally, let’s use this to get data in a dataframe:
flights_db <- tbl(con, "flights")
From this step onwards we are ready to process the data.
Blended learning: Using dplyr to work with databases
For this example we took snippets of code from the official documentation. In order to really understand how this works, and to appreciate the flexibility of connecting to a database programmatically, go through the tutorial yourself (1 hour).
For Python there is no GUI option as nice as the one RStudio offers for R, but the process is not much different. As a first step we need to use the sqlalchemy
package and specify some parameters.
import pandas as pd
import sqlalchemy as sql
import pymssql
server = '100.10.10.10'
myQuery = '''SELECT first.Field1, second.Field2
FROM db1.schema.Table1 AS first
JOIN db2.schema.Table2 AS second
ON first.Id = second.FirstId'''
After this we create the engine
(pretty much the same as the con
object we used in R before).
engine = sql.create_engine('mssql+pymssql://{}'.format(server))
And finally we can use pandas
to read directly from the database into a dataframe.
df = pd.read_sql_query(myQuery, engine)
And we’re ready for analysis!
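The snippet above assumes a reachable MS SQL server at a made-up address. To see the same pattern end to end without one, here is a sketch against a throwaway SQLite database using only the standard library; pandas’ read_sql_query accepts this same query-and-connection pair and returns the result as a DataFrame.

```python
import sqlite3

# Stand-in for the remote server: a local in-memory database with one table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Table1 (Id INTEGER, Field1 TEXT)")
con.execute("INSERT INTO Table1 VALUES (1, 'hello')")

# Parameterized query, as you would pass to pd.read_sql_query(query, con, params=(1,)).
query = "SELECT Field1 FROM Table1 WHERE Id = ?"
rows = con.execute(query, (1,)).fetchall()
print(rows)  # [('hello',)]
con.close()
```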
Memory issues
You should keep in mind that any time you read from a remote database to your local machine, the data is held in RAM, so it is very easy to run out of resources while working.
A good idea is to first use an IDE (like DataGrip, which we covered before) to get your bearings, and then pull only the data you need instead of all of it.
Portfolio project
Obtain the NYC Motor Vehicle Collisions data (in .json
format) from here. Split the data into three tables for each separate BOROUGH
, and have the ability to query them with a join.
Project deliverables:
Portfolio Project
Take the Amazon reviews data from the NLP Module and store it in a PostgreSQL database using the AWS RDS service. As an additional component, access the data programmatically via either R or Python (depending on the language you chose in the NLP module).
Project deliverables:
Determine the average numbers of orders per customer (classicmodels
dataset).
What are the 5 least profitable cities in the data (classicmodels
dataset)?
Which products are ordered in the highest quantity (classicmodels
dataset)?
What other domain specific databases can you find?
List three reasons for using a database.
How can you use SQL from within R and Python?
What are the two general types of databases?
If you have taken the Fundamentals of Command-line Tools module, could you mention one issue that both CLI tools and SQL solve?
What is the best data type to store a large chunk of text?
What are the two ways you can store Date
data to be machine readable?
What are NoSQL databases suitable for?