We'll be going through six questions covering topics like query performance, joins, and SQL injection. They'll refer to the same database for cakes, customers, and orders at a bakery. Here's the schema:
We'll be using MySQL for consistency.
Want to get set up with sample data? Here's how to get four cakes, a million customers, and a million orders:
In your terminal, download our script and start up MySQL:
Then run our script to set up the BAKERY database and insert data:
If you want to come back to the data again and you already downloaded and ran the script:
- In terminal, start up MySQL: mysql.server start && mysql -u root
- In the MySQL shell, use the bakery database: USE BAKERY;
Alright, let's get started.
How can we make this query faster?
We want the order ID of every order in January and February, 2017. This is the query we've got so far:
First, we could consider adding an index on pickup_date. This'll make our inserts less efficient, but will drastically make this query faster.
Bringing up the tradeoffs of your suggestions will impress your interviewer. Excellent answers always consider the bigger picture—side effects or downsides, and any alternative solutions.
Let’s try it:
Whoa! What happened? We added an index but didn't get an improvement.
This is because we're using the DATEDIFF function in the WHERE clause. Functions evaluate for every row in a table without using the index!
Easy fix here—we can just compare the pickup date and March 1 directly:
There we go, about 0.3 seconds down to 0.07 seconds.
How can we get the nth highest number of orders a single customer has?
We made a view of the number of orders each customer has:
So for example, Nancy has 3 orders:
Let’s start by solving a simpler problem—how would we just get the highest number of orders?
Pretty trivial since we have a built-in function:
Now can we adapt this to get the second highest number of orders?
Well, if we can get the highest order count, we can get "the highest order count that's not the highest order count":
This works, but it’s pretty slow. Any ideas for a faster query?
We could also sort the order_counts. Then we don’t have to scan the whole table twice. We just jump down to the second row:
If LIMIT has one argument, that argument is the number of rows to return starting with the first row. With 2 arguments, the first argument is the row offset and the second argument is the number of rows to return. So in our query with “1, 1” we're saying "starting one row down from the top, give us one row."
Now, how do we get the nth highest order count?
Easy—we just change the row offset in our LIMIT clause:
As a bonus, how would you do this with pure SQL, not relying on MySQLs handy LIMIT clause? It’s tricky!
What ways can we use wildcard characters in LIKE clauses?
LIKE lets you use two wildcard characters, "%" and "_".
"%" matches any amount of characters (including zero characters). So if we want all our customers whose last name starts with "A", we could query:
(We could use the BINARY keyword if we wanted a case-sensitive comparison. It would treat the string we’re comparing as a binary string, so we’d compare bytes instead of characters.)
"_" matches exactly one character. If we want all our customers who live on the 200 block of Flatley Avenue in Dover, we could query:
And some databases (like SQL Server, but not MySQL or PostgreSQL) support sets or ranges of characters. So we could get every customer whose city starts with either "m" or "d":
Or whose last name starts with any character in the range “a” through “m” (“a”, “b”, “c”...“k”, “l”, “m”):
Now how can we make this query faster?
We're mailing a promotion to all our customers named Sam who live in Dover. Since some customers go by names that aren't exactly Sam, like Samuel or Sammy, we use names like Sam. Here's how we find them:
That's pretty slow. How can we speed it up?
First, do we need to get the city and zip code for every customer? The search is constructed using the city Dover, so we know the city. And we know that the zip code for Dover is 33220. If we can complete the addresses efficiently somewhere else in our code, there's no reason to get that information from the database for every result.
A little better, but only a little.
Let's look at that wildcard % before "sam." Wildcards at the beginning of comparisons can slow down performance because instead of just looking at names that start with "sam" the query has to look at every character in every first name.
Do we really need the wildcard % before sam? Should our customers with "sam" in their name but not at the start of their name, like Rosamond, be included in a Sam Promotion?
Probably not. Let's just try removing the % at the beginning and adding an index on first_name:
0.42 seconds down to 0.02 seconds!
This is a huge improvement. But—these changes are a big deal because we’re changing functionality. This isn't just faster, it's different. Some customers won't be getting a promotion in the mail now. The decision of who's included in the promotion would probably be made independent of database performance. But it's always a good idea to look out for wildcard characters at the beginning of a pattern.
What are all the SQL joins?
First, let’s talk about the keywords "inner" and "outer." They’re optional—INNER JOIN is the same as JOIN, and LEFT OUTER JOIN is the same as LEFT JOIN. These keywords are added for clarity because they make the joins easier to understand conceptually. Some developers leave them out, arguing there's no reason to have extra nonfunctional words in a database query. The most important thing is to be consistent. We’ll use them.
Inner joins give only the rows where all the joined tables have related data. If we inner join our customers and orders, we'll get all the related customers and orders. We won't get any customers without orders or any orders without customers.
If we wanted the cake flavor, not just the cake ID, we could also join the cake table:
Left outer joins give all the rows from the first table, but only related rows in the next table. So if we run a left outer join on customers and orders, we'll get all the customers, and their orders if they have any.
Right outer joins include any related rows in the first table, and all the rows in the next table. Right outer joining our customers and orders would give the customer if there is one, and then every order.
In our schema, customer_id isn’t NOT NULL on orders. This may be seem unintuitive, but maybe we don’t require customers to register with us to place an order, or orders can be associated with other models like restaurant or vendor. In any case, with our schema, we can have orders without customers.
Right outer joins give the same result as left outer joins with the order of the tables switched:
Full outer joins take all the records from every table. Related data are combined like the other joins, but no rows from any table are left out. For customers and orders, we'll get all the related customers and orders, and all the customers without orders, and all the orders without customers.
The standard SQL syntax is:
But MySQL doesn't support full outer joins! No problem, we can get the same result with a UNION of left and right outer joins:
Using UNION or UNION ALL with this strategy generally emulates a full outer join. But things get complicated for some schemas, like if a column in the ON clause isn't NOT NULL.
Cross joins give every row from the first table paired with every row in the next table, ignoring any relationship. With customers and orders, we'd get every customer paired with every order. Cross joins are sometimes called Cartesian joins because they return the cartesian product of data sets—every combination of elements in every set.
This isn't used often because the results aren't usually useful. But sometimes you might actually need every combination of the rows in your tables, or you might need a large table for performance testing. If you cross join 2 tables with 10,000 rows each, you get a table with 100,000,000 rows!
Self joins refer to any join that joins data in the same table. For example, some of our customers were referred to our bakery by other customers. We could do a left outer join to get every customer and their referrer if they have one:
What’s an example of SQL injection and how can we prevent it?
SQL injection is when a hacker gains access to our database because we used their malicious user input to build a dynamic SQL query.
Let's say we have an input field that takes a phone number, and we use it to build this SQL query:
We’re expecting something like "8015550198" which would neatly build:
But what if a user enters "1' OR 1=1;--"?
Then we’d have:
Which will return the data for every customer because the WHERE clause will always evaluate to true! (1 always equals 1, and the "--" comments out the last single quotation mark.)
With the right input and queries, SQL injection can let hackers create, read, update and destroy data.
So to prevent SQL injection, we'll need to look at how we build and run our SQL queries. And we can think about some smart technical design in our application.
Here are five ways to protect ourselves:
1. Use stored procedures or prepared SQL statements. So do not build dynamic SQL. This is the most effective way to prevent SQL injection.
For example, we could build a prepared statement:
Or we could build a stored procedure get_customer_from_phone with a string parameter input_phone:
which we could call like this:
2. Validate the type and pattern of input. If you know you're looking for specific data—like an ID, name, or email address—validate any user input based on type, length, or other attributes.
For example, here’s one way we could validate a phone number:
3. Escape special characters like quotes. This method is a quick and easy way to reduce the chances of SQL injection, but it's not fully effective.
For example, let's say we want to escape backslashes, single and double quotes, new lines (\n and \r), and null (\0):
(See table 10.1 in the MySQL String Literals docs for a full list of special character escape sequences.)
In Python, you could also use the _mysql module's escape_string function (which additionally escapes the SQL wildcard characters "%" and "_"):
When we escape our input, now our query will be:
which isn't a valid query.
4. Limit database privileges. Application accounts that connect to the database should have as few privileges as possible. It's unlikely, for example, that your application will ever have to delete a table. So don't allow it.
5. Don't display database error messages to users. Error messages contain information that could tell hackers a lot of information about your data. Best practise is to give generic database error messages to users, and log detailed errors where developers can access them. Even better, send an alert to the dev team when there’s an error.