tutorials: /tutorials/learn-sql
This data as json
path | title | body |
---|---|---|
/tutorials/learn-sql | Learn SQL with Datasette | Tutorials > Learn SQL with Datasette Learn SQL with Datasette This tutorial follows the tutorial on exploring data with Datasette and shows how you can use Datasette to start learning to write custom SQL queries of your own. We will be using the same example database as that tutorial: https://congress-legislators.datasettes.com/legislators This database contains information about US Presidents, Vice Presidents and Members of Congress from 1789 to today. You can follow the previous tutorial to become familiar with the data. SQL and SQLite Datasette is software that runs on top of a SQL database. There are many different SQL database systems, such as MySQL, PostgreSQL, Microsoft SQL Server and Oracle. The database used by Datasette is called SQLite. You may not realize it, but you already use SQLite every day: it is built into many popular applications, including Google Chrome and Firefox, and runs on laptops, iPhones, Android phones and all sorts of other smaller devices. SQL stands for Structured Query Language - it is a text language for running queries against a database. Every database implements a slightly different dialect of SQL - for this tutorial I will try to keep to the subset of SQLite's dialect that is most likely to work across other databases as well. View and edit SQL Every table page in Datasette - such as this one - includes a "View and edit SQL" link that looks like this: Click that link to see the SQL query that was used for the current page - including any filters that have been applied, in this case the filter for bio_gender = "F". You can then edit that query to make it do something else! Exercises Start with that example, then: Click "Format SQL" to tidy this up and make it easier to read and edit Modify that query to return only the id and name columns Get it to order by name instead of sorting by id Change the p0 value to M to return male instead of female legislators Try replacing the list of columns with select * from - a shortcut for all columns The structure of a basic select query Here's the formatted version of the above query: select id, name, id_bioguide, id_govtrack, id_icpsr, id_wikipedia, id_wikidata, id_google_entity_id, name_first, name_last, bio_birthday, bio_gender, id_house_history, name_middle, name_nickname, id_ballotpedia, name_suffix, id_bioguide_previous, id_house_history_alternate, other_names, id_thomas, id_cspan, id_votesmart, id_lis, name_official_full, id_opensecrets, id_fec, id_maplight, leadership_roles, family from legislators where "bio_gender" = :p0 order by id limit 101 select The select section specifies which columns you would like to return. Each column name is separated by a comma - but if you have a comma following the last column you you will get an error message. near "from": syntax error from The from section specifies which table the records should be selected from - here we want the legislators table. where The where section adds filter conditions. These can be combined using and, for example this query will select just legislators who are male and have a Jr. name suffix: where "bio_gender" = 'M' and "name_suffix" = 'Jr.' This section is optional - if you do not include a where clause you will get back every row in the table. order by The optional order by clause specifies the order you woud like the rows to be returned in. This would order them alphabetically by name: order by name Or add desc to reverse the order. This query returns the youngest legislators: select * from legislators order by bio_birthday desc limit The limit 101 clause limits the query to returning just the first 101 results. In most SQL databases omitting this will cause all results will be returned - but Datasette applies an additional limit of 1,000 (example here) to prevent large queries from causing performance issues. Named parameters In the previous tutorial we used filters to list presidental terms that occurred in the 1800s, by filtering on rows where the start column began with the string 18 and the type column equals prez. Here's that example as a filtered table. Clicking that "View and edit SQL" link (and then clicking "Format SQL") shows this query: select rowid, type, start, [end], party, how, executive_id from executive_terms where "start" like :p0 and "type" = :p1 The :p0 and :p1 fields here correspond to "named parameters" - they provide a safe way to pass values to a SQL query, without having to worry about SQL injection. These field names are extracted from the query. If you change the where clause to look like this: where "start" like :century and "type" = :type Then fields called century and type will be displayed instead of p0 and p1. Since they are form fields, you can change those values too. Try changing prez to viceprez to see the vice presidential terms for the 1800s instead. If you weren't using named parameters, the where clause of the query could look like this instead: where "start" like '18%' and "type" = 'prez' String values like '18%' and 'prez' must be enclosed in single or double quotes (single quotes are preferred), if you are not using the :p0 syntax. The double quotes around the column names here are optional - they are only required if the column has a name that might clash with an existing SQL keyword, such as select or where - or if the column name contains a space. So the following where clause would work the same way: where start like '18%' and type = 'prez' SQL LIKE queries To find rows where the start column begins with the string 18 we use this where filter: where "start" like '18%' The SQL like operator applies wildcards to strings - % means "match anything" and _ (underscore) means "match a single character". Exercise Use a LIKE filter to find presidential terms that occurred in the 90s decade of any century - you'll need to use both _ and % for this. (solution) SQL Joins The executive_terms table has a column called executive_id which displays both a numeric ID and a link containing the name of the executive: But... when you click the "View and edit SQL" link the resulting query returns just the ID, not the name. A SQL join can be used to combine the data from multiple tables. Here's a query that uses a join to show the name pulled from the executives table: select executive_terms.type, executive_terms.start, executive_terms.[end], executive_terms.party, executive_terms.how, executives.name from executive_terms join executives on executive_terms.executive_id = executives.id Try that here. Since there are now two tables involved in the query, it's a good idea to include explicit table names as part of the select clause: select executive_terms.type, executive_terms.start, executive_terms.[end], executive_terms.party, executive_terms.how, executives.name If a column exists in only one of the two tables you can use its name without specifying the table, but this can quickly get confusing so it's better to always use the table names when you are executing a join. The from clause is where the join is defined. This describes how the two tables should be combined together: from executive_terms join executives on executive_terms.executive_id = executives.id We are joining executive_terms to executives, using the fact that the executive_id column in executive_terms contains values from the id column in executives. This is why IDs and foreign keys are important concepts to understand in SQL databases! This kind of join is also known as an "inner join" - it is the most commonly used join. Other join types include outer joins and full joins, but those are beyond the scope of this tutorial. GROUP BY / COUNT A common operation in SQL is to ask for a count of the most popular values in a column. Datasette exposes this capability in its faceting feature. Under the hood, that feature works by executing a group by / count query. The following query answers the question: which party has had the most presidental terms? select party, count(*) from executive_terms where type = 'prez' group by party Try that query here. The select clause asks for the party column and the result of running a count(*). count() is a SQL aggregate function, which operates on a group of results. So we need to define a group. The group by party clause at the end creates those groups. The combination of the two produces the desired result. You can add this order by clause at the end to see the groups sorted from highest to lowest: order by count(*) desc Here's the result. WHERE ... IN SQL queries can be nested together in a number of interesting ways. One of the most useful for ad-hoc data analysis is the where column in (select ...) pattern. Let's build a query to see everyone who has been president and has also been vice president. We'll start with a query to return the names of every vice president. This requires a join against executives, because the executive_terms table includes the presidential and vice presidential terms, but doesn't include the names of those individuals. select executives.name from executive_terms join executives on executive_terms.executive_id = executives.id where type = 'viceprez' This query returns 61 rows, including some duplicate rows for individuals who served more than one term as VP. We could use select distinct to remove those duplicates if we wanted to. If we want to see everyone who has been a president and ALSO been a vice-president, we can combine two queries together like this: select distinct executives.name from executive_terms join executives on executive_terms.executive_id = executives.id where type = 'prez' and name in ( select executives.name from executive_terms join executives on executive_terms.executive_id = executives.id where type = 'viceprez' ) This query returns 15 names, starting with John Adams and finishing with Joseph Biden. ⚠️ An earlier version of this example revealed a subtle bug: there were two separate individuals in the executives table with the same name, George Bush and George Bush! So even though the query appeared to return the correct results it actually contained a bug. This is a useful demonstration of how important it is to work with unique, de-duplicated identifiers where-ever possible, rather than assuming that things like names are unique. Common Table Expressions This is a more advanced SQL technique that I find myself using every day: it can make complex SQL queries much easier to write and understand. Common Table Expressions, or CTEs, allow you to define a temporary alias for a select that lasts for the duration of the current query. You can then treat it as if it was a regular table. Here's an example that creates two CTEs, one called presidents and one called vice_presidents, and then uses them to answer the earlier question about presidents who had also served as vice presidents: with presidents as ( select executives.name from executive_terms join executives on executive_terms.executive_id = executives.id where executive_terms.type = 'prez' ), vice_presidents as ( select executives.name from executive_terms join executives on executive_terms.executive_id = executives.id where executive_terms.type = 'viceprez' ) select distinct name from presidents where name in vice_presidents Try this CTE query here. The two CTEs are defined using with alias_name as (select ...), second_alias as (select ...) - then the query that returns the final results is added on at the end. Creating bookmarkable apps Since every page on Datasette can be linked to (see Sharing links) and named parameters automatically add form input to the query page, you can combine these two features to create bookmarkable apps. This is particularly useful for collaborating with others who don't yet know SQL. You can write SQL queries that solve problems for them, add parameters that they can customize and then send them the entire custom application as a link that they can bookmark. Let's create an application that returns the president and vice president for any given date. Here's the SQL query we will be using: select executives.name, executive_terms.type, executive_terms.start, executive_terms.end, executive_terms.party from executive_terms join executives on executive_terms.executive_id = executives.id where start <= :date and end > :date This query takes a :date parameter, which should be of the format yyyy-mm-dd - for example 2016-01-01. It joins against the executives table to get their name, and then filters to rows where the term start is less than or equal to the specified date, and the end is greater than that date. We use <= for one of these to ensure there are no gaps or overlaps. If we execute the query without a date it will return 0 results, but give us a form field to enter the date: Entering a date - for example 2016-01-01 - returns the president and vice president for that date: We've built an application! We can send this link to anyone right now and they'll be able to run that same query. The SQL at the top of the page could feel a little intimidating though. That's what the "hide" link is for - clicking on that hides the SQL query, providing a new link that you can share that won't cause the user to scroll down past the SQL in order to interact with the query. One last note: the application we have built also doubles up as an API. Add .json to the path portion of the URL, or click the .json link, to get back the data as JSON. Add &_shape=array to the URL to get back a more compact form of JSON, ideal for integrating with other applications. Next steps Fully mastering SQL can take years. This tutorial has attempted to cover SQL basics and introduce some more advanced techniques to help manage larger queries, but there's plenty more to learn. The SQLite SELECT documentation offers a comprehensive reference to SQL as understood by SQLite, but the best way to learn is to keep on trying new things against different data, looking things up as you go along. |