Datasette 0.50, git scraping, extracting columns

The first edition of Datasette Weekly

Hello subscribers! Welcome to the first edition of the Datasette Weekly newsletter.

Every week I’ll be sharing news and tips from the ecosystem of tools that is growing around Datasette and SQLite.

Datasette 0.50

Datasette 0.50 came out this week, closely followed by 0.50.1 and 0.50.2 to squash some tiny bugs that sneaked into the release.

The biggest new feature in this release is something I’m calling the column actions menu. A little cog icon in the header of each table column provides options for sorting or faceting by that column - and there’s space for new features to be added later, including potentially features defined by plugins.

You can try the menu out at https://latest.datasette.io/fixtures/facetable - latest.datasette.io is a Datasette instance which is deployed on every commit to the main branch in the repository, so it’s always available to try out new features.

Animated demo of the columns action menu, showing it used to sort a column and select two other columns for faceting

The other big feature in 0.50 is aimed at plugin authors. datasette.client is a new internal API for making requests to Datasette’s JSON HTTP API from within plugins themselves. This means that anything exposed in the external API can now be used by plugins as well. datasette.client is powered by the excellent HTTPX async Python HTTP library, taking advantage of its ability to simulate requests against an ASGI application without making a real network call. The new internal API is documented here.
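The underlying trick - serving an HTTP-shaped request to an ASGI application without any network round-trip - can be sketched in plain Python. This is a hand-rolled illustration of the mechanism, not HTTPX’s or Datasette’s actual code, and the app and helper names are invented:

```python
import asyncio
import json

async def app(scope, receive, send):
    # A minimal ASGI application: answers any request with a JSON body
    assert scope["type"] == "http"
    body = json.dumps({"path": scope["path"]}).encode()
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"application/json")]})
    await send({"type": "http.response.body", "body": body})

async def simulate_request(app, path):
    # Call the ASGI app directly - no socket, no server, no network
    scope = {"type": "http", "method": "GET", "path": path, "headers": []}
    messages = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        messages.append(message)

    await app(scope, receive, send)
    status = messages[0]["status"]
    body = b"".join(m.get("body", b"") for m in messages[1:])
    return status, json.loads(body)

status, data = asyncio.run(simulate_request(app, "/fixtures/facetable.json"))
print(status, data)  # 200 {'path': '/fixtures/facetable.json'}
```

Because the “request” is just an in-process function call, a plugin can hit the full JSON API without paying for a real HTTP round-trip.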

Deploying Datasette

The datasette publish command has long been the recommended way of publishing a Datasette instance on the internet - a single command can upload a database to Heroku, Cloud Run, Vercel or fly.io and configure it to be served by Datasette.

Sometimes this isn’t enough. The providers supported by datasette publish are all read-only: great for hosting static data but no use if you want to make changes to it without redeploying (e.g. using the datasette-upload-csvs plugin). Datasette 0.50 includes a new documentation section on Deploying Datasette with notes on deploying on Linux using systemd - I’ve used this myself for a few projects but I’d love to hear from other people who have tried this out and have suggestions for how I can improve the recipe.
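For reference, the unit file in that recipe is shaped roughly like this - the user, working directory and port are placeholders, so check the documentation for the current version:

```ini
[Unit]
Description=Datasette
After=network.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu/datasette-root
ExecStart=datasette serve . -h 127.0.0.1 -p 8000
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Saved as /etc/systemd/system/datasette.service, it can be started with sudo systemctl daemon-reload followed by sudo systemctl start datasette, with a proxy such as nginx forwarding traffic to that port.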

Last week DigitalOcean announced App Platform, their new Heroku-style PaaS hosting platform. I figured out how to run Datasette on it and used that to inform a new section of the documentation on Deploying using buildpacks - the mechanism originally invented by Heroku which has now been adopted by various other providers. My simonw/buildpack-datasette-demo GitHub repo shows a simple example of how this style of deployment can be used.
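The buildpack approach needs almost no configuration. The demo repo boils down to two short files sitting alongside the SQLite database - sketched here from memory, so check the repo for the exact contents:

```
requirements.txt:
    datasette

Procfile:
    web: datasette . -h 0.0.0.0 -p $PORT --cors
```

The buildpack sees requirements.txt, installs Datasette, then runs the Procfile’s web command to serve every database file in the current directory.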

Git scraping

Git scraping is a Datasette-adjacent technique I’ve been experimenting with over the past few years that involves scraping data sources into a Git repository, in order to track changes to those sources over time.
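The moving parts are small enough to sketch as a single GitHub Actions workflow. The URL, schedule and filename below are illustrative rather than copied from any one of my repos - the pattern is: fetch on a timer, commit only if something changed:

```yaml
name: Scrape latest data

on:
  push:
  workflow_dispatch:
  schedule:
    - cron: '6,26,46 * * * *'  # three times an hour

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Fetch latest data
        run: |-
          curl https://example.com/api/incidents | jq . > incidents.json
      - name: Commit and push if it changed
        run: |-
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          timestamp=$(date -u)
          git commit -m "Latest data: ${timestamp}" || exit 0
          git push
```

Piping through jq pretty-prints the JSON, which produces much more readable diffs; the || exit 0 means a run with no changes ends quietly instead of failing.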

I finally wrote an explanation of it on my blog: Git scraping: track changes over time by scraping to a Git repository. This led to an interesting discussion on Hacker News where a number of people shared their own experiences using this pattern.

Here’s a diff I captured illustrating the ongoing Zogg Fires in Northern California (scraped from www.fire.ca.gov/incidents into my simonw/ca-fires-history repo):

Screenshot of a diff against the Zogg Fires, showing personnel involved dropping from 968 to 798, engines dropping from 82 to 59, water tenders dropping from 31 to 27 and percent contained increasing from 90 to 92.

Having captured data in this way, the next challenge is to visualize and analyze that. In the past I’ve done this using SQLite and Datasette, by writing a script to walk through the repository history and create a SQLite database of the various changes. My best example of doing that is described in Tracking PG&E outages by scraping to a git repo from October last year.
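A minimal sketch of that pattern: iterate over historic versions of a scraped JSON file - supplied here as in-memory snapshots, though in practice you would read each one out of Git with something like git show <hash>:incidents.json - and load them into SQLite with the commit metadata attached. The table and column names here are invented for illustration:

```python
import json
import sqlite3

def load_snapshots(db_path, snapshots):
    # snapshots: iterable of (commit_hash, iso_timestamp, json_text),
    # oldest first - we store one row per incident per snapshot
    db = sqlite3.connect(db_path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS incident_versions (
               commit_hash TEXT, committed_at TEXT,
               incident_name TEXT, acres_burned INTEGER
           )"""
    )
    for commit_hash, committed_at, json_text in snapshots:
        for incident in json.loads(json_text):
            db.execute(
                "INSERT INTO incident_versions VALUES (?, ?, ?, ?)",
                (commit_hash, committed_at,
                 incident["name"], incident["acres_burned"]),
            )
    db.commit()
    return db

db = load_snapshots(":memory:", [
    ("abc123", "2020-09-28T10:00:00", '[{"name": "Zogg Fire", "acres_burned": 1000}]'),
    ("def456", "2020-09-29T10:00:00", '[{"name": "Zogg Fire", "acres_burned": 15000}]'),
])
print(db.execute("SELECT COUNT(*) FROM incident_versions").fetchone()[0])  # 2
```

Point Datasette at the resulting database and the full history of each incident becomes sortable, facetable and graphable.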

Extracting columns with sqlite-utils extract

sqlite-utils is my combination Python library and CLI utility for creating and manipulating SQLite databases. It’s my attempt at answering the question “how do I get this into SQLite?” for as many different shapes of data as possible.

sqlite-utils 2.20, released last month, added a new tool for a common problem I face when working with data that started out as a CSV file: column extraction.

It’s common for data to have duplicated columns - a Department Code of POL always accompanied by a Department of Police, for example. In an ideal, normalized world we would extract these columns out into a separate database table.
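The essence of what sqlite-utils extract does can be sketched with plain sqlite3: build a lookup table of the distinct values, then point each row at it with a foreign key. The column names follow the example above, but this is a conceptual sketch - the real command also handles renames, indexes and rebuilding the table to drop the old columns:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute('CREATE TABLE compensation (id INTEGER PRIMARY KEY, '
           '"Department Code" TEXT, "Department" TEXT, salary REAL)')
db.executemany(
    'INSERT INTO compensation ("Department Code", "Department", salary) VALUES (?, ?, ?)',
    [("POL", "Police", 80000.0), ("POL", "Police", 90000.0), ("FIR", "Fire", 85000.0)],
)

# 1. Create the lookup table, populated with the distinct pairs
db.execute('CREATE TABLE departments (id INTEGER PRIMARY KEY, '
           '"Department Code" TEXT, "Department" TEXT)')
db.execute('INSERT INTO departments ("Department Code", "Department") '
           'SELECT DISTINCT "Department Code", "Department" FROM compensation')

# 2. Point each row at the lookup table instead of duplicating the values
db.execute('ALTER TABLE compensation ADD COLUMN department_id INTEGER '
           'REFERENCES departments(id)')
db.execute('''UPDATE compensation SET department_id = (
                  SELECT id FROM departments
                  WHERE departments."Department Code" = compensation."Department Code")''')

# 3. The real tool then drops the duplicated columns (via a table rebuild,
#    since older SQLite versions cannot DROP COLUMN)
print(db.execute("SELECT salary, department_id FROM compensation").fetchall())
```

With the plugin-free Datasette UI, the resulting department_id column renders as a link through to the extracted departments table.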

https://static.simonwillison.net/static/2020/refactored-one-column.png

That’s what sqlite-utils extract lets you do. I wrote a tutorial showing how to use it to refactor the Employee Compensation CSV file for the City of San Francisco into something a bit more relational:

Refactoring databases with sqlite-utils extract - 23rd September 2020

See also the project documentation on sqlite-utils extract.

Plugin of the week: datasette-graphql

Datasette has more than 50 plugins now, so I’m planning to highlight a plugin in the newsletter every week. This week: datasette-graphql.

I started working on this plugin partly to learn more about GraphQL and partly to demonstrate that it wasn’t a great fit for Datasette - I’ve long thought that SQL makes a more interesting API language than GraphQL, and Datasette supports SQL out of the box.

I’ve since changed my mind: GraphQL running on top of Datasette is actually a really pleasant way of interacting with data, thanks mainly to the brilliant GraphiQL API explorer tool which adds auto-complete and introspected documentation.

Here’s an example GraphQL query running against a demo Datasette instance with data I pulled in from GitHub using my github-to-sqlite tool. The query searches for repos matching “datasette” with at least 10 stargazers, then for each repository returns the first 5 issue titles, sorted by most recently created.
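For a flavor of what those queries look like, here is a rough paraphrase of that search. The field and argument names follow datasette-graphql’s pattern of exposing each table as a field, but treat this as a sketch rather than a copy-and-paste query - the linked demo has the exact version:

```graphql
{
  repos(search: "datasette", filter: [{stargazers_count: {gte: 10}}]) {
    nodes {
      full_name
      issues_list(first: 5, sort_desc: created_at) {
        nodes {
          title
        }
      }
    }
  }
}
```

The nested issues_list block is the part plain SQL makes awkward: one query returns each matching repo together with its own five most recent issues.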

An interesting challenge with GraphQL is that, because it supports nested queries, it can lead to an enormous number of underlying SQL queries - a classic case of the N+1 queries problem.

Because SQLite runs as an in-process database library, without the overhead of network requests to a server, it’s actually an extremely good fit for implementing GraphQL - as explained by the classic essay Many Small Queries Are Efficient In SQLite.

I’ve started deploying datasette-graphql on all sorts of projects now.

I also used it to help Natalie Downe build Rocky Beaches, a new website providing tips on when and where to go tidepooling around the SF Bay Area. That site takes advantage of the graphql() template function exposed by the plugin, which allows you to use GraphQL server-side to make data available to custom Datasette templates, as seen in the row-data-places.html template.

You can learn more about the plugin in the datasette-graphql README. Installation is as easy as running “datasette install datasette-graphql”.

Feedback, suggestions and more

This is my first time sending out this newsletter, so I’m really keen to hear your feedback. You can contact me at swillison@gmail.com, @simonw on Twitter or by posting on Datasette’s GitHub Discussions forum. I’d love to hear from you!