Datasette 0.54, and querying a database by drawing on a map

Datasette Weekly(ish) volume 6

Since the last edition of this newsletter I’ve made 33 releases across 13 different Datasette ecosystem projects. Here are some of the highlights.

datasette-leaflet-freedraw

Datasette supports parameterized SQL queries: if you compose a SQL query with a placeholder such as “select * from counties where population > :number”, Datasette will spot the :number parameter and turn it into a form field on the query page.

I’ve been planning to add custom form field widgets for a while. datasette-leaflet-freedraw is a new plugin which starts exploring this concept by enhancing any fields ending in “freedraw” with a map widget that lets users draw a shape, which will be converted to a GeoJSON MultiPolygon ready to be sent back to Datasette.

You can read more about this plugin, and try an interactive demo, in Drawing shapes on a map to query a SpatiaLite database.
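
Here’s a minimal sketch of how a canned query can take advantage of the widget. The parks database, table and SpatiaLite SQL are hypothetical - the key detail is that the parameter name ends in “freedraw”, which is what triggers the map widget:

import json

# Hypothetical metadata.json for a SpatiaLite parks database: because the
# parameter is called :freedraw, datasette-leaflet-freedraw swaps in a
# draw-on-a-map widget and submits the drawn shape as GeoJSON
metadata = {
    "databases": {
        "parks": {
            "queries": {
                "within_shape": {
                    "sql": (
                        "select name from parks where "
                        "Intersects(GeomFromGeoJSON(:freedraw), geometry) = 1"
                    )
                }
            }
        }
    }
}

with open("metadata.json", "w") as fp:
    json.dump(metadata, fp, indent=4)

Start Datasette with datasette parks.db -m metadata.json and the within_shape query page should display the map.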

Datasette 0.54

I released Datasette 0.54 yesterday. You can read the annotated release notes on my blog, but the two big new features in this release are the _internal SQLite in-memory database, which stores details of all connected databases and tables, and support for JavaScript modules in plugins and additional scripts.

The _internal database

I want Datasette to be able to handle dozens of attached database files containing potentially thousands of tables. To support this, the Datasette homepage will need to get a lot smarter: it needs to provide pagination and search and filtering against the list of tables themselves.

Datasette is already pretty good at providing paginated search and filtering! The new _internal database is the first step to providing this feature: it's an in-memory database which is automatically updated with details of all of the attached databases, plus their columns, foreign keys and indexes.

As part of building this I added support for shared, named in-memory databases - so plugins can now create an in-memory database that will persist for the lifetime of the Datasette process and can be accessed from multiple connections.
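
Here’s a rough sketch of what that looks like from a plugin, based on the internals documentation - the “statistics” database name and the counts table are invented for illustration:

from datasette import hookimpl
from datasette.database import Database

@hookimpl
def startup(datasette):
    async def inner():
        # Create a shared in-memory database that persists for the lifetime
        # of the Datasette process ("statistics" is a made-up name)
        db = Database(datasette, memory_name="statistics")
        datasette.add_database(db)
        await db.execute_write(
            "create table if not exists counts (key text, value integer)",
            block=True,
        )
    return inner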

You can browse a demo of the _internal database by signing into the demo as root and then accessing latest.datasette.io/_internal

JavaScript modules

JavaScript modules are supported by modern browsers and allow you to use the “import” and “export” keywords from ECMAScript 2015 without needing a build step.

They only work in <script> blocks with the type="module" attribute, so Datasette now has the ability to include scripts with that attribute, for both plugins and extra scripts loaded through configuration.
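
For plugin authors, opting in looks something like this sketch - the URL is a stand-in, but the extra_js_urls() hook really can now return dictionaries with "module": True:

from datasette import hookimpl

@hookimpl
def extra_js_urls():
    return [
        {
            # Hypothetical URL; scripts flagged as modules are included
            # using <script type="module">
            "url": "https://example.com/my-plugin.js",
            "module": True,
        }
    ]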

A big benefit of native modules is that they help avoid accidentally loading the same script twice. There are three Datasette plugins that use Leaflet maps now - datasette-cluster-map, datasette-leaflet-geojson and the new datasette-leaflet-freedraw - so I've started working on a new base plugin called datasette-leaflet which makes the latest Leaflet available for all of them to use via the modules system.

datasette-css-properties and datasette-export-notebook

Two more new plugins: these use Datasette's register_output_renderer hook to add extra options for getting data back out of Datasette.
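
As a reminder of how that hook works, here’s a minimal sketch of a renderer adding a hypothetical .txt extension - the real plugins are more sophisticated than this:

from datasette import hookimpl
from datasette.utils.asgi import Response

def render_txt(rows, columns):
    # Render the query results as tab-separated plain text
    lines = ["\t".join(str(row[column]) for column in columns) for row in rows]
    return Response.text("\n".join(lines))

@hookimpl
def register_output_renderer(datasette):
    # Adds .txt as an output option alongside .json and .csv
    return {"extension": "txt", "render": render_txt}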

I wrote about datasette-css-properties here - it's a very weird plugin! It was inspired by Custom Properties as State by Chris Coyier, which proposed server APIs that return CSS files defining custom properties that can be used to style pages.

I'm not sure if this is a good idea, but the joy of plugins is that you can try out ideas like this without risking damage to the integrity of the core project!

datasette-export-notebook is more obviously useful: it adds instructions for exporting data from Datasette into Jupyter and Observable notebooks. You can try that out here against the Covid-19 Datasette.

Datasette Office Hours

I've now had 17 conversations with people about Datasette in office hours sessions. It's been absolutely wonderful! Talking to people about what they're building, or what they might build, is incredibly valuable for helping me decide what to work on next, and it's also just really fun talking to people about the project.

I plan to continue running office hours sessions on Friday for the foreseeable future. If you'd like to chat about the project you can grab a slot with me here.

Official project website for Datasette, building a search engine with Dogsheep Beta, sqlite-utils analyze-tables

Datasette Weekly(ish) volume 5

Datasette finally has an official project website!

datasette.io is the new home for Datasette. The site features news about the project, explains the problems that Datasette and its ecosystem of tools try to solve and gathers together a complete directory of plugins and tools for use with Datasette.

The site itself is built using Datasette - it’s actually a heavily themed installation of Datasette, making extensive use of the datasette-template-sql plugin and custom pages to implement the different sections of the site. The code for the site is available on GitHub at simonw/datasette.io.

I wrote more about how the site works in datasette.io, an official project website for Datasette.

Adding search to datasette.io

datasette.io also includes a search engine for the project.

It provides faceted search across:

  • Every section of the latest documentation (416 total)

  • 48 plugin READMEs

  • 22 tool READMEs

  • 64 news items posted on the Datasette website

  • 212 items from my blog

  • Release notes from 557 package releases

That’s 1,319 items total! The overall Datasette project continues to grow in all kinds of directions, so having a single search engine to tie it all together felt like a useful addition.

I built the search engine using Dogsheep Beta, the cross-table SQLite search engine I originally built for my Dogsheep Personal Analytics project.

You can read more about how the search engine works, including a breakdown of the YAML configuration used to create the unified search index, in Building a search engine for datasette.io.

sqlite-utils analyze-tables

I released sqlite-utils 3.1 with a useful feature for getting to grips with new data: the “sqlite-utils analyze-tables” command.

Run this against any SQLite database and it will spit out a summary of every column in every table, showing you common and least-common values, the number of distinct items and how many rows contain nulls or blanks.

The analyze-tables documentation has more. In addition to outputting the summary to the command line, you can add the “--save” option to store the generated summary in a SQLite table called _analyze_tables_ - which means you can then further examine it using Datasette.
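
The same summaries are available from the sqlite-utils Python API as table.analyze_column() - here’s a rough sketch, assuming a github.db file created with github-to-sqlite:

import sqlite_utils

db = sqlite_utils.Database("github.db")
for table_name in db.table_names():
    table = db[table_name]
    for column in table.columns_dict:
        # ColumnDetails includes distinct/null/blank counts plus the
        # most-common and least-common values
        details = table.analyze_column(column)
        print(table_name, column, details.num_distinct, details.num_null)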

Here’s an example of the output when running analyze-tables against a database created using my github-to-sqlite tool.

Office hours for 2021

Every Friday I run Datasette Office Hours sessions, where people who use Datasette (or who are interested in using it) can grab a 20 minute video chat with me to talk about the project. I’ve done a couple of weeks of these now and they’ve been fantastic - it’s so interesting hearing how people are using or plan to use the tool, and I’ve already had a bunch of great feedback to help me plan the next steps for the project.

If you’re interested in talking to me, slots in January 2021 are available now!

Datasette Office Hours, Personal Data Warehouses, datasette-ripgrep, datasette-indieauth, Datasette Client for Observable and more

Datasette Weekly(ish) volume 4

In this edition: book Datasette office hours, build a personal data warehouse, deploy a regular expression code search engine, plus updates on Datasette project releases from the past month.

Datasette Office Hours

One of the toughest things about running open source projects is the challenge of getting feedback: users often assume they are doing you a favor by leaving you alone, so generally you only hear from people if they find a bug.

I’m always really keen to talk to people who are using Datasette - or who aren’t using it yet, since I want to understand what it’s missing.

So I’m going to try a new approach: I’m setting aside time every Friday for Datasette Office Hours, where anyone can book a 20 minute Zoom call with me to talk about the project.

I’d love to hear from you if:

  • You’re solving problems with Datasette

  • You have a problem you think Datasette might be able to help with

  • You’ve run into problems using Datasette

  • You have thoughts on the direction you think the project should go

  • You’d like to see some demos of things I’m working on

  • You’re just interested in having a chat!

You can sign up for office hours at calendly.com/swillison/datasette-office-hours

Personal Data Warehouses

I gave a talk for the GitHub OCTO Speaker Series a couple of weeks ago about my Datasette and Dogsheep projects called Personal Data Warehouses: Reclaiming Your Data.

GitHub shared the video on their YouTube channel, and I’ve prepared an extended, annotated version of the talk with additional screenshots, links and notes.

The talk shows how I built my own personal data warehouse on top of Datasette that imports, analyzes and visualizes my data from Twitter, Swarm, GitHub, 23AndMe, Apple HealthKit, Apple Photos and more.

datasette-ripgrep

datasette-ripgrep is a web application I built for running regular expression searches against source code, built on top of the amazing ripgrep command-line tool.

The demo runs example searches across around 100 Datasette and Datasette-adjacent GitHub repositories.

I wrote about the project on my blog, in datasette-ripgrep: deploy a regular expression search engine for your source code.

It’s an interesting use case for Datasette in that it doesn’t use SQLite at all - the tool works by running the “rg” executable against a folder full of source code. It does benefit from Datasette’s “datasette publish” mechanism - the following one-liner will deploy a Datasette instance to Google Cloud Run pre-configured to run searches against everything in the “all/” directory, which is uploaded as part of the deployment:

datasette publish cloudrun \
    --metadata metadata.json \
    --static all:all \
    --install=datasette-ripgrep \
    --service datasette-ripgrep \
    --apt-get-install ripgrep

The official demo is deployed by this GitHub Actions workflow which pulls a list of repos using github-to-sqlite, filters down to just the ones I want to include in the demo, then deploys them using the above pattern.

datasette-indieauth

My other big plugin project this month was datasette-indieauth, an authentication plugin which adds support for the emerging IndieAuth single sign-on standard.

You can read more about this project in Implementing IndieAuth for Datasette on my blog. IndieAuth is a spiritual successor to OpenID which allows users to sign in using a website address that they control. It’s a particularly good fit for Datasette as it allows you to deploy single sign-on without first registering your site with a central authority.

You can see what the sign-in experience looks like by trying the official demo at datasette-indieauth-demo.datasette.io/-/indieauth.

Datasette Client for Observable

Alex Garcia built a beautiful JavaScript client library for interacting with data hosted by Datasette from Observable notebooks. His demo at observablehq.com/@asg017/datasette-client shows how to use the client and demonstrates it integrating with Observable’s form elements and visualizing data as a stacked area chart using D3.

Other significant releases this month

Datasette 0.52 (and 0.52.1) - a relatively small release, this adds a new database_actions(datasette, actor, database) plugin hook and renames the --config option to --setting. --config still works, but shows a deprecation message and will be removed in Datasette 1.0.

github-to-sqlite 2.8 adds a new “github-to-sqlite workflows” command which imports GitHub Actions workflow YAML files and uses them to populate new workflows, jobs and steps tables.

datasette-graphql 1.2 and 1.3 add support for the Datasette view-instance and view-database permissions, and use the new table actions plugin hook to add example GraphQL queries to the cog action menu on every table page. You can try that out against this commits table created using github-to-sqlite.

sqlite-utils 3.0 adds a new command-line tool and Python method for executing full-text searches against a table, and returning the results ordered by relevance. It also adds a new --tsv output option and makes some small changes to other command-line options, hence the 3.0 major version bump.

Datasette 0.51 - new visual design, smarter plugins

Datasette Weekly volume 3

In this edition: annotated release notes for Datasette 0.51, and datasette-copyable as plugin of the week.

(You may have seen these annotated release notes on my blog - if so, feel free to skip straight to plugin of the week.)

Datasette 0.51, the annotated release notes

I shipped Datasette 0.51 at the weekend, with a new visual design, plugin hooks for adding navigation options, better handling of binary data, URL building utility methods and better support for running Datasette behind a proxy. It’s a lot of stuff! Here are the annotated release notes.

New visual design

Datasette is no longer white and grey with blue and purple links! Natalie Downe has been working on a visual refresh, the first iteration of which is included in this release. (#1056)

It’s about time Datasette grew beyond its clearly-designed-by-a-mostly-backend-engineer roots. Natalie has been helping me start adding some visual polish: we’ve started with an update to the colour scheme and will be continuing to iterate on the visual design as the project evolves towards the 1.0 release.

The new design makes the navigation bar much more obvious, which is important for this release since the new navigation menu (tucked away behind a three-bar icon) is a key new feature.

Plugins can now add links within Datasette

A number of existing Datasette plugins add new pages to the Datasette interface, providing tools for things like uploading CSVs, editing table schemas or configuring full-text search.

Plugins like this can now link to themselves from other parts of the Datasette interface. The menu_links(datasette, actor) hook (#1064) lets plugins add links to Datasette’s new top-right application menu, and the table_actions(datasette, actor, database, table) hook (#1066) adds links to a new “table actions” menu on the table page.
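
A minimal menu_links implementation looks something like this sketch - the URL and label are invented, but the dictionary shape matches the hook’s documentation:

from datasette import hookimpl

@hookimpl
def menu_links(datasette, actor):
    # Only show the link to signed-in users; datasette.urls.path() takes
    # care of any base_url prefix
    if actor:
        return [
            {"href": datasette.urls.path("/-/upload-csvs"), "label": "Upload CSVs"}
        ]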

This feature has been a long time coming. I’ve been writing an increasing number of plugins that add new pages to Datasette, and so far the main way of using them has been to memorise and type in their URLs!

The new navigation menu (which only displays if it has something in it) provides a global location to add new links. I’ve already released several plugin updates that take advantage of this.

The new “table actions” menu imitates Datasette’s existing column header menu icon—it’s a cog. Clicking it opens a menu of actions relating to the current table.

Want to see a demo?

The demo at latest.datasette.io now includes some example plugins. To see the new table actions menu first sign into that demo as root and then visit the facetable table to see the new cog icon menu at the top of the page.

Here’s an animated GIF demo showing the new menus in action.

Binary data

SQLite tables can contain binary data in BLOB columns. Datasette now provides links for users to download this data directly from Datasette, and uses those links to make binary data available from CSV exports. See Binary data for more details. (#1036 and #1034).

I spent a ton of time on this over the past few weeks. The initial impetus was a realization that Datasette CSV exports included ugly Python b'\x15\x1c\x02\xc7\xad\x05\xfe' strings, which felt like the worst possible way to display binary in a CSV file, out of universally bad options.

Datasette’s main interface punted on binary entirely—it would show a <Binary data: 7 bytes> label which didn’t help much either.

The only way to get at binary data stored in a Datasette instance was to request the JSON version and then manually decode the Base-64 value within it!

This is now fixed: binary columns can be downloaded directly to your computer, using a new .blob output renderer. The approach is described on this new page in the documentation.

Security was a major consideration when building this feature. Allowing the download of arbitrary byte payloads from a web server is dangerous business: it can easily result in XSS holes where HTML with dangerous <script> content can end up hosted on the primary domain.

After some research, I decided to serve up binary content for download using the following headers:

content-type: application/binary
x-content-type-options: nosniff
content-disposition: attachment; filename="data-f30889.blob"

application/binary is a safer Content-Type option than the more common application/octet-stream, according to Michal Zalewski’s renowned web application security book The Tangled Web (quoted here).

x-content-type-options: nosniff disables the XSS-tastic content sniffing feature in older versions of Internet Explorer, where IE would helpfully guess that you intended to serve HTML based on the first few bytes of the response.

The content-disposition: attachment header causes the browser to show a “download this file” dialog, using the suggested filename.

If you know of a reason that this isn’t secure enough, please let me know!

URL building

The new datasette.urls family of methods can be used to generate URLs to key pages within the Datasette interface, both within custom templates and Datasette plugins. See Building URLs within plugins for more details. (#904)

Datasette’s base_url configuration setting was the forcing factor around this piece of work.

It allows you to configure Datasette to serve content starting at a path other than /—for example:

datasette --config base_url:/path-to-datasette/ 

This will serve all Datasette pages at locations starting with /path-to-datasette/.

Why would you want to do this? It’s useful if you are proxying traffic to Datasette from within the URL hierarchy of an existing website.

The feature didn’t work properly, and enough people care about it that I had a steady stream of bug reports. For 0.51 I gathered them all into a single giant tracking issue and worked through them all one by one.

It quickly became apparent that the key challenge was building URLs within Datasette—not just within HTML template pages, but also for things like HTTP redirects.

Datasette itself needed to generate URLs that took the base_url setting into account, but so do Datasette plugins. So I built a new datasette.urls collection of helper methods and made them part of the documented internals API for plugins. The Building URLs within plugins documentation shows how these should be used.

I also added documentation on Running Datasette behind a proxy with example configs (tested on my laptop) for both nginx and Apache.

The datasette.client mechanism from Datasette 0.50 allows plugins to make calls to Datasette’s internal JSON API without the overhead of an HTTP request. This is another place where plugins need to be able to construct valid URLs to internal Datasette pages.

I added this example to the documentation showing how the two features can work together:

table_json = (
    await datasette.client.get(
        datasette.urls.table("fixtures", "facetable", format="json")
    )
).json()

One final weird detail on this: Datasette now has various methods that automatically add the base_url prefix to a URL. I got worried about what would happen if these were applied more than once (as above, where datasette.urls.table() applies the prefix and then datasette.client.get() applies it again).

I fixed this using the same trick that Django and Jinja use to avoid applying auto-escaping twice to content that will be displayed in HTML: the datasette.urls methods actually return a PrefixedUrlString object which is a subclass of str that knows that the prefix has been applied! Code for that lives here.
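
Here’s the trick reduced to a sketch (the real implementation in Datasette handles more cases):

class PrefixedUrlString(str):
    "A str subclass marking a URL that already includes the base_url prefix"

def apply_base_url(url, base_url):
    if isinstance(url, PrefixedUrlString):
        return url  # Already prefixed - don't apply it twice
    return PrefixedUrlString(base_url.rstrip("/") + "/" + url.lstrip("/"))

url = apply_base_url("/fixtures", "/path-to-datasette/")
print(apply_base_url(url, "/path-to-datasette/"))  # /path-to-datasette/fixtures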

Smaller changes

A few highlights from the “smaller changes” in Datasette 0.51:

  • Wide tables shown within Datasette now scroll horizontally (#998). This is achieved using a new <div class="table-wrapper"> element which may impact the implementation of some plugins (for example this change to datasette-cluster-map).

I think this is a big improvement: if your database table is too wide, it now scrolls horizontally on the page (rather than blowing the entire page out to a wider width). You can see that in action on the global-power-plants demo.

  • New debug-menu permission.

If you are signed in as root the new navigation menu links to a whole plethora of previously-undiscoverable Datasette debugging tools. This new permission controls the display of those items.

  • Link: HTTP header pagination. (#1014)

Inspired by GitHub and WordPress, which both use the HTTP Link header in this way. It’s an optional extra though: Datasette will always offer in-JSON pagination information.

  • Edit SQL button on canned queries. (#1019)

Suggested by Jacob Fenton in this issue. The implementation had quite a few edge cases since there are certain categories of canned query that can’t be executed as custom SQL by the user. See the issue comments for details and a demo.

  • --load-extension=spatialite shortcut. (#1028)

Inspired by a similar feature in sqlite-utils.

  • datasette -o option now opens the most relevant page. (#976)

This is a fun little feature. If your Datasette only loads a single database, and that database only has a single table (common if you’ve just run a single CSV import) then running this will open your browser directly to that table page:

datasette data.db -o

  • datasette --cors option now enables access to /database.db downloads. (#1057)

This was inspired by Mike Bostock’s Observable Notebook that uses the Emscripten-compiled JavaScript version of SQLite to run queries against SQLite database files.

It turned out you couldn’t use that notebook against SQLite files hosted in Datasette because they weren’t covered by Datasette’s CORS option. Now they are!

  • Recommendations for plugin authors, inspired by a question from David Kane on Twitter.

David has been building datasette-reconcile, a Datasette plugin that offers a reconciliation API endpoint that can be used with OpenRefine. What a brilliant idea!

Plugin of the week: datasette-copyable

One of my goals with Datasette is to make getting data out of it as easy as possible. I want Datasette to be the obvious place to store your data because you’ll be able to transform it (either manually or automatically) into any other format you could possibly want, often by taking advantage of plugins.

datasette-copyable is an acknowledgement that often the easiest way to export data is using copy and paste! It adds the ability to export Datasette tables in 21 different formats - from TSV to Jira table markup to LaTeX, Mediawiki markup or HTML.

The most useful of these is the first one: TSV. The great thing about tab-separated data is that you can paste it directly into Microsoft Excel, Apple Numbers or Google Sheets and it will be automatically arranged into cells that correspond to the original table. This makes copy-as-TSV by far the easiest way to get small amounts of data out of Datasette and into a spreadsheet.

The other flavours of export are provided by the excellent Tabulate Python library, which is also used for the different command-line export options provided by sqlite-utils.
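
To get a feel for what Tabulate does, here’s a sketch using a couple of invented rows - tablefmt also accepts "tsv", "jira", "latex", "html" and many more:

from tabulate import tabulate

rows = [["Aer Lingus", 2], ["Aeroflot", 76]]  # illustrative values only
headers = ["airline", "incidents_85_99"]

# Renders the rows as a MediaWiki table, ready to paste into a wiki page
print(tabulate(rows, headers=headers, tablefmt="mediawiki"))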

You can try out datasette-copyable on fivethirtyeight.datasettes.com - this linked example provides exports of airline safety data.

Since copy-and-paste is such a useful way of moving data around, I’m starting to percolate the idea of supporting copy-and-paste of data into Datasette as well.

Dogsheep: Personal analytics with Datasette

Datasette Weekly volume 2

This week’s newsletter will take a deep dive into Dogsheep, my project to build tools for personal analytics on top of Datasette and SQLite.

Dogsheep

The name requires some explanation. In February 2019 Stephen Wolfram published Seeking the Productive Life: Some Details of My Personal Infrastructure - an astonishingly detailed essay about the extreme lengths he has gone to over the years to optimize his personal productivity. Stephen runs a 1,000+ person company as a remote CEO, so he’s been figuring out productive remote working since long before Covid forced the rest of us to all learn to work from home.

There’s a lot of stuff in there, but one thing that stood out to me in particular was Stephen’s description of his “metasearcher”:

A critical part of my personal infrastructure is something that in effect dramatically extends my personal memory: my “metasearcher”. At the top of my personal homepage is a search box. Type in something like “rhinoceros elephant” and I’ll immediately find every email I’ve sent or received in the past 30 years in which that’s appeared, as well as every file on my machine, and every paper document in my archives.

That sounded pretty great to me, so I asked myself if I could bring all of my data from different sources together into a single personal search engine. And since I wanted to build something that was inspired by Wolfram but not nearly as impressive, I decided to call it Dogsheep.

Stephen has a search engine called Wolfram Alpha, so obviously my search engine would be called Dogsheep Beta. And I admit that this pun amused me so much I felt obliged to build the software!

Dogsheep is a collection of open source tools designed to convert personal data from different sources into SQLite databases, so that I can explore them with Datasette. The tools I’ve written so far, along with the tools other developers have been inspired to build in the same family, are all listed in the Dogsheep organization on GitHub.

If you want to see a demo of my own personal Dogsheep instance in action, take a look at the talk I gave about the project at PyCon AU 2020.

I used this Google Doc as a handout for the session, with links, notes and the Q&A.

Dogsheep Beta

I finally got dogsheep-beta working a few weeks ago. It’s the search engine that ties everything else together, and it works by building a single SQLite table with a search index built against copies of data pulled from all of the other sources.

It uses YAML configuration to specify queries that should be run against each data source to build the index - extracting out titles, timestamps and searchable text for each record.

The YAML can also define SQL to be used to load extra details about search results in order to feed variables to a template, and those templates can then be embedded in the YAML file as well. This means you can use Dogsheep Beta to quickly define custom multi-source search engines for any kind of content that you’ve previously pulled into a SQLite database.

It may even be useful for things other than personal search engines! I should really get a live demo of it running somewhere that’s not full of my own personal data.

In the meantime, here’s a screenshot of my personal Dogsheep Beta instance showing a search for #dogfest that returns both tweets and Swarm checkins.

I released dogsheep-beta 0.9 at the weekend, upgraded to use the new datasette.client internal API request mechanism added in Datasette 0.50.

Datasette 1.0 API changes

I continue to work towards a Datasette 1.0 release. Datasette is pretty stable at the moment, but for 1.0 I want to be able to make a promise that plugins built against Datasette and external code written against Datasette’s JSON API will continue to work without changes until a 2.0 release that breaks backwards-compatibility: and with any luck I hope to stay in the 1.x series for as long as possible.

Datasette’s default JSON API at the moment isn’t quite right. When I work with it myself I almost always find myself opting for the ?_shape=array option - compare the following:

Default: covid-19.datasettes.com/covid/ny_times_us_counties.json

_shape=array: covid-19.datasettes.com/covid/ny_times_us_counties.json?_shape=array
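
The difference matters in code: with ?_shape=array the response is a plain JSON list of row objects. Here’s a quick sketch using httpx, with _size limiting the demo to three rows:

import httpx

rows = httpx.get(
    "https://covid-19.datasettes.com/covid/ny_times_us_counties.json",
    params={"_shape": "array", "_size": 3},
).json()
print(rows[0])  # A plain {"column": value} dictionary per row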

I’m reconsidering the default design at the moment in #782: Redesign default JSON format in preparation for Datasette 1.0. This is one of the most significant remaining tasks before a 1.0 release, so I want to be sure to get it right!

I’m experimenting with the new default shape in a plugin called datasette-json-preview, which adds a new .json-preview extension showing what I’m thinking about. You can try that out at https://latest-with-plugins.datasette.io/fixtures/facetable.json-preview - that’s a demo Datasette instance I run which installs as many plugins as possible in order to see what happens when they all run inside the same instance.

I am very keen to get feedback and suggestions on the JSON redesign. Please post any comments you might have on the issue.

Plugin of the week: datasette-cluster-map

datasette-cluster-map is the first Datasette plugin I released, and remains one of my favourites.

Once installed, it quietly checks each table to see if it has latitude and longitude columns. If it does, the plugin triggers, loads up a Leaflet map and attempts to render all of the rows with a latitude and longitude on a map.

It’s using Leaflet.markercluster under the hood, which in my experience will quite happily display hundreds of thousands of clustered markers on a single page.

Here’s my favorite demo: a table of 33,000 power plants around the world using data maintained by the World Resources Institute. You can show them all on one map, or you can use Datasette facets to filter down to e.g. the 429 hydro plants in France.

I find myself using this plugin constantly - any time I come across data with latitude and longitude columns I’ll render it on a map, and often I’ll spot interesting anomalies - like the fact that the island of Corsica in the Mediterranean remains a region of France and has 7 hydro power plants.

This also illustrates a pattern that I’m keen on exploring more: Datasette plugins which detect something interesting about the data that is being displayed and automatically offer additional ways to explore and visualize it. datasette-cluster-map is still the best example of this, but other examples include datasette-render-binary (tries to detect the format of binary data) and datasette-leaflet-geojson (spots GeoJSON values and renders them as a map).

Join the conversation

More newsletter next week, but in the meantime I’d love to hear from you on Twitter (I’m @simonw) or in the Datasette GitHub Discussions. Thanks for reading!
