Datasette 0.54, and querying a database by drawing on a map

Datasette Weekly(ish) volume 6

Since the last edition of this newsletter I’ve made 33 releases across 13 different Datasette ecosystem projects. Here are some of the highlights.

datasette-leaflet-freedraw

Datasette supports parameterized SQL queries: if you compose a SQL query with a placeholder such as “select * from counties where population > :number”, Datasette will spot the :number parameter and turn it into a form field on the query page.

I’ve been planning to add custom form field widgets for a while. datasette-leaflet-freedraw is a new plugin which starts exploring this concept by enhancing any fields ending in “freedraw” with a map widget that lets users draw a shape, which will be converted to a GeoJSON MultiPolygon ready to be sent back to Datasette.

You can read more about this plugin, and try an interactive demo, in Drawing shapes on a map to query a SpatiaLite database.
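
Here’s a minimal sketch of how a canned query can take advantage of the widget. The parks database, table and SpatiaLite SQL are hypothetical - the key detail is that the parameter name ends in “freedraw”, which is what triggers the map widget:

import json

# Hypothetical metadata.json for a SpatiaLite parks database: because the
# parameter is called :freedraw, datasette-leaflet-freedraw swaps in a
# draw-on-a-map widget and submits the drawn shape as GeoJSON
metadata = {
    "databases": {
        "parks": {
            "queries": {
                "within_shape": {
                    "sql": (
                        "select name from parks where "
                        "Intersects(GeomFromGeoJSON(:freedraw), geometry) = 1"
                    )
                }
            }
        }
    }
}

with open("metadata.json", "w") as fp:
    json.dump(metadata, fp, indent=4)

Start Datasette with datasette parks.db -m metadata.json and the within_shape query page should display the map.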

Datasette 0.54

I released Datasette 0.54 yesterday. You can read the annotated release notes on my blog, but the two big new features in this release are the _internal SQLite in-memory database, which stores details of all connected databases and tables, and support for JavaScript modules in plugins and additional scripts.

The _internal database

I want Datasette to be able to handle dozens of attached database files containing potentially thousands of tables. To support this, the Datasette homepage will need to get a lot smarter: it needs to provide pagination and search and filtering against the list of tables themselves.

Datasette is already pretty good at providing paginated search and filtering! The new _internal database is the first step to providing this feature: it's an in-memory database which is automatically updated with details of all of the attached databases, plus their columns, foreign keys and indexes.

As part of building this I added support for shared, named in-memory databases - so plugins can now create an in-memory database that will persist for the lifetime of the Datasette process and can be accessed from multiple connections.
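
Here’s a rough sketch of what that looks like from a plugin, based on the internals documentation - the “statistics” database name and the counts table are invented for illustration:

from datasette import hookimpl
from datasette.database import Database

@hookimpl
def startup(datasette):
    async def inner():
        # Create a shared in-memory database that persists for the lifetime
        # of the Datasette process ("statistics" is a made-up name)
        db = Database(datasette, memory_name="statistics")
        datasette.add_database(db)
        await db.execute_write(
            "create table if not exists counts (key text, value integer)",
            block=True,
        )
    return inner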

You can browse a demo of the _internal database by signing into the demo as root and then accessing latest.datasette.io/_internal

JavaScript modules

JavaScript modules are supported by modern browsers and allow you to use the “import” and “export” keywords from ECMAScript 2015 without needing a build step.

They only work in <script> blocks with the type="module" attribute, so Datasette now has the ability to include scripts with that attribute, for both plugins and extra scripts loaded through configuration.
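
For plugin authors, opting in looks something like this sketch - the URL is a stand-in, but the extra_js_urls() hook really can now return dictionaries with "module": True:

from datasette import hookimpl

@hookimpl
def extra_js_urls():
    return [
        {
            # Hypothetical URL; scripts flagged as modules are included
            # using <script type="module">
            "url": "https://example.com/my-plugin.js",
            "module": True,
        }
    ]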

A big benefit of native modules is that they help avoid accidentally loading the same script twice. There are three Datasette plugins that use Leaflet maps now - datasette-cluster-map, datasette-leaflet-geojson and the new datasette-leaflet-freedraw - so I've started working on a new base plugin called datasette-leaflet which makes the latest Leaflet available for all of them to use via the modules system.

datasette-css-properties and datasette-export-notebook

Two more new plugins: these use Datasette's register_output_renderer hook to add extra options for getting data back out of Datasette.
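
As a reminder of how that hook works, here’s a minimal sketch of a renderer adding a hypothetical .txt extension - the real plugins are more sophisticated than this:

from datasette import hookimpl
from datasette.utils.asgi import Response

def render_txt(rows, columns):
    # Render the query results as tab-separated plain text
    lines = ["\t".join(str(row[column]) for column in columns) for row in rows]
    return Response.text("\n".join(lines))

@hookimpl
def register_output_renderer(datasette):
    # Adds .txt as an output option alongside .json and .csv
    return {"extension": "txt", "render": render_txt}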

I wrote about datasette-css-properties here - it's a very weird plugin! It was inspired by Custom Properties as State by Chris Coyier, which proposed server APIs that return CSS files defining custom properties that can be used to style pages.

I'm not sure if this is a good idea, but the joy of plugins is that you can try out ideas like this without risking damage to the integrity of the core project!

datasette-export-notebook is more obviously useful: it adds instructions for exporting data from Datasette into Jupyter and Observable notebooks. You can try that out here against the Covid-19 Datasette.

Datasette Office Hours

I've now had 17 conversations with people about Datasette in office hours sessions. It's been absolutely wonderful! Talking to people about what they're building, or what they might build, is incredibly valuable for helping me decide what to work on next, and it's also just really fun talking to people about the project.

I plan to continue running office hours sessions on Friday for the foreseeable future. If you'd like to chat about the project you can grab a slot with me here.

Official project website for Datasette, building a search engine with Dogsheep Beta, sqlite-utils analyze-tables

Datasette Weekly(ish) volume 5

Datasette finally has an official project website!

datasette.io is the new home for Datasette. The site features news about the project, explains the problems that Datasette and its ecosystem of tools try to solve and gathers together a complete directory of plugins and tools for use with Datasette.

The site itself is built using Datasette - it’s actually a heavily themed installation of Datasette, making extensive use of the datasette-template-sql plugin and custom pages to implement the different sections of the site. The code for the site is available on GitHub at simonw/datasette.io.

I wrote more about how the site works in datasette.io, an official project website for Datasette.

Adding search to datasette.io

datasette.io also includes a search engine for the project.

It provides faceted search across:

  • Every section of the latest documentation (416 total)

  • 48 plugin READMEs

  • 22 tool READMEs

  • 64 news items posted on the Datasette website

  • 212 items from my blog

  • Release notes from 557 package releases

That’s 1,319 items total! The overall Datasette project continues to grow in all kinds of directions, so having a single search engine to tie it all together felt like a useful addition.

I built the search engine using Dogsheep Beta, the cross-table SQLite search engine I originally built for my Dogsheep Personal Analytics project.

You can read more about how the search engine works, including a breakdown of the YAML configuration used to create the unified search index, in Building a search engine for datasette.io.

sqlite-utils analyze-tables

I released sqlite-utils 3.1 with a useful feature for getting to grips with new data: the “sqlite-utils analyze-tables” command.

Run this against any SQLite database and it will spit out a summary of every column in every table, showing you common and least-common values, the number of distinct items and how many rows contain nulls or blanks.

The analyze-tables documentation has more. In addition to outputting the summary to the command line, you can add the “--save” option to store the generated summary in a SQLite table called _analyze_tables_ - which means you can then further examine it using Datasette.
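
The same summaries are available from the sqlite-utils Python API as table.analyze_column() - here’s a rough sketch, assuming a github.db file created with github-to-sqlite:

import sqlite_utils

db = sqlite_utils.Database("github.db")
for table_name in db.table_names():
    table = db[table_name]
    for column in table.columns_dict:
        # ColumnDetails includes distinct/null/blank counts plus the
        # most-common and least-common values
        details = table.analyze_column(column)
        print(table_name, column, details.num_distinct, details.num_null)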

Here’s an example of the output when running analyze-tables against a database created using my github-to-sqlite tool.

Office hours for 2021

Every Friday I run Datasette Office Hours sessions, where people who use Datasette (or who are interested in using it) can grab a 20 minute video chat with me to talk about the project. I’ve done a couple of weeks of these now and they’ve been fantastic - it’s so interesting hearing how people are using or plan to use the tool, and I’ve already had a bunch of great feedback to help me plan the next steps for the project.

If you’re interested in talking to me, slots in January 2021 are available now!

Datasette Office Hours, Personal Data Warehouses, datasette-ripgrep, datasette-indieauth, Datasette Client for Observable and more

Datasette Weekly(ish) volume 4

In this edition: book Datasette office hours, build a personal data warehouse, deploy a regular expression code search engine, plus updates on Datasette project releases from the past month.

Datasette Office Hours

One of the toughest things about running open source projects is the challenge of getting feedback: users often assume they are doing you a favor by leaving you alone, so generally you only hear from people if they find a bug.

I’m always really keen to talk to people who are using Datasette - or who aren’t using it yet, since I want to understand what it’s missing.

So I’m going to try a new approach: I’m setting aside time every Friday for Datasette Office Hours, where anyone can book a 20 minute Zoom call with me to talk about the project.

I’d love to hear from you if:

  • You’re solving problems with Datasette

  • You have a problem you think Datasette might be able to help with

  • You’ve run into problems using Datasette

  • You have thoughts on the direction you think the project should go

  • You’d like to see some demos of things I’m working on

  • You’re just interested in having a chat!

You can sign up for office hours at calendly.com/swillison/datasette-office-hours

Personal Data Warehouses

I gave a talk for the GitHub OCTO Speaker Series a couple of weeks ago about my Datasette and Dogsheep projects called Personal Data Warehouses: Reclaiming Your Data.

GitHub shared the video on their YouTube channel, and I’ve prepared an extended, annotated version of the talk with additional screenshots, links and notes.

The talk shows how I built my own personal data warehouse on top of Datasette that imports, analyzes and visualizes my data from Twitter, Swarm, GitHub, 23AndMe, Apple HealthKit, Apple Photos and more.

datasette-ripgrep

datasette-ripgrep is a web application I built for running regular expression searches against source code, built on top of the amazing ripgrep command-line tool.

The demo runs example searches across around 100 Datasette and Datasette-adjacent GitHub repositories.

I wrote about the project on my blog, in datasette-ripgrep: deploy a regular expression search engine for your source code.

It’s an interesting use case for Datasette in that it doesn’t use SQLite at all - the tool works by running the “rg” executable against a folder full of source code. It does benefit from Datasette’s “datasette publish” mechanism - the following one-liner will deploy a Datasette instance to Google Cloud Run pre-configured to run searches against everything in the “all/” directory, which is uploaded as part of the deployment:

datasette publish cloudrun \
    --metadata metadata.json \
    --static all:all \
    --install=datasette-ripgrep \
    --service datasette-ripgrep \
    --apt-get-install ripgrep

The official demo is deployed by this GitHub Actions workflow which pulls a list of repos using github-to-sqlite, filters down to just the ones I want to include in the demo, then deploys them using the above pattern.

datasette-indieauth

My other big plugin project this month was datasette-indieauth, an authentication plugin which adds support for the emerging IndieAuth single sign-on standard.

You can read more about this project in Implementing IndieAuth for Datasette on my blog. IndieAuth is a spiritual successor to OpenID which allows users to sign in using a website address that they control. It’s a particularly good fit for Datasette as it allows you to deploy single sign-on without first registering your site with a central authority.

You can see what the sign-in experience looks like by trying the official demo at datasette-indieauth-demo.datasette.io/-/indieauth.

Datasette Client for Observable

Alex Garcia built a beautiful JavaScript client library for interacting with data hosted by Datasette from Observable notebooks. His demo at observablehq.com/@asg017/datasette-client shows how to use the client and demonstrates it integrating with Observable’s form elements and visualizing data as a stacked area chart using D3.

Other significant releases this month

Datasette 0.52 (and 0.52.1) - a relatively small release, this adds a new database_actions(datasette, actor, database) plugin hook and renames the --config option to --setting. --config still works, but shows a deprecation message and will be removed in Datasette 1.0.

github-to-sqlite 2.8 adds a new “github-to-sqlite workflows” command which imports GitHub Actions workflow YAML files and uses them to populate new workflows, jobs and steps tables.

datasette-graphql 1.2 and 1.3 add support for the Datasette view-instance and view-database permissions, and use the new table actions plugin hook to add example GraphQL queries to the cog action menu on every table page. You can try that out against this commits table created using github-to-sqlite.

sqlite-utils 3.0 adds a new command-line tool and Python method for executing full-text searches against a table, and returning the results ordered by relevance. It also adds a new --tsv output option and makes some small changes to other command-line options, hence the 3.0 major version bump.

Datasette 0.51 - new visual design, smarter plugins

Datasette Weekly volume 3

In this edition: annotated release notes for Datasette 0.51, and datasette-copyable as plugin of the week.

(You may have seen these annotated release notes on my blog - if so, feel free to skip straight to plugin of the week.)

Datasette 0.51, the annotated release notes

I shipped Datasette 0.51 at the weekend, with a new visual design, plugin hooks for adding navigation options, better handling of binary data, URL building utility methods and better support for running Datasette behind a proxy. It’s a lot of stuff! Here are the annotated release notes.

New visual design

Datasette is no longer white and grey with blue and purple links! Natalie Downe has been working on a visual refresh, the first iteration of which is included in this release. (#1056)

It’s about time Datasette grew beyond its clearly-designed-by-a-mostly-backend-engineer roots. Natalie has been helping me start adding some visual polish: we’ve started with an update to the colour scheme and will be continuing to iterate on the visual design as the project evolves towards the 1.0 release.

The new design makes the navigation bar much more obvious, which is important for this release since the new navigation menu (tucked away behind a three-bar icon) is a key new feature.

Plugins can now add links within Datasette

A number of existing Datasette plugins add new pages to the Datasette interface, providing tools for things like uploading CSVs, editing table schemas or configuring full-text search.

Plugins like this can now link to themselves from other parts of the Datasette interface. The menu_links(datasette, actor) hook (#1064) lets plugins add links to Datasette’s new top-right application menu, and the table_actions(datasette, actor, database, table) hook (#1066) adds links to a new “table actions” menu on the table page.
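
A minimal menu_links implementation looks something like this sketch - the URL and label are invented, but the dictionary shape matches the hook’s documentation:

from datasette import hookimpl

@hookimpl
def menu_links(datasette, actor):
    # Only show the link to signed-in users; datasette.urls.path() takes
    # care of any base_url prefix
    if actor:
        return [
            {"href": datasette.urls.path("/-/upload-csvs"), "label": "Upload CSVs"}
        ]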

This feature has been a long time coming. I’ve been writing an increasing number of plugins that add new pages to Datasette, and so far the main way of using them has been to memorise and type in their URLs!

The new navigation menu (which only displays if it has something in it) provides a global location to add new links. I’ve already released several plugin updates that take advantage of this.

The new “table actions” menu imitates Datasette’s existing column header menu icon—it’s a cog. Clicking it opens a menu of actions relating to the current table.

Want to see a demo?

The demo at latest.datasette.io now includes some example plugins. To see the new table actions menu first sign into that demo as root and then visit the facetable table to see the new cog icon menu at the top of the page.

Here’s an animated GIF demo showing the new menus in action.

Binary data

SQLite tables can contain binary data in BLOB columns. Datasette now provides links for users to download this data directly from Datasette, and uses those links to make binary data available from CSV exports. See Binary data for more details. (#1036 and #1034).

I spent a ton of time on this over the past few weeks. The initial impetus was a realization that Datasette CSV exports included ugly Python b'\x15\x1c\x02\xc7\xad\x05\xfe' strings, which felt like the worst possible way to display binary in a CSV file, out of universally bad options.

Datasette’s main interface punted on binary entirely—it would show a <Binary data: 7 bytes> label which didn’t help much either.

The only way to get at binary data stored in a Datasette instance was to request the JSON version and then manually decode the Base-64 value within it!

This is now fixed: binary columns can be downloaded directly to your computer, using a new .blob output renderer. The approach is described on this new page in the documentation.

Security was a major consideration when building this feature. Allowing the download of arbitrary byte payloads from a web server is dangerous business: it can easily result in XSS holes where HTML with dangerous <script> content can end up hosted on the primary domain.

After some research, I decided to serve up binary content for download using the following headers:

content-type: application/binary
x-content-type-options: nosniff
content-disposition: attachment; filename="data-f30889.blob"

application/binary is a safer Content-Type option than the more common application/octet-stream, according to Michal Zalewski’s renowned web application security book The Tangled Web (quoted here).

x-content-type-options: nosniff disables the XSS-tastic content sniffing feature in older versions of Internet Explorer, where IE would helpfully guess that you intended to serve HTML based on the first few bytes of the response.

The content-disposition: attachment header causes the browser to show a “download this file” dialog, using the suggested filename.

If you know of a reason that this isn’t secure enough, please let me know!

URL building

The new datasette.urls family of methods can be used to generate URLs to key pages within the Datasette interface, both within custom templates and Datasette plugins. See Building URLs within plugins for more details. (#904)

Datasette’s base_url configuration setting was the forcing factor around this piece of work.

It allows you to configure Datasette to serve content starting at a path other than /—for example:

datasette --config base_url:/path-to-datasette/ 

This will serve all Datasette pages at locations starting with /path-to-datasette/.

Why would you want to do this? It’s useful if you are proxying traffic to Datasette from within the URL hierarchy of an existing website.

The feature didn’t work properly, and enough people care about it that I had a steady stream of bug reports. For 0.51 I gathered them all into a single giant tracking issue and worked through them all one by one.

It quickly became apparent that the key challenge was building URLs within Datasette—not just within HTML template pages, but also for things like HTTP redirects.

Datasette itself needed to generate URLs that took the base_url setting into account, but so do Datasette plugins. So I built a new datasette.urls collection of helper methods and made them part of the documented internals API for plugins. The Building URLs within plugins documentation shows how these should be used.

I also added documentation on Running Datasette behind a proxy with example configs (tested on my laptop) for both nginx and Apache.

The datasette.client mechanism from Datasette 0.50 allows plugins to make calls to Datasette’s internal JSON API without the overhead of an HTTP request. This is another place where plugins need to be able to construct valid URLs to internal Datasette pages.

I added this example to the documentation showing how the two features can work together:

table_json = (
    await datasette.client.get(
        datasette.urls.table("fixtures", "facetable", format="json")
    )
).json()

One final weird detail on this: Datasette now has various methods that automatically add the base_url prefix to a URL. I got worried about what would happen if these were applied more than once (as above, where datasette.urls.table() applies the prefix and then datasette.client.get() applies it again).

I fixed this using the same trick that Django and Jinja use to avoid applying auto-escaping twice to content that will be displayed in HTML: the datasette.urls methods actually return a PrefixedUrlString object which is a subclass of str that knows that the prefix has been applied! Code for that lives here.
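
Here’s the trick reduced to a sketch (the real implementation in Datasette handles more cases):

class PrefixedUrlString(str):
    "A str subclass marking a URL that already includes the base_url prefix"

def apply_base_url(url, base_url):
    if isinstance(url, PrefixedUrlString):
        return url  # Already prefixed - don't apply it twice
    return PrefixedUrlString(base_url.rstrip("/") + "/" + url.lstrip("/"))

url = apply_base_url("/fixtures", "/path-to-datasette/")
print(apply_base_url(url, "/path-to-datasette/"))  # /path-to-datasette/fixtures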

Smaller changes

A few highlights from the “smaller changes” in Datasette 0.51:

  • Wide tables shown within Datasette now scroll horizontally (#998). This is achieved using a new <div class="table-wrapper"> element which may impact the implementation of some plugins (for example this change to datasette-cluster-map).

I think this is a big improvement: if your database table is too wide, it now scrolls horizontally on the page (rather than blowing the entire page out to a wider width). You can see that in action on the global-power-plants demo.

  • New debug-menu permission.

If you are signed in as root the new navigation menu links to a whole plethora of previously-undiscoverable Datasette debugging tools. This new permission controls the display of those items.

  • Link: HTTP header pagination. (#1014)

Inspired by GitHub and WordPress, which both use the HTTP Link header in this way. It’s an optional extra though: Datasette will always offer in-JSON pagination information.

  • Edit SQL button on canned queries. (#1019)

Suggested by Jacob Fenton in this issue. The implementation had quite a few edge cases since there are certain categories of canned query that can’t be executed as custom SQL by the user. See the issue comments for details and a demo.

  • --load-extension=spatialite shortcut. (#1028)

Inspired by a similar feature in sqlite-utils.

  • datasette -o option now opens the most relevant page. (#976)

This is a fun little feature. If your Datasette only loads a single database, and that database only has a single table (common if you’ve just run a single CSV import) then running this will open your browser directly to that table page:

datasette data.db -o

  • datasette --cors option now enables access to /database.db downloads. (#1057)

This was inspired by Mike Bostock’s Observable Notebook that uses the Emscripten-compiled JavaScript version of SQLite to run queries against SQLite database files.

It turned out you couldn’t use that notebook against SQLite files hosted in Datasette because they weren’t covered by Datasette’s CORS option. Now they are!

  • Recommendations for plugin authors, inspired by a question from David Kane on Twitter.

David has been building datasette-reconcile, a Datasette plugin that offers a reconciliation API endpoint that can be used with OpenRefine. What a brilliant idea!

Plugin of the week: datasette-copyable

One of my goals with Datasette is to make getting data out of it as easy as possible. I want Datasette to be the obvious place to store your data because you’ll be able to transform it (either manually or automatically) into any other format you could possibly want, often by taking advantage of plugins.

datasette-copyable is an acknowledgement that often the easiest way to export data is using copy and paste! It adds the ability to export Datasette tables in 21 different formats - from TSV to Jira table markup to LaTeX, Mediawiki markup or HTML.

The most useful of these is the first one: TSV. The great thing about tab-separated data is that you can paste it directly into Microsoft Excel, Apple Numbers or Google Sheets and it will be automatically arranged into cells that correspond to the original table. This makes copy-as-TSV by far the easiest way to get small amounts of data out of Datasette and into a spreadsheet.

The other flavours of export are provided by the excellent Tabulate Python library, which is also used for the different command-line export options provided by sqlite-utils.
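
To get a feel for what Tabulate does, here’s a sketch using a couple of invented rows - tablefmt also accepts "tsv", "jira", "latex", "html" and many more:

from tabulate import tabulate

rows = [["Aer Lingus", 2], ["Aeroflot", 76]]  # illustrative values only
headers = ["airline", "incidents_85_99"]

# Renders the rows as a MediaWiki table, ready to paste into a wiki page
print(tabulate(rows, headers=headers, tablefmt="mediawiki"))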

You can try out datasette-copyable on fivethirtyeight.datasettes.com - this linked example provides exports of airline safety data.

Since copy-and-paste is such a useful way of moving data around, I’m starting to percolate the idea of supporting copy-and-paste of data into Datasette as well.

Dogsheep: Personal analytics with Datasette

Datasette Weekly volume 2

This week’s newsletter will take a deep dive into Dogsheep, my project to build tools for personal analytics on top of Datasette and SQLite.

Dogsheep

The name requires some explanation. In February 2019 Stephen Wolfram published Seeking the Productive Life: Some Details of My Personal Infrastructure - an astonishingly detailed essay about the extreme lengths he has gone to over the years to optimize his personal productivity. Stephen runs a 1,000+ person company as a remote CEO, so he’s been figuring out productive remote working since long before Covid forced the rest of us to all learn to work from home.

There’s a lot of stuff in there, but one thing that stood out to me in particular was Stephen’s description of his “metasearcher”:

A critical part of my personal infrastructure is something that in effect dramatically extends my personal memory: my “metasearcher”. At the top of my personal homepage is a search box. Type in something like “rhinoceros elephant” and I’ll immediately find every email I’ve sent or received in the past 30 years in which that’s appeared, as well as every file on my machine, and every paper document in my archives.

That sounded pretty great to me, so I asked myself if I could bring all of my data from different sources together into a single personal search engine. And since I wanted to build something that was inspired by Wolfram but not nearly as impressive, I decided to call it Dogsheep.

Stephen has a search engine called Wolfram Alpha, so obviously my search engine would be called Dogsheep Beta. And I admit that this pun amused me so much I felt obliged to build the software!

Dogsheep is a collection of open source tools designed to convert personal data from different sources into SQLite databases, so that I can explore them with Datasette. The tools I’ve written so far, along with the tools other developers have been inspired to build in the same family, are all listed in the Dogsheep organization on GitHub.

If you want to see a demo of my own personal Dogsheep instance in action, take a look at the talk I gave about the project at PyCon AU 2020.

I used this Google Doc as a handout for the session, with links, notes and the Q&A.

Dogsheep Beta

I finally got dogsheep-beta working a few weeks ago. It’s the search engine that ties everything else together, and it works by building a single SQLite table with a search index built against copies of data pulled from all of the other sources.

It uses YAML configuration to specify queries that should be run against each data source to build the index - extracting out titles, timestamps and searchable text for each record.

The YAML can also define SQL to be used to load extra details about search results in order to feed variables to a template, and those templates can then be embedded in the YAML file as well. This means you can use Dogsheep Beta to quickly define custom multi-source search engines for any kind of content that you’ve previously pulled into a SQLite database.

It may even be useful for things other than personal search engines! I should really get a live demo of it running somewhere that’s not full of my own personal data.

In the meantime, here’s a screenshot of my personal Dogsheep Beta instance showing a search for #dogfest that returns both tweets and Swarm checkins.

I released dogsheep-beta 0.9 at the weekend, upgraded to use the new datasette.client internal API request mechanism added in Datasette 0.50.

Datasette 1.0 API changes

I continue to work towards a Datasette 1.0 release. Datasette is pretty stable at the moment, but for 1.0 I want to be able to make a promise that plugins built against Datasette and external code written against Datasette’s JSON API will continue to work without changes until a 2.0 release that breaks backwards-compatibility: and with any luck I hope to stay in the 1.x series for as long as possible.

Datasette’s default JSON API at the moment isn’t quite right. When I work with it myself I almost always find myself opting for the ?_shape=array option - compare the following:

Default: covid-19.datasettes.com/covid/ny_times_us_counties.json

_shape=array: covid-19.datasettes.com/covid/ny_times_us_counties.json?_shape=array
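
The difference matters in code: with ?_shape=array the response is a plain JSON list of row objects. Here’s a quick sketch using httpx, with _size limiting the demo to three rows:

import httpx

rows = httpx.get(
    "https://covid-19.datasettes.com/covid/ny_times_us_counties.json",
    params={"_shape": "array", "_size": 3},
).json()
print(rows[0])  # A plain {"column": value} dictionary per row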

I’m reconsidering the default design at the moment in #782: Redesign default JSON format in preparation for Datasette 1.0. This is one of the most significant remaining tasks before a 1.0 release, so I want to be sure to get it right!

I’m experimenting with the new default shape in a plugin called datasette-json-preview, which adds a new .json-preview extension showing what I’m thinking about. You can try that out at https://latest-with-plugins.datasette.io/fixtures/facetable.json-preview - that’s a demo Datasette instance I run which installs as many plugins as possible in order to see what happens when they all run inside the same instance.

I am very keen to get feedback and suggestions on the JSON redesign. Please post any comments you might have on the issue.

Plugin of the week: datasette-cluster-map

datasette-cluster-map is the first Datasette plugin I released, and remains one of my favourites.

Once installed, it quietly checks each table to see if it has latitude and longitude columns. If it does, the plugin triggers, loads up a Leaflet map and attempts to render all of the rows with a latitude and longitude on a map.

It’s using Leaflet.markercluster under the hood, which in my experience will quite happily display hundreds of thousands of clustered markers on a single page.

Here’s my favorite demo: a table of 33,000 power plants around the world using data maintained by the World Resources Institute. You can show them all on one map, or you can use Datasette facets to filter down to e.g. the 429 hydro plants in France.

I find myself using this plugin constantly - any time I come across data with latitude and longitude columns I’ll render it on a map, and often I’ll spot interesting anomalies - like the fact that the island of Corsica in the Mediterranean remains a region of France and has 7 hydro power plants.

This also illustrates a pattern that I’m keen on exploring more: Datasette plugins which detect something interesting about the data that is being displayed and automatically offer additional ways to explore and visualize it. datasette-cluster-map is still the best example of this, but other examples include datasette-render-binary (tries to detect the format of binary data) and datasette-leaflet-geojson (spots GeoJSON values and renders them as a map).

Join the conversation

More newsletter next week, but in the meantime I’d love to hear from you on Twitter (I’m @simonw) or in the Datasette GitHub Discussions. Thanks for reading!
