Datasette Cloud and the Datasette 1.0 alphas
Plus a security vulnerability in the 1.0 alpha series
Here's the latest news from the Datasette ecosystem.
Datasette Cloud
Datasette Cloud is the hosted SaaS version of Datasette. It exists to solve several problems:
I want teams to be able to use Datasette to collaborate on finding stories in their data - initially targeting newsrooms and data journalism, but the software is very clearly applicable to many other fields as well. Asking people to install software on their own server is a major barrier to entry, and I want to remove that barrier.
I'd like the project to be financially sustainable - not just to cover my own costs, but I want to grow a team of people to work full-time on Datasette and associated projects.
I want to make it as easy as possible for people to publish their data online. Datasette is great for this already... if you know how to use it (with Fly, Vercel, Cloud Run or similar). Datasette Cloud aims to make that even easier.
I wrote more about the goals of Datasette Cloud in the first post on the new Datasette Cloud blog: Welcome to Datasette Cloud.
The Datasette Cloud effort is partially sponsored by Fly.io, who are funding Alex Garcia to work with me on the project. Alex's first piece of work for that is the new datasette-write-ui, built initially for Datasette Cloud but available as an open source plugin for anyone to use. Alex introduced that in datasette-write-ui: a Datasette plugin for editing, inserting, and deleting rows on the Datasette Cloud blog.
If you want to try out Datasette Cloud you can request access to the preview today.
Datasette 1.0 alphas
Datasette 1.0 will be the version of Datasette that promises stability, in terms of the JSON API, the template context (so you can build custom templates without fear that they will break in future minor releases) and the API for plugins.
There have been five releases in the 1.0 alpha series so far:
1.0a0 introduced signed API tokens and the Datasette write JSON API.
1.0a1 expanded the write API with CORS support and other small features, and was accompanied by this blog entry showing detailed examples of what the write API can do.
1.0a2 added upsert support, and a mechanism for creating finely-grained access tokens. More in this blog post.
1.0a3 features a new default JSON output, one of the most significant milestones on the way to 1.0 final.
1.0a4 fixes a security vulnerability in the 1.0 alpha series, described next.
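An upsert means "insert this row, or if a row with that primary key already exists, update it instead". As a minimal sketch of the underlying idea - shown here in plain SQLite via Python's sqlite3, not the actual Datasette /db/table/-/upsert JSON API:

```python
import sqlite3

# Illustrative only: upsert semantics in raw SQLite,
# not the Datasette write API itself.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE plants (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO plants (id, name) VALUES (1, 'Fern')")

# Upserting id=1 again updates the existing row instead of failing
db.execute(
    """
    INSERT INTO plants (id, name) VALUES (?, ?)
    ON CONFLICT(id) DO UPDATE SET name = excluded.name
    """,
    (1, "Boston Fern"),
)
# A brand new id is simply inserted
db.execute(
    """
    INSERT INTO plants (id, name) VALUES (?, ?)
    ON CONFLICT(id) DO UPDATE SET name = excluded.name
    """,
    (2, "Cactus"),
)
rows = db.execute("SELECT id, name FROM plants ORDER BY id").fetchall()
print(rows)  # [(1, 'Boston Fern'), (2, 'Cactus')]
```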
A vulnerability in the 1.0 alphas
I released 1.0a4 this morning, which fixes a security vulnerability in the 1.0 alpha series.
This affects you if you are running a Datasette 1.0 alpha instance on a public server.
The bug is that an API explorer interface within Datasette (added in 1.0a0) could be visited by unauthenticated users and would reveal the names of the databases and tables in that instance - though not any of the actual table content.
You should upgrade as quickly as possible if:
You are running a Datasette 1.0 alpha instance on the public internet
That Datasette instance is authenticated using a plugin such as datasette-auth-passwords
There are private databases and tables in that instance where the names of those databases and tables should be considered sensitive information.
For more information read this GitHub security advisory: Datasette 1.0 alpha series leaks names of databases and tables to unauthenticated users.
The vulnerability was present on Datasette Cloud but has now been patched there, and a review of our logs showed that no unauthorized users had accessed that page across any of our hosted instances.
LLM: a new Datasette project
I've spent a lot of time over the past year immersed in research and exploration of the weird new world of LLMs - Large Language Models, the technology behind ChatGPT and GPT-4 and Bard and Bing and Claude and Llama and other similar projects.
Catching up on the weird world of LLMs is a 40m talk (plus detailed transcript, notes and links) I gave at North Bay Python in August which attempts to summarize everything I've learned about LLMs so far.
I'm increasingly optimistic about the role LLMs can play in the Datasette world, especially when combined with Datasette's plugin architecture.
You can read about some of my early explorations of LLMs and Datasette here:
How to implement Q&A against your documentation with GPT3, embeddings and Datasette describes my earliest attempt at building retrieval-augmented generation (RAG) against data in Datasette, where you take a user's question, find relevant content, then feed that to the LLM with "Given the above context, answer the following question: ..." appended to it.
I built a ChatGPT plugin to answer questions about data hosted in Datasette talks about my implementation of a plugin for ChatGPT that turns English questions into SQL queries and lets ChatGPT run those against Datasette and use the results to generate an answer.
Enriching data with GPT3.5 and SQLite SQL functions shows how my openai-to-sqlite CLI tool can run enrichments such as sentiment analysis against data in a SQLite database and store the results.
Storing and serving related documents with openai-to-sqlite and embeddings shows how to use the same tool to store embeddings for documents in a SQLite database, then use those embeddings to identify related documents, save those to a table and serve them up as a "Related" section on my TIL site.
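The retrieval step behind both of those embeddings posts - find the stored document most similar to a query, then use it - comes down to cosine similarity between embedding vectors. Here is a toy sketch using hand-made 3-dimensional "embeddings" rather than vectors from a real embedding model:

```python
import math

# Toy "embeddings" - real ones would come from an embedding model
# and have hundreds or thousands of dimensions.
docs = {
    "Datasette publishes SQLite data": [0.9, 0.1, 0.0],
    "Pelicans are large water birds": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

question = "How do I publish a SQLite database?"
question_embedding = [0.8, 0.2, 0.1]  # would come from the same model

# Retrieval: pick the most similar stored document as context
best = max(docs, key=lambda text: cosine(docs[text], question_embedding))

# RAG: prepend the retrieved context to the user's question
prompt = f"{best}\n\nGiven the above context, answer the following question: {question}"
print(prompt)
```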
These explorations have added up to a new project called LLM, a new command-line utility and Python library for working with LLMs.
The LLM project site is llm.datasette.io.
LLM is heavily inspired by both Datasette and sqlite-utils. It stores prompts and their responses to SQLite, providing you with a permanent structured log of your LLM interactions. It also provides a plugin architecture for adding new functionality.
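The logging idea can be illustrated with a few lines of Python - note this is a sketch of the concept, not LLM's actual table schema:

```python
import sqlite3
import time

# Illustrative sketch of logging prompts and responses to SQLite -
# LLM's real schema differs.
db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE responses (
        id INTEGER PRIMARY KEY,
        model TEXT,
        prompt TEXT,
        response TEXT,
        timestamp REAL
    )
    """
)

def log_interaction(model, prompt, response):
    db.execute(
        "INSERT INTO responses (model, prompt, response, timestamp)"
        " VALUES (?, ?, ?, ?)",
        (model, prompt, response, time.time()),
    )

log_interaction("orca-mini-7b", "3 names for a pet cow", "Daisy, Bessie, Moo")
count = db.execute("SELECT count(*) FROM responses").fetchone()[0]
print(count)  # 1
```

A permanent log like this means every experiment you run against a model becomes queryable data after the fact.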
Most importantly, the plugin architecture lets you add support for alternative models - including models like Llama 2 which you can run on your own machine!
The simplest version of LLM usage looks like this.
First, install it using pip or Homebrew:
pip install llm
# or
brew install llm
If you have an OpenAI API key you can configure that and start using LLM immediately:
llm keys set openai
<paste key here>
Then run a prompt:
llm 'Five nautical names for a pet pelican'
But where things get really fun is when you start adding LLM plugins to install and run alternative models on your own machine.
llm install llm-gpt4all
llm -m orca-mini-7b '3 names for a pet cow'
The first time you run this it will download the orca-mini-7b model from the GPT4All project and use it to run the prompt.
LLM has eight plugins already. My current favorite is llm-mlc, which uses the MLC framework to run models like Llama 2 using GPU acceleration on my M2 Mac laptop.
For more about LLM, read these blog entries:
llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs
The LLM CLI tool now supports self-hosted language models via plugins
Accessing Llama 2 from the command-line with the llm-replicate plugin
Other Datasette news
The Datasette News page has short-form news about the project. The datasette tag on my blog is a lot noisier and more frequently updated.
I plan to send this newsletter out more often, especially given the increased pace of development towards 1.0.