
Hugging Face Datasets

In this guide, you'll learn how to import any dataset from Hugging Face (such as CSV, Parquet, or SQLite files) and connect it to PromptQL so you can query it using natural language.

Check out the GitHub repo: https://github.com/hasura/huggingface-dataset-promptql

Prerequisites

Install the DDN CLI

Minimum version requirements

To use this guide, ensure you've installed/updated your CLI to at least v2.28.0.

Simply run the installer script in your terminal:

curl -L https://graphql-engine-cdn.hasura.io/ddn/cli/v4/get.sh | bash
ARM-based Linux Machines

Currently, the CLI does not support installation on ARM-based Linux systems.

Install Docker

The Docker-based workflow helps you iterate and develop locally without deploying any changes to Hasura DDN, making the development experience faster and your feedback loops shorter. You'll need Docker Compose v2.20 or later.

Validate the installation

You can verify that the DDN CLI is installed correctly by running:

ddn doctor
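
Assuming recent CLI behavior, ddn doctor checks the health of your local setup, for example that the CLI version and your Docker installation are in order.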

Import your Hugging Face Dataset

Step 1: Clone the project

git clone git@github.com:hasura/huggingface-dataset-promptql.git
cd huggingface-dataset-promptql

Step 2: Get the Hugging Face dataset ID handy

To import a dataset, you need to configure its dataset ID along with the path to the file(s).

For example, here's a top-1000 IMDb dataset: https://huggingface.co/datasets/drossi/EDA_on_IMDB_Movies_Dataset. The dataset ID for it would be:

drossi/EDA_on_IMDB_Movies_Dataset/*.csv

Notice the use of a glob pattern to select all the CSV files. Replace this with your dataset of choice along with the path to its files. This should work for any ".csv", ".parquet", or ".sqlite" files on Hugging Face.
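
To illustrate the format further, here are a few hypothetical dataset IDs (the user and dataset names below are placeholders, not real datasets):

some-user/some-dataset/train.parquet (a single Parquet file)
some-user/some-dataset/data/*.parquet (all Parquet files under data/)
some-user/some-dataset/*.sqlite (all SQLite files at the top level)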

Refer to the DuckDB blog post on accessing Hugging Face datasets for more examples of this format.

Step 3: Add Anthropic API key to .env

Set up your .env file with your Anthropic (or OpenAI) API key and GitHub API token.

cp .env.example .env

Get an API key from https://console.anthropic.com/settings/keys

# .env
...
...
ANTHROPIC_API_KEY=<your-anthropic-api-key>

To use an OpenAI key instead, set OPENAI_API_KEY in your .env file and change the LLM environment variable to openai in the compose.yaml file.
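
For reference, here is a minimal sketch of what that change might look like in compose.yaml (the service name below is a placeholder; use the one actually defined in the repo's file):

# compose.yaml (sketch)
services:
  promptql-playground: # placeholder service name
    environment:
      LLM: openai
      OPENAI_API_KEY: ${OPENAI_API_KEY}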

Step 4: Configure .env for huggingface

Head to the app/connector/huggingface directory to configure the dataset.

cd app/connector/huggingface
cp .env.sample .env

Modify the value of the HUGGINGFACE_DATASET environment variable. It follows the format "user/dataset/file-path".

For example:

HUGGINGFACE_DATASET="drossi/EDA_on_IMDB_Movies_Dataset/*.csv"

The IMDb example mentioned in the sample env file is available as a sample dataset to get started with. Feel free to configure any dataset you'd like.
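
To sanity-check the path before introspecting, you can browse the dataset's file tree on Hugging Face at https://huggingface.co/datasets/<user>/<dataset>/tree/main and confirm that the files matched by your glob actually exist.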

Step 5: Introspect the Huggingface Connector

ddn connector introspect huggingface --log-level=DEBUG

Note: Depending on the size of the dataset, it may take some time to fully import the data. The schema is initialized quickly and the data import happens in the background, so you can proceed with the steps below.

The above command runs in DEBUG mode to make it easier to catch errors caused by invalid files.
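
If you'd like to review what introspection discovered before tracking anything, the CLI can list the discovered resources (assuming a recent CLI version that supports this subcommand):

ddn connector show-resources huggingface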

Step 6: Add Models

Based on the imported dataset, a SQL schema is generated. Let's track all the models to get started quickly.

ddn model add huggingface "*"
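
If you'd rather track only specific models, you can add them by name instead of using the glob. For example (the model name here is hypothetical and depends on the schema generated from your dataset):

ddn model add huggingface imdb_movies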

Build your PromptQL app

Now, let's set up the Hasura DDN project with PromptQL to start exploring the data in natural language!

  • Set up the Hasura DDN project already scaffolded in the repo:

In the root directory of the repo, run the following commands:

ddn supergraph build local
ddn project init

  • Start the DDN project:

Let's start the DDN project by executing the following command:

ddn run docker-start

  • Open the local DDN Console to start exploring:

ddn console --local

This should open your browser (or print a URL) displaying the Hasura Console. It'll typically look something like: https://console.hasura.io/local?engine=localhost:3280&promptql=localhost:3282

Ask questions about your dataset

The app has metadata about the dataset you just imported. You should be able to ask domain-specific questions and play around with the data.

Here's a sample of what you can ask to get started.

  • Hi, what can you do?

Depending on the dataset schema, PromptQL will tell you what it can answer and you can go from there.
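
If you went with the IMDb sample dataset, you could follow up with questions like these (adjust them to whatever columns your dataset actually has):

  • What are the top 10 highest-rated movies?
  • Which directors appear most often in the list?
  • How does the average rating vary by release year?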

Clean up and restart your app

If you want to reset the data and start from scratch:

Stop the ddn run docker-start command wherever it's running, then execute the following in the root directory of the repo:

docker compose down -v && ddn run docker-start
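
The -v flag removes the Docker volumes created by the project, which clears any previously imported data before docker-start brings everything back up fresh.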