Local Development (k3d)

Run the full NVCF self-hosted control plane on your laptop using k3d for development, testing, or demos.

This setup is for local development only. It uses fake GPUs, a single Cassandra replica, and ephemeral storage. Do not use this for production workloads.

Assumptions

This guide assumes:

  • Helm charts are pulled from the NGC registry (nvcr.io/0833294136851237/nvcf-ncp-staging)
  • Container images are pulled from the same NGC registry
  • Image pull secrets are configured in the environment YAML using imagePullSecrets to authenticate with NGC

If you are using a different registry (e.g., Amazon ECR, a private Harbor instance, or a local mirror), update the helm.sources and image sections in the environment file and adjust the pull secret configuration accordingly. See self-hosted-image-mirroring for details on mirroring artifacts to other registries.

A ready-to-use k3d configuration and setup script is available in the nv-cloud-function-helpers repository. Clone it and run ./setup.sh to create the cluster with all prerequisites. The script is the source of truth for local cluster bootstrap. The manual commands below are for debugging and recovery. After the script completes, skip ahead to Deploy the NVCF Stack.
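
For reference, the scripted flow looks roughly like the following (the repository URL and script location are assumptions; check the repository README for the exact path):

$git clone https://github.com/NVIDIA/nv-cloud-function-helpers.git
$cd nv-cloud-function-helpers/examples/self-hosted-local-development
$./setup.sh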

Prerequisites

Install the following tools:

  • Docker (running)
  • k3d v5.x or later
  • kubectl
  • helm >= 3.12
  • helmfile >= 1.1.0, < 1.2.0
  • helm-diff plugin (helm plugin install https://github.com/databus23/helm-diff)
  • NGC API Key from ngc.nvidia.com with access to the NVCF chart/image registry
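
A quick way to confirm the toolchain before starting (output formats vary by tool; this is only a sanity check):

$docker info >/dev/null && echo "Docker is running"
$k3d version
$kubectl version --client
$helm version --short
$helmfile version
$helm plugin list | grep diff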

Step 1: Create the k3d Cluster

Save the following configuration as k3d-config.yaml:

apiVersion: k3d.io/v1alpha5
kind: Simple
metadata:
  name: ncp-local

image: rancher/k3s:v1.30.2-k3s2
servers: 1
agents: 5

ports:
  - port: 8080:80
    nodeFilters:
      - loadbalancer
  - port: 8443:443
    nodeFilters:
      - loadbalancer

options:
  k3d:
    wait: true
  k3s:
    extraArgs:
      - arg: "--disable=traefik"
        nodeFilters:
          - server:*
    nodeLabels:
      - label: run.ai/simulated-gpu-node-pool=default
        nodeFilters:
          - agent:3
          - agent:4
      - label: nvidia.com/gpu.family=hopper
        nodeFilters:
          - agent:3
          - agent:4
      - label: nvidia.com/gpu.machine=NVIDIA-DGX-H100
        nodeFilters:
          - agent:3
          - agent:4
      - label: nvidia.com/cuda.driver.major=535
        nodeFilters:
          - agent:3
          - agent:4

This creates a 6-node cluster: 1 server (control plane) and 5 agents. Agents 3 and 4 are pre-labeled for the fake GPU operator. Traefik is disabled because NVCF uses Envoy Gateway.

Create the cluster:

$k3d cluster create --config k3d-config.yaml

Verify:

$kubectl get nodes
$# Expected: 6 nodes (1 server + 5 agents), all Ready
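
If the fake GPU operator fails to discover nodes later, the usual cause is missing node labels. You can confirm they were applied (the label key comes from the k3d config above; node names follow the k3d naming convention):

$kubectl get nodes -l run.ai/simulated-gpu-node-pool=default
$# Expected: k3d-ncp-local-agent-3 and k3d-ncp-local-agent-4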

Step 2: Install the Fake GPU Operator

The fake GPU operator simulates GPU resources on the pre-labeled nodes so the NVCA agent can discover them. See fake-gpu-operator for full details.

$# Install KWOK (required by the fake GPU operator)
$kubectl apply -f https://github.com/kubernetes-sigs/kwok/releases/download/v0.7.0/kwok.yaml
$kubectl wait --for=condition=Available deployment/kwok-controller -n kube-system --timeout=60s
$
$# Install the fake GPU operator
$helm repo add fake-gpu-operator \
> https://runai.jfrog.io/artifactory/api/helm/fake-gpu-operator-charts-prod --force-update
$
$helm upgrade -i gpu-operator fake-gpu-operator/fake-gpu-operator \
> -n gpu-operator --create-namespace \
> --set 'topology.nodePools.default.gpuCount=8' \
> --set 'topology.nodePools.default.gpuProduct=NVIDIA-H100-80GB-HBM3' \
> --set 'topology.nodePools.default.gpuMemory=81559'

If Helm fails with the error RuntimeClass "nvidia" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata, rerun ./setup.sh from the helper repository. The script removes known stale fake GPU operator resources without deleting the k3d cluster. For the canonical recovery workflow, see examples/self-hosted-local-development/README.md.

Verify fake GPUs appear on the labeled nodes:

$kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
$# Agents 3 and 4 should show GPU: 8

Step 3: Install CSI SMB Driver

The CSI SMB driver is required for NVCA shared model cache storage:

$helm repo add csi-driver-smb \
> https://raw.githubusercontent.com/kubernetes-csi/csi-driver-smb/master/charts
$
$helm install csi-driver-smb csi-driver-smb/csi-driver-smb \
> -n kube-system --version v1.17.0
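
To confirm the driver registered before moving on (the CSIDriver object name smb.csi.k8s.io is the upstream default and is an assumption here):

$kubectl get csidriver smb.csi.k8s.io
$kubectl get pods -n kube-system | grep csi-smb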

Deploy the NVCF Stack

With the cluster ready, use the Quickstart for the one-click CLI flow. The local k3d quickstart installs the stack, registers the local cluster, and installs NVCA.

The steps below document the manual Helmfile flow and call out the local-specific differences for each step.

Step 1 (Ingress)

Follow as documented, but skip the cloud-provider annotations on the Gateway resource. k3d handles LoadBalancer services automatically via its built-in klipper-lb.
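
Before moving on, it is worth checking that the Envoy gateways referenced later in the environment file exist and are programmed (the gateway names and namespace below come from the template in Step 2):

$kubectl get gateway -n envoy-gateway-system
$# Expected: shared-gw and grpc-gw listed with PROGRAMMED=True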

Step 2 (Environment file)

Create a local development environment file from the template below (local-dev-env.yaml). Save it as environments/<name>.yaml (e.g., environments/my-local.yaml) in your nvcf-self-managed-stack directory.

environments/my-local.yaml
# NVCF Self-Hosted Local Development Environment
# For use with k3d clusters. See the Local Development guide for setup instructions.
#
# Save this file as environments/<name>.yaml in your nvcf-self-managed-stack directory.
# Create a matching secrets/<name>-secrets.yaml file with your registry credentials.
# Deploy with: HELMFILE_ENV=<name> helmfile sync

global:
  # Domain for local access (routes use .localhost TLD)
  domain: "localhost"

  # Helm chart registry (where helmfile pulls OCI charts from)
  helm:
    sources:
      registry: nvcr.io
      repository: 0833294136851237/nvcf-ncp-staging

  # Container image registry (where Kubernetes pulls images from)
  image:
    registry: nvcr.io
    repository: 0833294136851237/nvcf-ncp-staging

  # Pull secret created by create-nvcr-pull-secrets.sh (run once before deploying)
  imagePullSecrets:
    - name: nvcr-pull-secret

  # Disable node selectors for local development (pods schedule on any node)
  nodeSelectors:
    enabled: false

  # k3d uses the local-path StorageClass by default
  storageClass: local-path
  storageSize: 2Gi

  observability:
    tracing:
      enabled: false
      collectorEndpoint: ""
      collectorPort: 4317
      collectorProtocol: http

# Single Cassandra replica for local development
cassandra:
  enabled: true
  replicaCount: 1
  jvm:
    # Fast startup options -- only safe with a single replica.
    # Do NOT use these settings with multiple replicas.
    extraOpts: "-Dcassandra.superuser_setup_delay_ms=100 -Dcassandra.gossip_settle_min_wait_ms=1000"

nats:
  enabled: true

openbao:
  enabled: true
  migrations:
    issuerDiscovery:
      enabled: true

# Gateway configuration matching the local k3d setup script
ingress:
  gatewayApi:
    enabled: true
    controllerNamespace: envoy-gateway-system
    routes:
      nvcfApi:
        routeAnnotations: {}
      apiKeys:
        routeAnnotations: {}
      invocation:
        routeAnnotations: {}
      grpc:
        routeAnnotations: {}
    gateways:
      shared:
        name: shared-gw
        namespace: envoy-gateway-system
        listenerName: http
      grpc:
        name: grpc-gw
        namespace: envoy-gateway-system
        listenerName: tcp

This template is pre-configured for local development:

  • Storage: local-path (2Gi volumes, the default k3d StorageClass)
  • Cassandra: Single replica with fast startup JVM options
  • Node selectors: Disabled (pods schedule on any available node)
  • Registry: nvcr.io/0833294136851237/nvcf-ncp-staging
  • Gateway: shared-gw and grpc-gw in envoy-gateway-system namespace
  • Domain: localhost
  • imagePullSecrets: Pre-configured to reference nvcr-pull-secret (created in Step 4)

Step 3 (Secrets)

Create secrets/<name>-secrets.yaml (e.g., secrets/my-local-secrets.yaml) from the template in the control plane guide. The file name must match your environment name. Fill in your base64-encoded NGC credentials for the NGC org you’ll be deploying function images from:

$echo -n '$oauthtoken:YOUR_NGC_API_KEY' | base64

Step 4 (Pull secrets)

Run the helper script to create the nvcr-pull-secret Kubernetes secret in all NVCF namespaces:

$export NGC_API_KEY="<your-ngc-api-key>"
$bash samples/scripts/create-nvcr-pull-secrets.sh

The environment file template from Step 2 already references this secret via imagePullSecrets.
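
To verify the secret exists in each namespace after the script runs:

$kubectl get secrets -A --field-selector metadata.name=nvcr-pull-secret
$# Expected: one nvcr-pull-secret entry per NVCF namespace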

Step 5 (Deploy)

Authenticate helm and deploy using your environment name:

$helm registry login nvcr.io -u '$oauthtoken' -p "$NGC_API_KEY"
$HELMFILE_ENV=<name> helmfile sync

Replace <name> with the name you chose for your environment file (e.g., my-local).
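
Because the helm-diff plugin is installed, you can preview what helmfile will change before applying it:

$HELMFILE_ENV=<name> helmfile diff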

Step 6 (Verify)

Check that all pods are running:

$kubectl get pods -A -o wide
$# All pods should be Running or Completed
$
$helm list -A
$# All releases should show STATUS: deployed

Verify the NVCA agent discovered the fake GPUs:

$kubectl get nvcfbackends -n nvca-operator
$# Expected: nvcf-default healthy
$
$kubectl get nvcfbackends -n nvca-operator -o jsonpath='{.items[0].status.gpuUsage}' | python3 -m json.tool
$# Expected: {"H100": {"available": 16, "capacity": 16}}

Verify API connectivity using the .localhost routing (not the Gateway address, which is cluster-internal on k3d):

$# Generate an admin token
$export NVCF_TOKEN=$(curl -s -X POST "http://api-keys.localhost:8080/v1/admin/keys" \
> | python3 -c "import sys,json; print(json.load(sys.stdin)['value'])")
$
$echo "Token: ${NVCF_TOKEN:0:20}..."
$
$# List functions (should return empty)
$curl -s "http://api.localhost:8080/v2/nvcf/functions" \
> -H "Authorization: Bearer ${NVCF_TOKEN}" | python3 -m json.tool
$# Expected: {"functions": []}

The standard control plane verification commands use the Gateway address from kubectl get gateway. On k3d this returns a cluster-internal IP that is not reachable from the host. Use localhost:8080 with .localhost hostnames instead, as shown above.

Accessing Routes Locally

NVCF routes use the .localhost top-level domain, which resolves to 127.0.0.1 automatically on most systems. Access services via the k3d load balancer on port 8080:

  • http://api.localhost:8080 — NVCF API
  • http://api-keys.localhost:8080 — API Keys service
  • http://sis.localhost:8080 — SIS service used during cluster registration
  • http://invocation.localhost:8080 — Function invocation

If .localhost does not resolve automatically, add entries to /etc/hosts:

127.0.0.1 api.localhost
127.0.0.1 api-keys.localhost
127.0.0.1 sis.localhost
127.0.0.1 invocation.localhost

Wildcard subdomains (e.g., <function-id>.invocation.localhost) cannot be added to /etc/hosts. For local testing with dynamic function IDs, add specific entries or use a local DNS resolver such as dnsmasq.
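
A minimal dnsmasq sketch for wildcard resolution (the drop-in path /etc/dnsmasq.d/ and the systemctl restart command are distribution-dependent assumptions):

$# Resolve every *.invocation.localhost name to the loopback address
$echo 'address=/invocation.localhost/127.0.0.1' | sudo tee /etc/dnsmasq.d/nvcf-local.conf
$sudo systemctl restart dnsmasq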

Teardown

$# Remove the NVCF stack (use your environment name)
$HELMFILE_ENV=<name> helmfile destroy
$
$# Delete the k3d cluster
$k3d cluster delete ncp-local

Limitations

  • Fake GPUs - Function containers will be scheduled and deployed but cannot execute actual GPU workloads.
  • Single Cassandra replica - No high availability. Data may be lost on pod restart.
  • Ephemeral storage - local-path volumes are deleted when the cluster is destroyed.
  • Not suitable for performance testing - Resource constraints of a laptop do not represent production environments.