Own Your AI - Hardware, Models, Tuning: How to Actually Make Local LLMs Productive


Wed, Jun 24
06:00 PM to 09:00 PM
4future · Free
About the event

A few places are left. Apply to the waiting list and tell us your motivation to get a seat.

AI is becoming infrastructure. The question is no longer whether you use it — it's who controls it, who pays for it, and who sees your data.
Own Your AI is a new engineering group for practitioners who want to own AI, not rent it. Sovereignty over your models, your data, and your costs. Independence from a handful of hyperscalers. Mastery from the token up to the hardware. A community of people who actively build systems and want to exchange notes with peers who do the same.
For Event #1, we start exactly where most local-AI projects fail: at hardware decisions and inference tuning. Two talks from practitioners who have done this in real client engagements, plus lightning sessions and open discussion.
Who is this for?
LLM Engineers & Builders — you work with prompts, agents, and models every day and want to get more out of your local stack than the defaults deliver
Platform & Infrastructure Engineers — you run inference servers, fight with VRAM, KV cache, and tokens per second, and have to hit SLAs
Architects — you decide whether a new system runs on-prem, hybrid, or in the cloud, and you need solid numbers instead of vendor benchmarks
Engineering Leaders & Decision Makers — you plan budget and roadmap for AI workloads and want to know at which use case local actually pays off
Compliance, Legal & Security — you have to reconcile data residency, EU AI Act, and audit requirements with what the engineering teams want to deploy

Agenda
Talk 1 — Andreas Petersson
The Decision Matrix: Which LLM on Which Hardware — and When the Cloud Is the More Honest Answer
A practitioner's guide through the current inference hardware market — from 2,000-euro mini PCs to the H100. Which models actually run well on which system, where are the real limits, and at what workload does the math tip in favour of private or public cloud?
We walk through the relevant options one by one: Mac Studio as the unified-memory workhorse, Mac Mini as the cheapest entry point, AMD Ryzen AI Max+ 395 (e.g. as the Zotac ZBOX Magnus) as a new x86 contender with a large unified-memory pool, Nvidia consumer GPUs (RTX 4090 / 5090) for maximum tokens per second in single-user mode, Nvidia enterprise GPUs (H100 / A100) for multi-tenant inference, and private vs. public cloud services (dedicated EU providers, Bedrock & Co.) as the comparison anchor. For each platform: which model sizes and quantisations are realistic, which context windows work without swap, and what the total cost of ownership actually adds up to.
What we'll look at:
A concrete Decision Matrix: use case → model class → minimal viable hardware
Realistic tokens-per-second numbers for 7B, 14B, 30B, and 70B+ models across platforms
Memory-bandwidth and VRAM limits where most setups fail in production
A direct cost comparison: CapEx (hardware) vs. OpEx (cloud)
When unified memory (Apple, AMD) beats the GPU — and when it doesn't
Hybrid architectures: local vs. cloud for burst workloads
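
As a rough illustration of the memory-bandwidth limit listed above: in single-user decoding of a dense model, every generated token has to read roughly all model weights from memory once, so memory bandwidth divided by the size of the weights gives an optimistic ceiling on tokens per second. The sketch below is a back-of-envelope estimate, not a benchmark. The function names and the 400 GB/s bandwidth figure are assumptions chosen for illustration, and measured throughput will be lower.

```python
def weight_bytes(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in bytes."""
    return params_billion * 1e9 * bits_per_weight / 8


def decode_tps_upper_bound(params_billion: float,
                           bits_per_weight: float,
                           bandwidth_gb_s: float) -> float:
    """Optimistic ceiling on single-stream decode speed (tokens/s).

    Assumes a dense model whose weights are all read once per token and
    ignores KV-cache reads, compute limits, and any other overhead.
    """
    return bandwidth_gb_s * 1e9 / weight_bytes(params_billion, bits_per_weight)


# Hypothetical example: a 70B model at 4 bits per weight on a system
# with 400 GB/s of memory bandwidth (an assumed figure, not a spec).
print(f"{decode_tps_upper_bound(70, 4, 400):.1f} tok/s ceiling")  # 11.4 tok/s ceiling
```

The same arithmetic explains why a smaller quantisation or a smaller model class can matter more than raw compute for single-user use.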

Talk 2 — Andreas Burner
3× Faster, Same Hardware: 10 Tuning Knobs That Turn Your Local LLM Stack From a Toy Into a Tool
Most local LLM setups run on default parameters — and deliver a fraction of what the hardware can do. This talk shows how systematic tuning of llama.cpp, Ollama, and LM Studio took throughput from 0.3 tok/s to 6 tok/s on unchanged hardware — and why the same levers apply to vLLM on OpenShift AI.
Inference parameters do not act in isolation: temperature interacts with sampling, context size with KV cache and parallel slots, thread count with CPU topology. Get two parameters right and leave one bad default in place, and you still lose most of the performance. We walk through the ten most important knobs one by one, measured against real coding benchmarks rather than synthetic perplexity scores, and look at what becomes possible with an agent harness once the inference layer is finally configured correctly.
What we'll look at:
Batch and context sizing — significantly faster prompt processing without swapping hardware
Parallel slots & KV cache — the most common reason local setups end up in swap
CPU thread pinning on Intel hybrid cores — up to 3× throughput from a single setting
KV cache quantisation — when it wins on CPU and silently costs quality on GPU
Reasoning budget for thinking models — why the wrong cap is worse than no thinking at all
Context window sync between agent harness and inference server — the invisible bug behind aborted long-context tasks
Translation to vLLM / OpenShift AI — what carries over, what changes, what becomes more important in multi-tenant setups
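
To make the link between context size, parallel slots, and KV cache concrete, here is a minimal sizing sketch. It assumes a standard transformer with grouped-query attention, where each sequence stores one key and one value vector per layer, per KV head, per token. The model dimensions in the example (32 layers, 8 KV heads, head dimension 128) are hypothetical, and the sketch assumes every slot holds a full-length context. Real inference servers may allocate the cache differently.

```python
def kv_cache_bytes(n_layers: int,
                   n_kv_heads: int,
                   head_dim: int,
                   ctx_tokens: int,
                   bytes_per_elem: float,
                   n_slots: int = 1) -> float:
    """Estimate KV-cache memory for n_slots sequences of ctx_tokens each."""
    # Factor 2: one key vector and one value vector per token, head, and layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * ctx_tokens * n_slots


GIB = 1024 ** 3

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 16-bit cache.
one_slot = kv_cache_bytes(32, 8, 128, 8192, 2)
four_slots = kv_cache_bytes(32, 8, 128, 8192, 2, n_slots=4)
print(f"1 slot:  {one_slot / GIB:.1f} GiB")   # 1.0 GiB
print(f"4 slots: {four_slots / GIB:.1f} GiB")  # 4.0 GiB
```

Under these assumptions the cache grows linearly with both context length and slot count, which is why a setup that fits comfortably with one slot can spill into swap once several parallel slots are enabled.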

Format
Two focus talks, lightning sessions, open discussion, and networking
Language: talks in English or German — the audience decides
Location: Vienna, in person only. No online attendance.
Cadence: every 1–2 months

Hosts
Andreas Burner — Management Advisor for AI & Cloud Strategy, Board Advisor, and Court-Certified Expert. 25+ years in global enterprise technology. BurnerNet.com
Andreas Petersson — Technology Advisor and Founder of Capacity (capacity.at). Two decades of experience building and auditing decentralised, security-critical systems for enterprise and the public sector.

Registration
Seats are limited: we deliberately keep the session small so that discussion and networking actually work. Register via Luma. You'll receive the exact location, the final speaker line-up, and the agenda in good time before the event.

Own the stack. Own the data. Own the future.
More Information: https://OwnYourAI.eu

Location

4future
