Qwen/Qwen3-VL-2B-Instruct
Search 7,438+ AI models, get instant VRAM calculations, and find hardware recommendations for Llama, Qwen, DeepSeek & more — free, no signup.
32 guides · 21 hardware options · Updated 2026 · Free, no signup
Hand-picked across budgets and ecosystems. Affiliate links to Amazon. No extra cost to you.
Per-card deep dives: RTX, RX, Arc
GPU vs GPU, Mac vs PC, AMD vs NVIDIA
Best picks by budget and use case
Ollama, llama.cpp, LM Studio setup
VRAM tables, benchmarks, requirements
Multi-GPU, fine-tuning, ROCm, quantization
New to Local AI? Start Here
Run AI like ChatGPT on your own PC — free, private, offline. Ollama in one click, first model running in 10 minutes. Complete beginner guide.
New · LFM2.5-8B-A1B Hardware Requirements
Liquid AI 8.3B MoE with 1.5B active params and 128K context. Q4 fits in ~6 GB; runs on an RTX 4060 8GB and any 12 GB+ card comfortably. Day-one MLX support.
Trending · DeepSeek Hardware Requirements
What GPU to run DeepSeek R1 7B, 14B, 32B, and 70B locally — per-GPU compatibility table.
New · Qwen3 Hardware Requirements
Run Qwen3 4B–32B locally on any GPU. VRAM requirements, quantization tips, and model picks.
New · Qwen3.5 Hardware Requirements
Qwen3.5-27B needs 15 GB at Q4 (RTX 4090). 35B-A3B MoE needs 20 GB. Thinking mode overhead and Ollama setup.
New · Qwen3.6 Hardware Requirements
Qwen3.6 improves on 3.5 with identical VRAM needs. 27B on RTX 4090, 35B-A3B on Mac Studio M4 Max 64 GB.
New · Kimi K2 Hardware Requirements
Kimi K2 is a 1T MoE model — it cannot run on consumer hardware. What CAN you run: Kimi-VL-A3B, distilled 7B–32B variants.
New · CPU Offloading for LLMs
Run models too large for your VRAM using -ngl layer splitting. Speed expectations, Ollama setup, and when to skip it.
New · Mistral Large Hardware Requirements
Mistral Large 2 (123B) needs 70 GB VRAM. Runs on Mac Studio M4 Max 128 GB. Best consumer alternative: Mistral Small 3.1 on RTX 4090.
New · Why Is My LLM Slow?
6 causes with specific fixes: memory bandwidth, CPU offloading penalty, KV cache bloat, wrong quantization, thermal throttling.
New · GPU VRAM Per Dollar Tier List
Every GPU ranked by VRAM per dollar. Intel Arc B580 leads on cost per GB. RTX 3090 used is the best all-around value.
New · How Much VRAM Do I Need?
Exact VRAM for every model size: 7B needs 5 GB, 13B needs 9 GB, 70B needs 38 GB at Q4. Includes KV cache overhead tables.
New · LLM Inference Speed: Tokens Per Second by GPU
Benchmark tok/s for RTX 4060 through RTX 5090. Why memory bandwidth matters more than CUDA cores.
New · Qwen3-30B-A3B Hardware Requirements
MoE misconception guide: 30B-A3B needs 17 GB VRAM, NOT 3 GB. GPU table, quant options, and 16 GB workaround.
New · Ollama vs LM Studio vs llama.cpp vs vLLM
Complete 4-way comparison. 10-row decision matrix. All three local runners use the same inference engine — speed is identical.
New · Llama 4 Scout Hardware Requirements
109B MoE model needs 57 GB VRAM at Q4. No single consumer GPU fits it — multi-GPU or Mac Studio M4 Max required.
New · Qwen3-235B Hardware Requirements
Largest open-weight model (235B MoE) needs 119 GB VRAM. Only Mac Studio M4 Ultra fits it. Consumer alternatives listed.
New · Local LLM Function Calling and Tool Use
Set up agents that call APIs and run code with Ollama. Best models: Qwen3-14B, Llama 3.1 8B. Python examples included.
New · Codestral Hardware Requirements
Mistral's coding model (22B) needs 13 GB VRAM. Any 16 GB GPU works. FIM support for Continue.dev and Cursor auto-complete.
New · Best GPU for Coding LLMs
GPU picks matched to the best coding models by VRAM tier. RTX 4060 Ti 16GB for Codestral; RTX 4090 for Qwen2.5-Coder-32B.
New · LLMs on Integrated Graphics: Intel & AMD iGPU
AMD Radeon 890M gets 12-18 tok/s on 7B models. Best models for no-GPU laptops and how to enable GPU acceleration.
New · Gemma 4 Hardware Requirements
Gemma 4 27B MoE runs in 8 GB VRAM at Q4 — RTX 4060 gets ~28 tok/s. Best model for 8 GB GPUs in 2026.
What Can I Run on My GPU?
VRAM tier guide: 8GB through 128GB — exact model names, quantization, and speeds.
Best GPU for LLMs
RTX 4060 to RTX 5090 vs Mac Studio — budget tiers compared.
CPU-Only LLM Inference
Run LLMs without a GPU. RAM requirements, speed benchmarks, and best models for CPU inference.
New · RTX 4090 for LLMs
Best single consumer GPU: 24 GB GDDR6X, 1008 GB/s, up to 110 tok/s. Full guide with model table.
New · Best LLMs to Run Locally
Top model picks by VRAM tier — Qwen3 8B, DeepSeek-R1-Distill, Gemma 3, Phi-4, Llama 3.3 70B.
New · Gemma 3 Hardware Requirements
Google's Gemma 3 (4B to 27B). The 27B fits in 16 GB at Q4. GPU picks and Ollama setup.
New · RTX 5070 for LLMs
Blackwell 12 GB GDDR7. Runs Qwen3 14B at Q4 — 33% more bandwidth than RTX 4070 at a lower price.
New · Best LLM for Coding Locally
Qwen3 14B, Codestral 22B, Phi-4 14B — ranked by GPU tier. Ollama commands + Continue.dev, Aider, Cursor setup.
Comparison · Qwen3 14B vs Phi-4 14B
Head-to-head at 12 GB VRAM. Both fit in Q4_K_M — which wins for coding, math, and everyday use?
Tutorial · LLMs on Windows: Complete Setup
Run Ollama and LM Studio on Windows 10/11. NVIDIA, AMD, Intel Arc GPU driver setup. Fix CUDA detection, CPU fallback, port conflicts.
Apple Silicon · Apple Silicon for Local LLMs
How M3 and M4 unified memory actually performs vs discrete GPUs. The Mac Studio M4 Max 64GB fits 70B at Q4_K_M — no $-equivalent NVIDIA setup does.
Budget · Best Budget GPU for LLMs
Sub-$500 picks that actually run modern models. RTX 3060 12GB still leads on $/GB VRAM in 2026.
Model Guide · Llama 4 Hardware Requirements
Llama 4 Scout (109B MoE) runs on 12-16 GB VRAM at Q4. Maverick needs 200+ GB. Which hardware to buy for Meta's latest model.
Tutorial · LM Studio vs Ollama vs Jan
LM Studio for beginners, Ollama for developers and servers, Jan for privacy. Which local AI app should you install first?
Model Guide · Gemma 4 Hardware Requirements
Gemma 4 27B MoE (4B active) fits in 8 GB VRAM at Q4 — RTX 4060 runs it at 28 tok/s. 31B dense needs 24 GB.
GPU Guide · RTX 5060 for LLMs: 8 GB Only
Only 8 GB VRAM — runs 7-8B models well, cannot run 14B. The RTX 5060 Ti 16 GB is the smarter AI buy.
Tutorial · Local AI Coding Assistant Setup
Free GitHub Copilot: VS Code + Continue + Ollama + Qwen3-Coder 14B. Works offline, 100% private, needs 16 GB VRAM.
Comparison · AMD vs NVIDIA for Local LLMs
AMD wins on VRAM per dollar; NVIDIA wins on Windows ease. ROCm 6.3 on Linux is finally production-ready. Full comparison for 2026.
GPU Guide · AMD RX 9060 XT LLM Guide
16 GB GDDR6 — double the VRAM of RTX 5060 for a small premium. Runs Qwen3 14B Q4 at 38 tok/s with ROCm 6.3.
Buying Guide · AMD Strix Halo Mini PC Guide
128 GB unified memory runs Llama 70B at 8-10 tok/s — about half the price of Mac Studio M4 Max 128 GB.
Tutorial · Run Claude Locally? Here's What Works
Claude weights are not public. Best local alternatives: Qwen3 14B matches Claude Haiku (16 GB), Qwen3 72B matches Sonnet (48 GB).
Guide · Best LLMs for 48 GB VRAM
Llama 3.3 70B Q4 fits at 42 GB, runs 20 tok/s on Mac M4 Pro. Qwen3 72B also fits. The full 70B tier breakdown.
Guide · Best LLMs for 24 GB VRAM
Qwen3 32B Q4 (19 GB) runs at 28 tok/s on RTX 4090. The jump from 16 GB that unlocks real 32B quality.
Tutorial · Open WebUI Setup Guide
ChatGPT-style interface for your local Ollama — RAG, voice, web search, multi-user. Docker install in one command.
Buying Guide · Best Budget GPU for LLMs
Entry tier: RTX 3060 12 GB. Budget tier: Intel Arc B580. Mid-budget: AMD RX 9060 XT 16 GB. Best value overall: RTX 5070 Ti.
Advanced · LLM Fine-Tuning Hardware
QLoRA on 7B needs 10 GB VRAM; 14B needs 24 GB. RTX 4090 is the best consumer GPU. AMD lacks tool support — NVIDIA only.
Advanced · Dual GPU Setup for 70B Models
Two RTX 3090s give 48 GB combined VRAM — runs Llama 70B at 8-10 tok/s. NVLink vs PCIe, PSU sizing, Ollama setup.
Tutorial · Local Voice AI Setup
Whisper + Ollama + Kokoro TTS: fully offline voice assistant in 2026. 8 GB VRAM, 2-5s latency, works air-gapped. Setup guide.
Comparison · RTX 4070 Ti Super vs RTX 4080
Both 16 GB, same models, 7% speed difference, a noticeable price gap. In 2026 the RTX 5070 Ti beats both by 33% for similar money.
Comparison · RTX 3090 vs RTX 4070 Ti Super
Used 24 GB vs new 16 GB. 70B models need 24 GB — which is the better buy for local LLMs?
Comparison · RTX 4070 vs 4080 vs 4090
12 GB vs 16 GB vs 24 GB VRAM. 32B models need 24 GB — which GPU should you buy?
New · AI on Your Gaming PC
Your gaming GPU already runs AI. RTX 4060 to RTX 4090 tier table — see exactly what you can run with your GPU right now.
New · Best LLMs for 8 GB VRAM
RTX 3060 / 4060 users: Qwen3 7B Q8 runs at 35 t/s, Phi-4 14B fits at Q4. Full model fit table and best picks.
New · Private Offline AI Setup
Run AI with zero data leaving your PC. Ollama + Qwen3 14B, no internet after setup, works on air-gapped machines.
New · Run ChatGPT Locally: Free
Open-source models now match GPT-4 quality. Free, private, no subscription. Ollama + Open WebUI in under 10 minutes.
New · Best LLMs for 16 GB VRAM
RTX 4060 Ti, 4080, 4070 Ti Super users: Qwen3 14B Q8 fits at 14.8 GB and runs at 30+ t/s. Gemma 3 27B Q4 also fits.
Comparison · RTX 5070 vs RTX 4070 for LLMs
Same 12 GB VRAM, same models. 5070 is 33% faster for less money. Upgrade verdict: buying new — yes; already own 4070 — no.
Comparison · RTX 5080 vs RTX 4090 for LLMs
RTX 5080 is 16 GB, RTX 4090 is 24 GB. Both fast for 7-14B — only the 4090 fits 32B at Q4. Buy the 5080 unless you need 32B+.
Comparison · RTX 4060 vs RTX 4070 for LLMs
8 GB vs 12 GB: the 4070 is nearly 2x faster AND runs 14B models the 4060 cannot. Worth the premium if you run anything above 7B.
New · Best LLMs for 24 GB VRAM
RTX 4090/3090 users: Qwen3 32B Q4 runs at 38 t/s. DeepSeek R1 32B fits. Full model fit table and top picks for the 32B sweet spot.
New · Run LLMs on Mac: Ollama Setup
M1 through M4 all run AI locally. Metal GPU acceleration is automatic. One command install, then ollama run qwen3:8b — done in 5 minutes.
New · M4 Mac Mini vs M4 Pro for LLMs
M4 24 GB vs M4 Pro 48 GB. M4 Pro is 2.3x faster and the only Mac Mini that runs 70B models. Clear buy recommendation inside.
New · Run LLMs on Linux: Ollama Setup
One curl command installs Ollama on Ubuntu, Fedora, or Arch. NVIDIA works instantly. AMD needs ROCm — step-by-step included.
Tutorial · How to Run DeepSeek R1 Locally
Step-by-step Ollama setup for DeepSeek R1 distills (8B–70B). Thinking mode explained, common issues fixed.
Tutorial · How to Run Qwen3 Locally
Ollama setup for Qwen3 0.6B–32B and MoE. Thinking mode per query, LM Studio alternative, Windows/Mac/Linux.
Tutorial · Open WebUI + Ollama Setup
Get a ChatGPT-like browser interface for your local LLMs in 5 minutes. Free, private, no API key required.
Tutorial · How to Run Llama 3 Locally
Run Meta Llama 3.1 8B or Llama 3.3 70B via Ollama. Step-by-step for Windows, Mac, Linux. Hardware picks included.
Tutorial · How to Run Mistral Locally
Run Mistral 7B (4.5 GB), Nemo 12B, or Small 22B via Ollama. Works on any 8 GB+ GPU. Windows, Mac, Linux.
Tutorial · How to Run Gemma 3 Locally
Google's Gemma 3 27B fits in 16 GB VRAM. Multimodal — process images locally. Ollama setup guide.
Tutorial · How to Run Phi-4 Locally
Microsoft's Phi-4 14B needs only 9-10 GB VRAM and beats Llama 3.1 8B on reasoning. Ollama setup guide.
Tutorial · How to Run Llama 4 Scout Locally
Llama 4 Scout needs 58 GB VRAM due to MoE — explains why, dual-GPU llama.cpp setup, and alternatives for 8-24 GB GPUs.
Tutorial · Local AI Coding Assistant Setup
VS Code + Ollama + Continue.dev — free GitHub Copilot alternative. Chat model + FIM autocomplete configured in under 10 minutes.
Comparison · RTX 5070 Ti vs RTX 5080 for LLMs
Both have 16 GB GDDR7 and run identical models. 5080 is 7% faster. Buy the 5070 Ti unless budget is no concern.
Tutorial · LM Studio: Complete Setup Guide
Desktop GUI for running LLMs offline. Download, load a model, and chat in minutes. No terminal needed. GPU acceleration, model browser, OpenAI-compatible API.
New · Best LLMs for 32 GB VRAM (RTX 5090)
RTX 5090 users: Qwen3 32B Q8 runs at 45+ t/s. QwQ-32B reasoning model fits. Full model fit table for the 32 GB tier.
New · Best LLMs for 12 GB VRAM (RTX 4070 / 5070)
Qwen3 14B Q4 runs at 30 t/s on RTX 4070. Phi-4 14B fits. Gemma 3 12B is fastest. The 12 GB sweet spot explained.
Reference · Ollama Commands Cheat Sheet
Every Ollama CLI command in one place: run, pull, list, API endpoints, Modelfile guide, environment variables, one-liners.
New · AMD RX 9070 XT for Local LLMs
16 GB GDDR6, 896 GB/s — matches RTX 5080 LLM throughput. RDNA 4 ROCm 6.2 setup guide and model fit table.
Reality Check · DeepSeek V3: Can You Run It Locally?
DeepSeek V3 (685B) needs 390 GB VRAM — no consumer GPU can run it. Honest analysis and the best consumer alternatives.
New · RTX 5060 Ti for LLMs: 8 GB vs 16 GB
Always buy the 16 GB variant. Qwen3 14B Q4 at 35+ t/s. The RTX 5070 is 2x faster — worth considering.
Reference · LLM System Requirements: CPU, RAM, PSU
GPU VRAM is the bottleneck but you also need 32 GB RAM, NVMe SSD, and a right-sized PSU. Complete spec tables per tier.
Tutorial · Run Qwen3 30B MoE Locally
30B total params, only 3B active — runs at 30 t/s on RTX 4090 while fitting in 20 GB. Thinking mode guide and performance vs dense comparison.
Developer · Ollama Python API Guide
Use Ollama from Python: ollama library, OpenAI-compatible endpoint, streaming, embeddings, async. Working code examples for all patterns.
Guide · How to Run 70B Models Locally
Llama 3.3 70B needs 42 GB VRAM. RTX 4090 alone is not enough. Mac M4 Pro 48 GB is the best consumer option. Dual GPU setup also covered.
Comparison · RTX 5090 vs RTX 4090 for LLMs
5090 is 78% faster and has 32 GB vs 24 GB. Enables Qwen3 32B at Q6. Worth the premium? Speed table, model fit, buy verdict.
Comparison · Mac Mini vs Mac Studio for LLMs
M4 Pro 48 GB runs 70B at 12 t/s. M4 Max 64 GB runs it at 20 t/s. Full comparison and buy recommendation.
New · Best LLMs for 6 GB VRAM
RTX 3060 6GB and GTX 1660 Super: Qwen3 7B Q4 at 30 t/s. Limited to 7-8B but surprisingly capable. Upgrade analysis vs 8 GB included.
Advanced · llama.cpp Guide: Run Without Ollama
Direct GGUF inference with full control. CUDA/Metal/ROCm build, GPU layer flags, server mode, performance tuning. For power users.
Comparison · RTX 5070 vs RTX 5070 Ti for LLMs
The 5070 is faster than the 5070 Ti on shared models — the 5070 Ti just has more VRAM. The price step up buys 14B Q8 capability, not speed.
Tutorial · RAG: Chat With Your Own Documents
Add document search to your local LLM: Open WebUI (no code), AnythingLLM (one app), or Python LangChain. Hardware requirements and embedding model picks.
Use Case · Best LLMs for Writing Locally
Llama 3.3 70B for fiction, Qwen3 14B Q8 for content, Mistral Small 22B for creative. Temperature guide and system prompts for each use case.
FAQ · Is 8 GB VRAM Enough for AI?
Yes — Qwen3 7B at 35 t/s fits comfortably, but 14B models at Q4 do not. Upgrade analysis, Stable Diffusion verdict, best 8 GB GPUs.
Tutorial · Vision/Multimodal LLMs Locally
Gemma 3 4B runs image + text in 4 GB VRAM. Describe images, extract text from screenshots, analyze charts — all private with Ollama.
Advanced · Home Server LLM Setup Guide
Run Ollama 24/7 on a home server. Network access, Tailscale VPN, Open WebUI frontend. Power cost: Mac Mini M4 Pro is 10x cheaper to run than a gaming PC.
Tutorial · Run Ollama in Docker (GPU Support)
NVIDIA GPU passthrough in 1 extra flag. Docker Compose file with Open WebUI included. 3 commands from zero to running model.
Reference · How to Speed Up Ollama
Flash attention, optimized context size, Q4_K_M, KEEP_ALIVE=-1. Full environment variables reference and benchmark commands.
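Several of the guide summaries above quote VRAM figures at Q4 (for example 5 GB for 7B, 38 GB for 70B, and 17 GB rather than 3 GB for the 30B-A3B MoE). The Python sketch below shows the kind of back-of-the-envelope arithmetic behind such numbers. It is not the site's calculator: the function name, the 4.5 bits-per-weight figure for a Q4-style quantization, and the flat 1 GB allowance for KV cache and runtime buffers are all assumptions chosen for illustration, so the results only approximate the figures quoted in the guides.

```python
def estimate_vram_gb(total_params_billion: float,
                     bits_per_weight: float = 4.5,
                     overhead_gb: float = 1.0) -> float:
    """Rough VRAM estimate in decimal GB for a quantized model.

    total_params_billion: TOTAL parameter count in billions. For MoE
        models this is the full count, not the active-parameter count.
    bits_per_weight: assumed average bits per weight (4.5 is an
        illustrative value for a Q4-style quantization).
    overhead_gb: assumed flat allowance for KV cache and buffers.
    """
    if total_params_billion <= 0 or bits_per_weight <= 0 or overhead_gb < 0:
        raise ValueError("sizes must be positive and overhead non-negative")
    # params (billions) * bits / 8 gives weight size in billions of bytes, i.e. GB.
    weights_gb = total_params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


if __name__ == "__main__":
    models = [
        ("7B dense", 7),
        ("13B dense", 13),
        ("30B-A3B MoE (total params)", 30),
        ("70B dense", 70),
    ]
    for name, size in models:
        print(f"{name}: ~{estimate_vram_gb(size):.1f} GB")
```

Passing 30 for the MoE model, rather than its 3B active parameters, reflects the point made in the Qwen3-30B-A3B summary: every expert has to be resident in memory, so the footprint follows total parameters even though only a fraction is used per token.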
631 shown
Qwen/Qwen3-VL-2B-Instruct
google/electra-base-discriminator
BAAI/bge-small-en-v1.5
BAAI/bge-m3
Qwen/Qwen3-0.6B
openai-community/gpt2
BAAI/bge-large-en-v1.5
Qwen/Qwen2.5-7B-Instruct
deepseek-ai/DeepSeek-V3.2
Qwen/Qwen3-4B-Instruct-2507
Qwen/Qwen3-8B
BAAI/bge-reranker-v2-m3
meta-llama/Llama-3.1-8B-Instruct
Qwen/Qwen2.5-1.5B-Instruct
Qwen/Qwen2.5-3B-Instruct
Qwen/Qwen2.5-VL-7B-Instruct
BAAI/bge-base-en-v1.5
facebook/opt-125m
google/gemma-4-31B-it
trl-internal-testing/tiny-Qwen2ForCausalLM-2.5
Qwen/Qwen3.5-9B
openai/gpt-oss-20b
google/gemma-4-26B-A4B-it
Qwen/Qwen2.5-0.5B-Instruct
meta-llama/Llama-3.2-1B-Instruct
Qwen/Qwen3-Embedding-0.6B
Qwen/Qwen3-1.7B
google/gemma-4-E4B-it
google/vit-base-patch16-224
Qwen/Qwen3-32B
dphn/dolphin-2.9.1-yi-1.5-34b
Qwen/Qwen3.5-4B
Qwen/Qwen3-VL-8B-Instruct
openai/gpt-oss-120b
Qwen/Qwen3-4B
Qwen/Qwen2-VL-2B-Instruct
deepseek-ai/DeepSeek-R1
Qwen/Qwen2-1.5B-Instruct
Qwen/Qwen3.5-35B-A3B
google/mobilebert-uncased
Qwen/Qwen2.5-VL-3B-Instruct
unsloth/gemma-4-26B-A4B-it-GGUF
microsoft/table-transformer-detection
meta-llama/Meta-Llama-3-8B
mistralai/Mistral-7B-Instruct-v0.3
google/vit-base-patch16-224-in21k
Qwen/Qwen3.5-27B
google/gemma-4-E2B-it
TinyLlama/TinyLlama-1.1B-Chat-v1.0
EleutherAI/pythia-160m
Qwen/Qwen3.5-0.8B
Qwen/Qwen3-14B
BAAI/bge-reranker-base
llava-hf/llava-1.5-7b-hf
distilbert/distilgpt2
google/gemma-3-12b-it
Qwen/Qwen3-Coder-30B-A3B-Instruct
vikhyatk/moondream2
moonshotai/Kimi-K2.5
Qwen/Qwen2.5-14B-Instruct
hmellor/tiny-random-LlamaForCausalLM
microsoft/TRELLIS-image-large
microsoft/deberta-v3-base
deepseek-ai/DeepSeek-OCR
Qwen/Qwen2-VL-7B-Instruct
Qwen/Qwen3.6-35B-A3B
openai-community/gpt2-large
BAAI/bge-small-zh-v1.5
Qwen/Qwen3-VL-4B-Instruct
Qwen/Qwen3-VL-Embedding-2B
google/gemma-3-4b-it
mistralai/Mistral-7B-Instruct-v0.2
Qwen/Qwen2.5-Coder-7B-Instruct
Qwen/Qwen3.6-35B-A3B-FP8
nvidia/Gemma-4-31B-IT-NVFP4
google/siglip-so400m-patch14-384
Qwen/Qwen3-0.6B-FP8
meta-llama/Llama-3.2-3B-Instruct
unsloth/gemma-4-31B-it-GGUF
stabilityai/stable-diffusion-xl-base-1.0
unsloth/Qwen3.6-35B-A3B-GGUF
Qwen/Qwen2.5-7B
google/siglip-base-patch16-224
deepseek-ai/DeepSeek-R1-Distill-Llama-8B
Qwen/Qwen2.5-0.5B
Qwen/Qwen3-Embedding-8B
RedHatAI/Llama-3.2-1B-Instruct-FP8-dynamic
Qwen/Qwen3-4B-Base
Qwen/Qwen2.5-14B-Instruct-AWQ
unsloth/gemma-4-E4B-it-GGUF
Qwen/Qwen3-ASR-1.7B
google/flan-t5-base
Qwen/Qwen3-Embedding-4B
meta-llama/Llama-2-7b-hf
Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
Qwen/Qwen2.5-Coder-7B
Qwen/Qwen3.5-2B
Qwen/Qwen2-VL-7B-Instruct-AWQ
nvidia/bigvgan_v2_22khz_80band_256x
microsoft/Phi-3.5-vision-instruct
EleutherAI/pythia-70m-deduped
microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract
meta-llama/Meta-Llama-3-8B-Instruct
microsoft/table-transformer-structure-recognition-v1.1-all
apple/OpenELM-1_1B-Instruct
microsoft/Phi-4-mini-instruct
Qwen/Qwen3.6-27B-FP8
meta-llama/Llama-3.1-8B
deepseek-ai/DeepSeek-OCR-2
meta-llama/Llama-3.2-1B
Qwen/Qwen3-TTS-12Hz-1.7B-Base
Qwen/Qwen3.5-35B-A3B-FP8
Qwen/Qwen3-Reranker-0.6B
zai-org/GLM-5-FP8
opendatalab/MinerU2.5-2509-1.2B
HuggingFaceTB/SmolLM2-135M-Instruct
stable-diffusion-v1-5/stable-diffusion-v1-5
nvidia/DeepSeek-R1-0528-NVFP4-v2
microsoft/mdeberta-v3-base
Qwen/Qwen2.5-32B-Instruct-AWQ
Qwen/Qwen3.5-27B-FP8
llamafactory/tiny-random-Llama-3
Qwen/Qwen3.5-397B-A17B-FP8
Qwen/Qwen3-VL-Embedding-8B
nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Qwen/Qwen3-VL-235B-A22B-Instruct
nvidia/Kimi-K2.5-NVFP4
Qwen/Qwen2.5-7B-Instruct-AWQ
Qwen/Qwen3-VL-32B-Instruct
Qwen/Qwen3-30B-A3B
Qwen/Qwen2.5-Coder-32B-Instruct
Tongyi-MAI/Z-Image-Turbo
cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
google/embeddinggemma-300m
microsoft/table-transformer-structure-recognition
google/t5gemma-s-s-prefixlm
bigscience/bloomz-560m
HuggingFaceTB/SmolLM2-135M
deepseek-ai/DeepSeek-V3
Qwen/Qwen3-8B-AWQ
OpenGVLab/InternVL2-2B
BAAI/bge-multilingual-gemma2
microsoft/VibeVoice-Realtime-0.5B
cyankiwi/gemma-4-31B-it-AWQ-4bit
kaitchup/Phi-3-mini-4k-instruct-gptq-4bit
Qwen/Qwen3-30B-A3B-Instruct-2507
google/electra-small-discriminator
mistralai/Mistral-Small-3.2-24B-Instruct-2506
unsloth/gemma-4-E2B-it-GGUF
mistralai/Voxtral-Mini-4B-Realtime-2602
allenai/longformer-base-4096
Qwen/Qwen3.6-27B
deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct
Qwen/Qwen2.5-Coder-14B-Instruct
BAAI/bge-reranker-large
Qwen/Qwen2.5-Coder-32B-Instruct-AWQ
nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1
Qwen/Qwen3.5-122B-A10B
unsloth/Qwen3.5-9B-GGUF
nvidia/bigvgan_v2_44khz_128band_512x
unsloth/Qwen3.5-35B-A3B-GGUF
microsoft/deberta-v3-large
unsloth/Qwen3.6-27B-GGUF
Qwen/Qwen3-Reranker-4B
google/gemma-4-E4B
Qwen/Qwen3-Coder-Next
microsoft/Florence-2-base
microsoft/tapex-base-finetuned-wikisql
google/owlv2-base-patch16-ensemble
BAAI/bge-large-zh-v1.5
Qwen/Qwen3.5-35B-A3B-GPTQ-Int4
crynux-network/sdxl-turbo
Qwen/Qwen3-VL-32B-Instruct-FP8
microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224
stabilityai/sdxl-turbo
Qwen/Qwen2.5-Coder-14B-Instruct-AWQ
lightonai/LightOnOCR-2-1B
microsoft/Florence-2-large
crynux-network/stable-diffusion-v1-5
nvidia/llama-nemotron-embed-1b-v2
Qwen/Qwen2.5-Omni-3B
black-forest-labs/FLUX.1-dev
Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF
microsoft/layoutlmv3-base
black-forest-labs/FLUX.1-schnell
microsoft/Phi-3-mini-4k-instruct
Qwen/Qwen2.5-Coder-7B-Instruct-GPTQ-Int4
microsoft/Phi-3.5-mini-instruct
allenai/unifiedqa-t5-small
HuggingFaceTB/SmolVLM-256M-Instruct
Qwen/Qwen3-Omni-30B-A3B-Instruct
nvidia/parakeet-ctc-1.1b
moonshotai/Kimi-K2.6
microsoft/deberta-v3-small
Qwen/Qwen3-TTS-12Hz-0.6B-Base
microsoft/VibeVoice-ASR
Qwen/Qwen3-VL-30B-A3B-Instruct
microsoft/trocr-base-printed
mistralai/Mixtral-8x7B-Instruct-v0.1
Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2
codgician/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GPTQ-int4
pytorch/gemma-3-27b-it-AWQ-INT4
microsoft/phi-4
mlx-community/gemma-3-4b-it-qat-4bit
google/siglip2-so400m-patch14-384
microsoft/layoutlmv2-base-uncased
CompVis/stable-diffusion-v1-4
microsoft/llmlingua-2-xlm-roberta-large-meetingbank
google/fnet-base
stabilityai/sd-turbo
Qwen/Qwen2.5-Coder-1.5B
google/siglip2-base-patch16-224
microsoft/deberta-large-mnli
mistralai/Voxtral-Mini-3B-2507
microsoft/wavlm-base-plus
microsoft/phi-2
microsoft/wavlm-large
google/flan-t5-small
ricdomolm/mini-coder-1.7b
google/flan-t5-large
mistralai/Mistral-Small-3.1-24B-Instruct-2503
microsoft/resnet-18
Qwen/Qwen2.5-Omni-7B
nvidia/llama-nemotron-rerank-1b-v2
Qwen/Qwen3-ASR-0.6B
bartowski/Qwen2.5-Coder-7B-Instruct-GGUF
nvidia/personaplex-7b-v1
Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
google/madlad400-3b-mt
allenai/specter2_base
google/siglip2-base-patch16-naflex
TechxGenus/DeepSeek-Coder-V2-Lite-Instruct-AWQ
Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
google/bert_uncased_L-2_H-128_A-2
nvidia/speakerverification_en_titanet_large
frankjoshua/novaAnimeXL_ilV140
microsoft/deberta-xlarge-mnli
NexVeridian/Qwen3-Coder-Next-8bit
allenai/scibert_scivocab_uncased
Qwen/Qwen2.5-Coder-3B-Instruct
QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ
BAAI/bge-base-zh-v1.5
mistralai/Mistral-Nemo-Instruct-2407
microsoft/swinv2-tiny-patch4-window16-256
nvidia/canary-1b-flash
BAAI/bge-small-en
microsoft/Phi-4-multimodal-instruct
nvidia/parakeet-tdt-0.6b-v3
microsoft/unispeech-sat-large-sv
google/t5-v1_1-xxl
Qwen/Qwen3-Coder-Next-FP8
BAAI/bge-base-zh
google/siglip2-so400m-patch16-naflex
bigscience/bloom-560m
microsoft/markuplm-base
mistralai/Devstral-Small-2-24B-Instruct-2512
stabilityai/stable-video-diffusion-img2vid-xt
microsoft/graphcodebert-base
Qwen/Qwen2.5-Coder-7B-Instruct-AWQ
lightx2v/Qwen-Image-Lightning
microsoft/resnet-50
nvidia/segformer-b0-finetuned-ade-512-512
microsoft/VibeVoice-ASR-HF
stabilityai/sdxl-vae
John6666/diving-illustrious-real-asian-v50-sdxl
John6666/one-obsession-17-red-sdxl
microsoft/codebert-base
google/siglip2-base-patch16-512
diffusers/stable-diffusion-xl-1.0-inpainting-0.1
google/byt5-small
martineux/dvine82-xl
meta-llama/Prompt-Guard-86M
microsoft/beit-base-patch16-224
stabilityai/stable-diffusion-3.5-medium
List-cloud/List-3.0-Ultra-Coder-Brain
microsoft/VibeVoice-1.5B
mistralai/Ministral-3-3B-Instruct-2512
google/gemma-4-E2B
microsoft/Phi-3-mini-128k-instruct
google/pegasus-xsum
playgroundai/playground-v2.5-1024px-aesthetic
google/timesfm-2.5-200m-transformers
stabilityai/stable-diffusion-xl-refiner-1.0
google/mt5-small
google/siglip2-so400m-patch16-384
microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
microsoft/BiomedVLP-CXR-BERT-specialized
deepseek-ai/deepseek-coder-6.7b-instruct
lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit
Qwen/Qwen2.5-Coder-0.5B-Instruct
lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-5bit
unsloth/Qwen3-Coder-Next-GGUF
google/mt5-large
ByteDance/SDXL-Lightning
John6666/nova-furry-xl-il-v120-sdxl
stabilityai/TripoSR
cagliostrolab/animagine-xl-4.0
Qwen/Qwen2.5-Coder-1.5B-Instruct
stable-diffusion-v1-5/stable-diffusion-inpainting
lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-8bit
mistralai/Ministral-3-14B-Instruct-2512
mistralai/Mistral-7B-v0.3
casperhansen/deepseek-coder-v2-instruct-awq
lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-6bit
h94/IP-Adapter-FaceID
Qwen/Qwen-Image
google/siglip2-so400m-patch14-224
John6666/prefect-illustrious-xl-v3-sdxl
microsoft/harrier-oss-v1-0.6b
codellama/CodeLlama-7b-hf
microsoft/layoutlm-base-uncased
nvidia/Llama-4-Scout-17B-16E-Instruct-FP8
nvidia/segformer-b1-finetuned-ade-512-512
optimum-intel-internal-testing/tiny-stable-diffusion-torch
nvidia/parakeet-tdt-0.6b-v2
google/owlvit-base-patch32
nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4
nphSi/Z-Image-Lora
nvidia/audio-flamingo-3-hf
bigcode/tiny_starcoder_py
microsoft/trocr-large-handwritten
mistralai/Ministral-8B-Instruct-2410
optimum-intel-internal-testing/tiny-random-stable-diffusion-xl
nvidia/canary-1b-v2
microsoft/xclip-base-patch32
microsoft/kosmos-2-patch14-224
mistralai/Mixtral-8x7B-v0.1
microsoft/speecht5_tts
google/bigbird-roberta-base
microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank
cagliostrolab/animagine-xl-3.1
John6666/obsession-illustriousxl-v10-sdxl
ostris/OpenFLUX.1
microsoft/Phi-3-vision-128k-instruct
unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit
nvidia/llama-nemotron-rerank-vl-1b-v2
microsoft/trocr-base-handwritten
microsoft/unixcoder-base
microsoft/harrier-oss-v1-270m
nvidia/Llama-3.3-70B-Instruct-NVFP4
ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF
lmstudio-community/Qwen2.5-Coder-14B-Instruct-MLX-4bit
optimum-intel-internal-testing/stable-diffusion-3-tiny-random
microsoft/deberta-v2-xlarge
google/siglip-large-patch16-384
John6666/amanatsu-illustrious-v11-sdxl
segmind/small-sd
microsoft/wavlm-base-plus-sv
Wan-AI/Wan2.1-T2V-1.3B-Diffusers
city96/FLUX.1-dev-gguf
stabilityai/sd-x2-latent-upscaler
nvidia/Llama-3.1-8B-Instruct-NVFP4
microsoft/trocr-large-printed
cyankiwi/Qwen3-Coder-30B-A3B-Instruct-AWQ-4bit
google/siglip2-base-patch16-256
Qwen/Qwen2.5-Coder-1.5B-Instruct-AWQ
optimum-intel-internal-testing/tiny-random-flux
mistralai/Ministral-3-8B-Reasoning-2512-GGUF
BAAI/bge-base-en
BAAI/bge-reranker-v2.5-gemma2-lightweight
meta-llama/Llama-Prompt-Guard-2-86M
microsoft/infoxlm-large
BSC-LT/salamandra-7b-instruct
lmstudio-community/Qwen2.5-Coder-14B-Instruct-MLX-8bit
stabilityai/sd-vae-ft-mse
Tencent-Hunyuan/HunyuanDiT-v1.1-Diffusers-Distilled
Qwen/Qwen2.5-Coder-7B-Instruct-GGUF
microsoft/deberta-base
microsoft/wavlm-base-plus-sd
google/mt5-base
RunDiffusion/Juggernaut-XL-v9
BAAI/bge-small-zh
RedHatAI/DeepSeek-Coder-V2-Lite-Instruct-FP8
google/t5-v1_1-large
nvidia/llama-nemotron-embed-vl-1b-v2
Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8
mistralai/Mistral-Small-24B-Instruct-2501
nvidia/music-flamingo-2601-hf
microsoft/Phi-3.5-MoE-instruct
nvidia/canary-qwen-2.5b
cagliostrolab/animagine-xl-3.0
microsoft/mpnet-base
mistralai/Ministral-3-8B-Instruct-2512
optimum-intel-internal-testing/tiny-random-latent-consistency
BAAI/bge-large-en
nvidia/stt_ar_fastconformer_hybrid_large_pcd_v1.0
cyankiwi/Qwen3-Coder-Next-AWQ-4bit
Salesforce/codegen-350M-mono
Laxhar/noobai-XL-1.1
lightx2v/Qwen-Image-2512-Lightning
microsoft/deberta-base-mnli
nvidia/segformer-b4-finetuned-ade-512-512
google/siglip2-giant-opt-patch16-384
xinsir/controlnet-union-sdxl-1.0
SimianLuo/LCM_Dreamshaper_v7
google/siglip2-so400m-patch16-256
microsoft/swin-base-patch4-window12-384-in22k
google/siglip2-base-patch16-384
nvidia/mit-b0
Wan-AI/Wan2.2-TI2V-5B-Diffusers
John6666/hassaku-xl-illustrious-v31-sdxl
google/umt5-xxl
SG161222/Realistic_Vision_V5.1_noVAE
bigcode/starcoder2-3b
T5B/Z-Image-Turbo-FP8
BAAI/AltCLIP
google/rembert
microsoft/speecht5_asr
microsoft/prophetnet-large-uncased
Wan-AI/Wan2.2-T2V-A14B-Diffusers
XLabs-AI/xflux_text_encoders
microsoft/codebert-base-mlm
google/bert_for_seq_generation_L-24_bbc_encoder
OnomaAIResearch/Illustrious-xl-early-release-v0
CohereLabs/c4ai-command-a-03-2025
google/siglip2-large-patch16-256
google/flan-t5-xl
codellama/CodeLlama-7b-Instruct-hf
microsoft/speecht5_hifigan
nvidia/Alpamayo-1.5-10B
SG161222/RealVisXL_V5.0
stabilityai/stable-diffusion-3-medium-diffusers
microsoft/layoutlmv3-large
Qwen/Qwen2.5-Coder-1.5B-Instruct-GGUF
ali-vilab/text-to-video-ms-1.7b
janhq/Jan-v3-4B-base-instruct-gguf
google/byt5-base
nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8
google/siglip2-so400m-patch16-512
unsloth/ERNIE-Image-Turbo-GGUF
nvidia/Llama-4-Scout-17B-16E-Instruct-NVFP4
optimum-intel-internal-testing/tiny-random-sana
QuantStack/Wan2.2-T2V-A14B-GGUF
city96/FLUX.1-schnell-gguf
mistralai/Devstral-Small-2505
ostris/zimage_turbo_training_adapter
Qwen/Qwen-Image-2512
nvidia/nemotron-colembed-vl-4b-v2
google/vit-base-patch16-384
google/vit-large-patch16-224-in21k
nvidia/segformer-b2-finetuned-ade-512-512
nvidia/segformer-b5-finetuned-ade-640-640
Lykon/DreamShaper
nvidia/diar_streaming_sortformer_4spk-v2.1
microsoft/harrier-oss-v1-27b
mistralai/Mistral-Small-4-119B-2603
unsloth/Qwen-Image-2512-GGUF
mistralai/Voxtral-Small-24B-2507
microsoft/trocr-small-handwritten
google/t5-v1_1-base
Wan-AI/Wan2.1-T2V-14B
google/tapas-large-finetuned-sqa
moonshotai/Kimi-Audio-7B-Instruct
microsoft/rad-dino
HiDream-ai/HiDream-I1-Fast
stabilityai/stable-diffusion-3.5-large
Manojb/stable-diffusion-2-1-base
microsoft/deberta-v3-xsmall
google/mobilenet_v2_1.0_224
google/siglip2-large-patch16-384
stabilityai/stable-video-diffusion-img2vid
microsoft/wavlm-base
nvidia/Cosmos-Predict2-2B-Video2World
google/medsiglip-448
google/owlv2-large-patch14-ensemble
FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers
nvidia/Alpamayo-R1-10B
Wan-AI/Wan2.1-T2V-14B-Diffusers
microsoft/swin-base-patch4-window7-224
google/muril-base-cased
microsoft/BiomedVLP-CXR-BERT-general
nvidia/mit-b2
CohereLabs/command-a-reasoning-08-2025
google/ddpm-cifar10-32
alibaba-pai/Wan2.2-Fun-Reward-LoRAs
nvidia/llama-embed-nemotron-8b
deepseek-ai/Janus-Pro-7B
Lightricks/LTX-Video-ICLoRA-detailer-13b-0.9.8
zai-org/CogVideoX-5b
magespace/Wan2.2-I2V-A14B-Lightning-Diffusers
nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
microsoft/Multilingual-MiniLM-L12-H384
microsoft/infoxlm-base
google/timesfm-2.0-500m-pytorch
microsoft/MiniLM-L12-H384-uncased
BAAI/llm-embedder
microsoft/trocr-small-printed
nvidia/parakeet-ctc-0.6b
nvidia/NV-Embed-v2
nvidia/GR00T-N1.6-3B
BAAI/bge-reranker-v2-gemma
nvidia/mit-b3
nvidia/segformer-b5-finetuned-cityscapes-1024-1024
microsoft/swin-tiny-patch4-window7-224
microsoft/rad-dino-maira-2
nvidia/diar_streaming_sortformer_4spk-v2
stabilityai/stable-virtual-camera
tiiuae/Falcon-OCR
city96/Wan2.1-T2V-14B-gguf
stabilityai/stable-audio-open-1.0
nvidia/segformer-b3-finetuned-ade-512-512
zai-org/CogVideoX-2b
bullerwins/Wan2.2-T2V-A14B-GGUF
BAAI/bge-large-zh
Wan-AI/Wan2.1-T2V-1.3B
tiiuae/Falcon-Perception
QuantStack/Wan2.2-TI2V-5B-GGUF
deepseek-ai/Janus-Pro-1B
IPostYellow/TurboWan2.1-T2V-1.3B-Diffusers
Abiray/LTX-2.3-22B-DISTILLED-1.1-GGUF
nvidia/GR00T-N1.7-3B
allenai/specter
allenai/MolmoPoint-Vid-4B
google/medasr
calcuis/wan-gguf
vrgamedevgirl84/Wan14BT2VFusioniX
nvidia/omni-embed-nemotron-3b
stabilityai/stable-fast-3d
alibaba-pai/Wan2.1-Fun-14B-Control
ByteDance/AnimateDiff-Lightning
QuantStack/Wan2.1_14B_VACE-GGUF
google/tipsv2-b14
stabilityai/stable-video-diffusion-img2vid-xt-1-1
wangfuyun/AnimateLCM
LiquidAI/LFM2.5-8B-A1B
BAAI/bge-code-v1
genmo/mochi-1-preview
moonshotai/MoonViT-SO-400M
meta-llama/Llama-Prompt-Guard-2-22M
jayn7/HunyuanVideo-1.5_T2V_720p-GGUF
allenai/specter2_aug2023refresh_base
google/tipsv2-l14
calcuis/wan2-gguf
allenai/Molmo2-VideoPoint-4B
Wan-AI/Wan2.2-TI2V-5B
BAAI/bge-reranker-v2-minicpm-layerwise
calcuis/wan-1.3b-gguf
cerspense/zeroscope_v2_576w
hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-720p_t2v
mistralai/Voxtral-4B-TTS-2603
stabilityai/stable-audio-open-small
Wan-AI/Wan2.2-T2V-A14B
QuantStack/Wan2.2-S2V-14B-GGUF
Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers
google/tipsv2-so400m14
BAAI/Emu3-VisionTokenizer
deepseek-ai/Janus-1.3B
guoyww/animatediff-motion-adapter-v1-5-2
alibaba-pai/Wan2.1-Fun-Reward-LoRAs
city96/HunyuanVideo-gguf
calcuis/ltxv0.9.6-gguf
Motif-Technologies/Motif-Video-2B
gdhe17/Self-Forcing
BAAI/bge-m3-unsupervised
BAAI/Emu3-Gen
BAAI/BGE-VL-large
BestWishYsh/Helios-Distilled
samuelchristlie/Wan2.1-T2V-1.3B-GGUF
allenai/aspire-biencoder-biomed-scib
wanabmeya/clip_vision_h.safetensors
nvidia/nemotron-ocr-v2
hotshotco/Hotshot-XL
QuantStack/Wan2.2-Fun-A14B-Control-GGUF
google/tipsv2-l14-dpt
jayn7/HunyuanVideo-1.5_T2V_480p-GGUF
BAAI/bge-en-icl
nvidia/Lyra-2.0
Lightricks/LTX-Video-0.9.7-distilled
QuantStack/Wan2.2-Fun-A14B-InP-GGUF
allenai/MolmoAct-7B-D-LIBERO-Goal-0812
tiiuae/Falcon-Perception-300M
Runware/Wan2.2-TI2V-5B
nvidia/Cosmos-1.0-Diffusion-7B-Text2World
google/tipsv2-g14
guoyww/animatediff-motion-lora-zoom-out
guoyww/animatediff-motion-lora-zoom-in
guoyww/animatediff-motion-lora-pan-right
guoyww/animatediff-motion-lora-pan-left
guoyww/animatediff-motion-adapter-v1-5-3
ali-vilab/i2vgen-xl
guoyww/animatediff-motion-lora-tilt-down
guoyww/animatediff-motion-lora-tilt-up
BAAI/Emu3-Stage1
tencent/HunyuanVideo-1.5
BAAI/EVA-CLIP-18B
BAAI/BGE-VL-base
BAAI/EVA-CLIP-8B
stabilityai/stable-point-aware-3d
BAAI/SegVol
nvidia/parakeet-unified-en-0.6b
google/tipsv2-b14-dpt
BAAI/Emu3.5
tiiuae/siglino-0.6B
BAAI/RoboBrain2.0-7B
google/tipsv2-g14-dpt
deepseek-ai/JanusFlow-1.3B
stabilityai/japanese-stable-clip-vit-l-16
tiiuae/siglino-70M
tiiuae/siglino-30M
BAAI/BGE-VL-MLLM-S2
nvidia/asset-harvester
BAAI/Emu3.5-Image
BAAI/bge-reasoner-embed-qwen3-8b-0923
google/tipsv2-so400m14-dpt
BAAI/bge-m3-retromae
BAAI/Video-XL-2
BAAI/BGE-VL-MLLM-S1
HuggingFaceH4/tiny-random-LlamaForSequenceClassification
BAAI/RoboBrain2.0-3B
moonshotai/Kimi-Audio-7B
BAAI/BGE-VL-v1.5-zs
BAAI/EVA-CLIP-8B-448
tiiuae/siglino-moe-0.15-0.6B
BAAI/BGE-VL-Screenshot
tiiuae/siglino-moe-0.3-0.6B
BAAI/BGE-VL-v1.5-mmeb
HuggingFaceH4/vsft-llava-1.5-7b-hf-trl
stabilityai/stable-diffusion-xl-refiner-0.9
stabilityai/stable-codec-speech-16k
HuggingFaceH4/Qwen2.5-Math-1.5B-Instruct-PRM-0.2
stabilityai/stable-video-diffusion-img2vid-xt-1-1-tensorrt
stabilityai/japanese-instructblip-alpha
stabilityai/japanese-stable-vlm
BAAI/RoboBrain2.0-32B
HuggingFaceH4/Qwen2.5-Math-7B-Instruct-PRM-0.2
HuggingFaceH4/tiny-random-LlamaForSeqClass
stabilityai/stable-codec-speech-16k-base
BAAI/RoboBrain-X0-Preview
stabilityai/sv3d
stabilityai/stable-zero123
Showing top 631 models. Use search above to find any of the 7,438+ models.
It depends on model size and quantization. The formula is: VRAM (GB) = parameters (in billions) × bytes_per_param + 1.5 GB overhead. Q4_K_M uses 0.5 bytes/param, Q8 uses 1.0, FP16 uses 2.0. A 7B model needs ~5 GB at Q4_K_M and ~16 GB at FP16. Use the search above to find exact estimates for any model.
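The formula above can be sketched in a few lines of Python. This is only a rule-of-thumb estimate for dense models; the function name estimate_vram_gb and the table layout are illustrative choices, while the byte counts and the 1.5 GB overhead are the figures quoted in the answer.

```python
# Bytes per parameter for each format, as quoted in the answer above.
BYTES_PER_PARAM = {"Q4_K_M": 0.5, "Q8": 1.0, "FP16": 2.0}

# Flat allowance for context and runtime buffers, from the same formula.
OVERHEAD_GB = 1.5


def estimate_vram_gb(params_billion: float, quant: str) -> float:
    """Rule-of-thumb VRAM estimate in GB (hypothetical helper name)."""
    return params_billion * BYTES_PER_PARAM[quant] + OVERHEAD_GB


if __name__ == "__main__":
    # Print the estimate for a 7B model at each quantization level.
    for quant in BYTES_PER_PARAM:
        print(f"7B at {quant}: {estimate_vram_gb(7, quant):.1f} GB")
```

For a 7B model this gives 5.0 GB at Q4_K_M and 15.5 GB at FP16, in line with the ~5 GB and ~16 GB figures above.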
The NVIDIA RTX 4090 (24 GB) is the best consumer GPU for local LLM inference: it fits 13B models at Q8, models up to about 34B at Q4_K_M, and 7B at FP16. The RTX 5090 (32 GB) extends Q8 to models of roughly 27B. For larger models like 70B, an Apple Silicon Mac Studio M4 Max (64–128 GB unified memory) is often more practical than a multi-GPU PC setup.
A single RTX 4090 (24 GB) is not enough for Llama 3 70B, which requires ~37 GB at Q4_K_M. You need a Mac Studio M4 Max with 64 GB+ or dual RTX 4090s (48 GB combined via llama.cpp split). A single RTX 5090 (32 GB) also falls short at Q4_K_M and fits only lower-quality quantizations. The 8B variant runs easily on a single RTX 4060 or Mac mini M4.
Quantization compresses model weights to fewer bits, reducing VRAM at a small quality cost. Q4_K_M (4-bit) cuts weight memory to about a quarter of FP16 with ~1–3% quality loss, which makes it the most popular format for consumer GPUs. Q8 (8-bit) is near-lossless and uses about half the memory of FP16. FP16 gives maximum quality but requires the most VRAM. Choosing the right quantization can mean the difference between a model fitting on your GPU or not.
Yes. Apple Silicon is excellent for local LLM inference. The Mac mini M4 (16 GB) handles 7B–13B models. The Mac mini M4 Pro (24–48 GB) covers 13B–34B models. The Mac Studio M4 Max (64–128 GB) can run 70B models, at Q8 quality on the 128 GB configuration. Tools like Ollama, LM Studio, and llama.cpp all support Apple Silicon via Metal.
Yes. The Intel Arc B580 (12 GB GDDR6) is one of the best value GPUs for LLMs in 2026. Its 12 GB of VRAM is enough for 13B models at Q4_K_M, and it works with llama.cpp via Vulkan/SYCL, LM Studio, and Jan.ai. It does not support CUDA, so it is slightly slower than NVIDIA on some tasks, but performance per dollar is outstanding for budget builds.
Yes — the RTX 3090 (24 GB VRAM) is still excellent value on the used market. It provides the same 24 GB VRAM as the RTX 4090 at roughly one-third the price, runs 32B models at Q4_K_M comfortably, and is about 20% slower per token. For users who prioritize maximum VRAM per dollar, the 3090 remains the best used-market pick for 24 GB VRAM in 2026.
Yes — the RTX 4060 Ti 16GB is one of the best value NVIDIA GPUs for local LLMs. 16 GB is enough for 13B models at Q8 (~14 GB) and 20B models at Q4_K_M (~12 GB). It is slower than the RTX 4090 but costs roughly one-third as much. See the full RTX 4060 Ti guide for benchmark speeds.
Yes. The RTX 4070 Ti Super 16GB runs 13B at Q8 and 20B at Q4_K_M with 672 GB/s of memory bandwidth, about 2.3x that of the RTX 4060 Ti 16GB (288 GB/s) at the same VRAM. It is one of the fastest 16 GB consumer GPUs and sits cleanly between the 4060 Ti and RTX 4090 in price and performance.
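The bandwidth comparison above is simple arithmetic, sketched below. The second function is an assumption that does not come from this page: a common heuristic says token generation is limited by how fast the weights can be read from memory, so bandwidth divided by model size gives a rough upper bound on tokens per second. Both function names are hypothetical, and real speeds will be lower than the bound.

```python
def bandwidth_ratio(fast_gb_s: float, slow_gb_s: float) -> float:
    """How many times more memory bandwidth one GPU has than another."""
    return fast_gb_s / slow_gb_s


def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on tokens/s.

    ASSUMPTION (not from the source): each generated token requires
    reading all model weights once, so speed <= bandwidth / model size.
    """
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    # Bandwidth figures quoted above for the two 16 GB cards.
    ratio = bandwidth_ratio(672, 288)
    print(f"RTX 4070 Ti Super vs RTX 4060 Ti 16GB: {ratio:.1f}x bandwidth")

    # Illustrative only: a 13B model at Q8 is about 13 GB of weights.
    for name, bw in (("4070 Ti Super", 672), ("4060 Ti 16GB", 288)):
        print(f"{name}: at most ~{decode_ceiling_tok_s(bw, 13.0):.0f} tok/s")
```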
Yes — but which version matters. The DeepSeek-R1-Distill models (7B, 8B, 14B, 32B, 70B) are standard dense models you can run on consumer hardware. The 7B distill needs ~6 GB at Q4 (RTX 4060 8GB works), the 14B needs ~9 GB at Q4 (Arc B580 12GB), and the 32B needs ~18 GB at Q4 (RTX 4090 24GB). The full DeepSeek R1 671B is a MoE model requiring server-class hardware. See the DeepSeek hardware guide for per-GPU compatibility.
Llama 4 uses Mixture of Experts (MoE) architecture, which means it requires more VRAM than the "17B" name suggests. Llama 4 Scout (17B-16E) has ~109B total parameters across 16 experts, all of which must be loaded. The Q4 GGUF files are ~58–62 GB, which realistically calls for a Mac Studio M4 Max 128GB; the 64GB Mac Studio is borderline, and dual RTX 4090s (48 GB combined) need part of the model offloaded to the CPU. For easier local inference, consider Llama 3.1 8B or Llama 3.3 70B instead.
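A short sketch makes the gap between the "17B" label and the real footprint concrete. It reuses the rule of thumb from elsewhere on this page (0.5 bytes per parameter at Q4 plus 1.5 GB overhead); the helper name q4_estimate_gb is hypothetical, and the result is a rough estimate that lands slightly below the actual GGUF file sizes quoted above.

```python
def q4_estimate_gb(params_billion: float,
                   bytes_per_param: float = 0.5,
                   overhead_gb: float = 1.5) -> float:
    """Rule-of-thumb Q4 memory estimate in GB (hypothetical helper)."""
    return params_billion * bytes_per_param + overhead_gb


# Figures for Llama 4 Scout as given in the answer above.
ACTIVE_PARAMS_B = 17    # parameters used per token
TOTAL_PARAMS_B = 109    # parameters across all 16 experts

# What the "17B" name might suggest versus what must actually be loaded.
naive_gb = q4_estimate_gb(ACTIVE_PARAMS_B)
loaded_gb = q4_estimate_gb(TOTAL_PARAMS_B)

if __name__ == "__main__":
    print(f"Estimate from active parameters only: {naive_gb:.1f} GB")
    print(f"Estimate from total parameters:       {loaded_gb:.1f} GB")
```

The active-parameter count mainly affects speed per token; memory is set by the total, because every expert has to be resident.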
Qwen3 models are very accessible. Qwen3 4B at Q4_K_M needs ~3 GB (any 8 GB GPU); Qwen3 8B needs ~5 GB at Q4 (RTX 4060); Qwen3 14B needs ~9 GB at Q4 (12 GB GPU) or 15 GB at Q8 (16 GB GPU); Qwen3 32B needs ~18 GB at Q4 (RTX 4090 or Mac mini M4 Pro 48 GB). All Qwen3 models support thinking mode for chain-of-thought reasoning. Install via Ollama: ollama run qwen3:8b.
It depends on your workflow. Ollama is better for developers: it runs as a background API service, works in Docker, and integrates with code. LM Studio is better for beginners: it has a polished GUI, a model browser, and a built-in chat interface. Both use llama.cpp under the hood, so speed is similar on the same hardware. Many users install both. See the full comparison guide.
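To illustrate what "integrates with code" can look like, here is a minimal sketch that calls a locally running Ollama server over HTTP using only the Python standard library. It assumes Ollama is listening on its default address (localhost, port 11434) and that the model has already been pulled; the function names build_generate_request and generate are hypothetical helpers, not part of Ollama.

```python
import json
import urllib.request

# Default local address of the Ollama server (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text.

    Requires a running Ollama server; raises URLError otherwise.
    """
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running, a call such as generate("qwen3:8b", "Explain quantization in one sentence.") returns the model's reply as a string.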
Yes, if you want the best single-GPU performance for local AI. The RTX 4090 24GB runs Qwen3 32B at Q4_K_M (~20 GB) at 35-48 tok/s and delivers up to 110 tok/s on 8B models thanks to 1008 GB/s bandwidth. It costs more than the RTX 4080, but the extra 8 GB of VRAM is a meaningful upgrade for running 30-34B models. The main limit: 70B models still do not fit at Q4_K_M.
Gemma 3 model size determines VRAM needs. Gemma 3 4B fits on any 8 GB GPU such as the RTX 4060, at ~3 GB for Q4_K_M. Gemma 3 12B needs ~8 GB at Q4, so a 12 GB card such as the Arc B580 is the comfortable choice. Gemma 3 27B, the flagship, needs ~16 GB at Q4_K_M: it is a tight fit on the RTX 4060 Ti 16GB or RTX 4080, and the RTX 4090 leaves room for context. On Mac, the Mac mini M4 24GB handles Gemma 3 12B comfortably.
Llama 3.3 70B is Meta's best open-source model and requires ~43 GB VRAM at Q4_K_M. The easiest option is the Mac Studio M4 Max 64GB, which runs it at 14-20 tok/s. The RTX 5090 32GB fits only the lower-quality Q2_K quantization (~26 GB). Two RTX 4090s (48 GB combined) can run it via llama.cpp tensor splitting. Note: there is no 7B or 13B Llama 3.3 — only the 70B was released. For 24 GB GPUs, Qwen3 32B at Q4_K_M is a strong alternative.
Phi-4 (14B) needs ~9 GB VRAM at Q4_K_M, so any 12 GB GPU like the Intel Arc B580 or RTX 4070 handles it easily. Phi-4-mini (3.8B) needs only 2.5 GB and runs on any hardware including CPU. Despite being only 14B parameters, Phi-4 scores near 70B models on many benchmarks, making it an excellent VRAM-efficient choice for 12 GB GPUs.
The cheapest single device that comfortably runs Llama 3.3 70B at Q4_K_M is the Mac mini M4 Pro with 48 GB unified memory. It runs 70B at 10-14 tok/s — acceptable for personal use. The Mac Studio M4 Max 64GB is faster at 14-20 tok/s. Dual RTX 4090s give higher token speed but cost considerably more for the GPUs alone, making the Mac mini M4 Pro the budget pick for 70B inference.
Yes — your gaming GPU is exactly what local AI runs on. The GPU VRAM (not system RAM) determines which models you can run. An RTX 4060 8GB runs Qwen3 7B Q8 at 35 tok/s. An RTX 4080 16GB runs Qwen3 14B Q8 at 30 tok/s. An RTX 4090 24GB runs Qwen3 32B Q4 at 35-48 tok/s. Just install Ollama (free) and run ollama run qwen3:8b — it takes under 5 minutes.
Local LLMs are 100% private — your data never leaves your computer. Tools like Ollama and LM Studio run the model entirely on your hardware with no network connection required after the initial model download. Compare this to ChatGPT or Claude, which process your data on their servers. Local AI is ideal for confidential documents, medical notes, or any sensitive work.
8 GB VRAM is enough for useful AI. The best options: Qwen3 7B Q8_0 (7.2 GB, 35 tok/s) is the top pick for quality. Qwen3 7B Q4_K_M (4.5 GB, 50 tok/s) is faster. Phi-4 14B Q4_K_M needs ~8.5 GB, so it does not fit entirely in 8 GB and requires offloading some layers to the CPU. Gemma 2 9B Q4 (5.5 GB) and Mistral 7B Q8 (7.2 GB) are solid alternatives. The RTX 4060 and the 8 GB RTX 3060 variant have the same VRAM; the 4060 is about 20% faster.
One command installs Ollama: curl -fsSL https://ollama.com/install.sh | sh. After that, run any model with ollama run qwen3:8b. NVIDIA GPUs work automatically after driver install. AMD GPUs need ROCm: sudo apt install rocm-hip-sdk then add yourself to the render and video groups and reboot. Intel Arc uses the i915 kernel driver with OpenCL. Ubuntu 22.04/24.04 has the best support; Fedora and Arch also work well.
M4 Mac Mini 24 GB is the sweet spot: it runs all 7-14B models and is the best value entry point. M4 Pro 24 GB is 2.3x faster due to 273 vs 120 GB/s bandwidth, which is worth the premium for daily use. M4 Pro 48 GB is the smallest Mac Mini configuration that fits Llama 3.3 70B. Skip the M4 16 GB: after the OS takes 4-5 GB, only 11-12 GB remains for models.
For 7-14B models, buy the RTX 5080: 16 GB GDDR7 runs identical models to the 4090 at similar speed for less money. For 32B models (Qwen3 32B at Q4_K_M is ~18 GB), only the RTX 4090 24 GB fits. Neither runs 70B at Q4. Buy the 5080 for everyday use; buy the 4090 only if 32B inference is a specific requirement.
After the one-time hardware cost, running LLMs locally is free — no subscription, no API fees, no per-token charges. A mid-range setup (such as an RTX 4070) plus the free Ollama software lets you run Qwen3 14B indefinitely. Electricity cost is minimal: roughly a few cents per hour of inference on a mid-range GPU.
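The electricity estimate can be checked with one multiplication. The wattage and price below are hypothetical placeholders, not measurements from this page; substitute your GPU's actual draw under load and your local tariff.

```python
def inference_cost_per_hour(gpu_watts: float, price_per_kwh: float) -> float:
    """Electricity cost of one hour of inference, in the tariff's currency."""
    kilowatts = gpu_watts / 1000.0
    return kilowatts * price_per_kwh


if __name__ == "__main__":
    # HYPOTHETICAL inputs: a mid-range GPU drawing 200 W under load,
    # billed at 0.15 per kWh.
    cost = inference_cost_per_hour(gpu_watts=200, price_per_kwh=0.15)
    print(f"About {cost * 100:.0f} cents per hour")
```

Under those assumed inputs the result is about 3 cents per hour, consistent with the "few cents" range described above.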
Yes, but with limitations. Laptops with discrete GPUs (RTX 4060 mobile, RTX 4070 mobile) run 7-8B models well at 20-30 tok/s. Integrated graphics (Intel Iris, AMD Radeon integrated) can run 7B models at 3-8 tok/s via CPU. Apple MacBook Pro with M4 is excellent: the M4 Pro 24GB MacBook Pro runs 14B models at 40+ tok/s. Gaming laptops with 8-16 GB VRAM are solid LLM machines.
Ollama is the most popular choice: one command installs it, another runs a model. It supports every major model and works on Windows, Mac, and Linux. LM Studio is better if you prefer a graphical interface with no terminal. Open WebUI adds a ChatGPT-like browser interface on top of Ollama. All three are free to use; Ollama is open source, while LM Studio is a closed-source app.
Check your GPU's VRAM: 6-8 GB runs 7B models, 12 GB runs 14B models, 16 GB runs up to about 20B models, and 24 GB runs 32B models (all at Q4_K_M). If you have an NVIDIA RTX 3000, 4000, or 5000 series GPU, an AMD RX 6000/7000/9000 series GPU, or Apple Silicon, you're good. Use the VRAM Calculator on this site to find exactly which models fit your specific GPU.
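The check described above can be approximated in code. This sketch applies the page's rule of thumb for Q4_K_M (0.5 bytes per parameter plus 1.5 GB overhead) to a handful of common model sizes; the function name sizes_that_fit and the default size list are illustrative, and the VRAM Calculator remains the better source for exact figures.

```python
BYTES_PER_PARAM_Q4 = 0.5   # Q4_K_M, per the formula used on this page
OVERHEAD_GB = 1.5          # flat allowance from the same formula


def sizes_that_fit(vram_gb: float,
                   sizes_billion: tuple = (7, 14, 32, 70)) -> list:
    """Return the model sizes (in billions) whose Q4_K_M estimate fits."""
    return [
        size for size in sizes_billion
        if size * BYTES_PER_PARAM_Q4 + OVERHEAD_GB <= vram_gb
    ]


if __name__ == "__main__":
    # Common consumer VRAM capacities.
    for vram in (8, 12, 16, 24):
        fits = ", ".join(f"{s}B" for s in sizes_that_fit(vram))
        print(f"{vram} GB: {fits}")
```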
Want exact VRAM estimates? Use the VRAM Calculator or search any model above.