DiffusionGemma at 1,000 tokens/sec on H100, Gemini business tools, and new agent consoles

By Harsh Desai

TL;DR

Google and NVIDIA pushed faster local text generation while new agent tools and video models appeared on Replicate and Product Hunt.

What shipped

On 10 June, several vendors released production-ready AI components. NVIDIA optimized a new Google model for local GPUs and extended confidential computing to Apple workloads. Builders also gained fresh agent consoles and a reasoning video model.

Vendor launches

NVIDIA led with two technical releases around local inference and confidential compute while Google added business features to Gemini and expanded Chrome AI capabilities. The items focus on measurable speed gains and enterprise privacy rather than awareness campaigns.

  • NVIDIA robotaxi safety: NVIDIA published guidance on embedding safety directly into robotaxi operating systems instead of adding it later, aimed at fleets already operating in multiple cities.
  • NVIDIA DiffusionGemma optimization: NVIDIA tuned Google DeepMind's DiffusionGemma to run on RTX GPUs and DGX systems, delivering parallel text blocks instead of token-by-token output for lower latency on single-user tasks.
  • Gemini Chrome expansion: Google rolled Gemini AI features into Chrome for users across Latin America, Africa, and the Middle East, extending the same capabilities already available in other regions.
  • Gemini business tools: Google added new Gemini app features that let entrepreneurs draft content, analyze data, and automate routine tasks without switching between multiple apps.
  • NVIDIA Apple Private Cloud Compute: NVIDIA Confidential Computing now powers server-side inference for Apple Foundation Models on Google Cloud, extending Apple Private Cloud Compute beyond Apple data centers.

Hugging Face trending

Two Gemma variants and a new audio model from Google appeared in the trending list, alongside a research paper on multimodal training choices. The models support direct download and fine-tuning through the Hub.

  • diffusiongemma-26B-A4B-it: Google released diffusiongemma-26B-A4B-it, an image-text-to-text model that generates text blocks in parallel and is available for inference on Hugging Face.
  • Huihui-gemma-4-12B-it-abliterated: Huihui-ai published an abliterated Gemma variant that supports any-to-any tasks and runs via the transformers library on the Hub.
  • magenta-realtime-2: Google placed magenta-realtime-2 on the Hub, a text-to-audio model built with its Magenta library for real-time audio generation.
  • Multimodal learning phase diagram: A new paper examines when cross-modal alignment or prediction works best, giving practitioners a decision framework for multimodal projects.

Replicate new models

ray-3.2: Luma released ray-3.2 on Replicate, a video model that creates 5- or 10-second cinematic clips from text or images with native HDR and EXR output.

Product Hunt picks

Three agent-focused tools launched, covering local trust consoles, serverless agent hosting, and live speech translation.

  • Timmy-TUI: Timmy-TUI offers a local-first console for managing agent trust and workspaces without sending data to remote servers.
  • AGNT.Hub: AGNT.Hub lets users run always-on agents without provisioning or maintaining servers.
  • Gemini 3.5 Live Translate: Google introduced Gemini 3.5 Live Translate, an audio model for real-time speech-to-speech translation.

Industry news

Spending data and model releases highlighted the cost of staying competitive while OpenAI signaled a possible 2027 IPO timeline. DiffusionGemma received independent coverage for its speed-quality trade-off.

  • AI spending levels: Ramp data shows the heaviest AI users now spend about $7,500 per employee each month on tools and infrastructure.
  • DiffusionGemma speed: Google DiffusionGemma reaches roughly 1,000 tokens per second on one H100, four times faster than standard autoregressive models, at the expense of output quality.
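
To put the speed figure in wall-clock terms, here is a rough back-of-the-envelope sketch. It assumes the "four times faster" comparison is made on the same hardware, which implies an autoregressive baseline of about 250 tokens per second; that baseline is derived from the two reported numbers, not measured. The 500-token reply length is a hypothetical value picked for illustration, and the calculation ignores prompt processing and time to first token.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady decode rate (ignores prompt processing)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


REPORTED_RATE = 1000.0   # roughly 1,000 tokens/sec on one H100, as reported
REPORTED_SPEEDUP = 4.0   # "four times faster", as reported
# Derived from the two figures above; an assumption, not a measurement.
IMPLIED_BASELINE = REPORTED_RATE / REPORTED_SPEEDUP

REPLY_TOKENS = 500       # hypothetical reply length for illustration

diffusion_s = generation_seconds(REPLY_TOKENS, REPORTED_RATE)
baseline_s = generation_seconds(REPLY_TOKENS, IMPLIED_BASELINE)
print(f"diffusion: {diffusion_s:.2f}s, autoregressive baseline: {baseline_s:.2f}s")
# prints "diffusion: 0.50s, autoregressive baseline: 2.00s"
```

For a single short reply the gap is about a second and a half, so whether the quality trade-off is worth it depends on how many replies a workload generates.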

Other

LangChain and Databricks released infrastructure updates for agent traces and custom model serving while HeyGen published a practical video tutorial.

  • SmithDB full-text search: LangChain added inverted-index search over agent traces stored in object storage, achieving 400 ms median latency on nested JSON.
  • LangChain headless agents: LangChain introduced client-side tool execution so agents can access browser APIs and local state without server round-trips.
  • HeyGen AI video tutorial: HeyGen released a step-by-step guide for producing AI videos in 2026 using current tools and workflows.
  • Databricks adaptive serving: Databricks launched a serving platform that automatically adjusts inference settings to match custom model requirements.
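
For readers unfamiliar with the term, the general idea behind inverted-index search over nested JSON can be sketched in a few lines. This is a toy in-memory illustration of the technique only, not SmithDB's implementation; the trace shapes, ids, and function names are all hypothetical.

```python
import re
from collections import defaultdict


def iter_strings(value):
    """Yield every string value found anywhere inside a nested JSON-like object."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for child in value.values():
            yield from iter_strings(child)
    elif isinstance(value, list):
        for child in value:
            yield from iter_strings(child)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(traces):
    """Map each token to the set of trace ids whose string values contain it."""
    index = defaultdict(set)
    for trace_id, trace in traces.items():
        for text in iter_strings(trace):
            for token in tokenize(text):
                index[token].add(trace_id)
    return index


def search(index, query):
    """Return ids of traces that contain every token in the query (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical agent traces with strings nested at different depths.
traces = {
    "t1": {"steps": [{"tool": "search", "output": "timeout contacting weather API"}]},
    "t2": {"steps": [{"tool": "calculator", "output": "result 42"}],
           "error": {"message": "timeout"}},
}
index = build_index(traces)
print(sorted(search(index, "timeout")))          # prints ['t1', 't2']
print(sorted(search(index, "weather timeout")))  # prints ['t1']
```

The lookup cost depends on the number of query tokens and matching traces rather than on scanning every document, which is why this structure suits full-text search over large trace stores.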

What this means for you

For Vibe Builders: You can now test DiffusionGemma locally on RTX hardware for fast parallel text output, use AGNT.Hub to run always-on agents without servers, and use Timmy-TUI to manage agent trust and workspaces locally. Gemini business features let you automate drafting and data tasks inside one app. The new Replicate video model gives quick access to cinematic clips with HDR export.

For Non-techies: Gemini tools added in the app and Chrome now handle everyday business writing and translation for users in more regions. Live Translate and the HeyGen tutorial make it simpler to create or localize short videos without hiring specialists. Spending reports show many companies are already paying thousands per person for these capabilities.

For Developers: NVIDIA optimizations and Apple Private Cloud Compute integration give concrete benchmarks for confidential inference on H100 and RTX hardware. LangChain updates for client-side execution and SmithDB search let you move agent state and traces closer to the user. Benchmark the 1,000 tokens-per-second DiffusionGemma claim against your current autoregressive stack before adopting it.

What to watch next

Track Hugging Face downloads for diffusiongemma-26B-A4B-it and any follow-up quality fixes from Google. Monitor Replicate run counts for ray-3.2 and new Product Hunt agent consoles for adoption signals. Check next earnings calls for updated AI infrastructure spend figures.

Harsh's take

The day split between headline speed claims and quiet infrastructure releases. DiffusionGemma trades quality for tokens per second while most new agent tools still require manual trust and sandbox setup. Heavy monthly spend numbers from Ramp suggest the real constraint is integration cost rather than model access. Builders should run a single local DiffusionGemma workload on their own RTX card this week and measure end-to-end latency against their current stack before adding another hosted service.
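
One way to run that comparison is a small timing harness like the sketch below. It assumes each backend can be wrapped in a callable that takes a prompt and returns the number of tokens it produced; `measure` and `fake_backend` are hypothetical names, and the stub only sleeps so the example runs without a GPU or model. Swap in your own wrappers for the local DiffusionGemma call and your current stack.

```python
import statistics
import time


def measure(generate, prompt, runs=5):
    """Time generate(prompt) end to end; report median latency and throughput.

    `generate` is a hypothetical callable returning the number of tokens produced.
    """
    latencies = []
    token_counts = []
    for _ in range(runs):
        start = time.perf_counter()
        produced = generate(prompt)
        latencies.append(time.perf_counter() - start)
        token_counts.append(produced)
    return {
        "median_latency_s": statistics.median(latencies),
        "tokens_per_second": sum(token_counts) / sum(latencies),
    }


def fake_backend(prompt):
    """Stand-in for a real model call: sleeps briefly and claims 100 tokens."""
    time.sleep(0.01)
    return 100


report = measure(fake_backend, "Summarize this changelog.", runs=3)
print(report)
```

Run the same prompts through both backends and compare the two reports; the median is used for latency so a single slow run, such as a cold start, does not dominate the result.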
