Intent Extraction on On-Device Models: How Task Decomposition Increases AI Pipeline Accuracy

Intent extraction is the technical process of deciphering a user’s underlying goal from behavioral data—like scrolls, clicks, and screen transitions—before they explicitly type a query. By decomposing complex user sessions into smaller, factual summaries, engineers can deploy efficient on-device AI models that match the accuracy of massive cloud LLMs while significantly reducing cost per run and latency.

The “Why” Behind the Click

Back in 2009, my first real job was at a boutique IT consulting firm as a subcontractor for a Fortune 100 retail giant. I spent weeks staring at Apache server logs, parsing terabytes of data to figure out why the client’s checkout conversion rate had tanked. We had the what—timestamp, IP address, URL path—but we were blind to the why.

I remember sitting in a windowless server room, trying to correlate session timestamps with customer support tickets to guess if a user was confused or just browsing. It was brute-force analytics. We were basically looking at digital footprints in the mud and trying to guess if the person was running a marathon or fleeing a bear.

Fast forward to today, and the problem hasn’t changed, but the tools have. We aren’t just parsing logs; we are feeding screen interactions into multimodal LLMs. However, a recent paper from Google Research on intent extraction validated something I have suspected for years: throwing a massive model at a massive dataset is rarely the most efficient way to solve a problem. Sometimes, breaking things down—just like I had to do with those messy logs back in the day—is the only way to get a clear answer.

Google’s Shift: From Cloud Monoliths to On-Device Agents

Google recently presented research at EMNLP 2025 that points toward a "post-query" future. The premise is simple: reliable intent extraction shouldn't require a massive query sent to a cloud server. Instead, it can happen locally on your phone using small models, provided you structure the data correctly.

The researchers found that by breaking "intent understanding" into smaller, discrete steps, small multimodal models (like Gemini Nano or Flash) could match the performance of heavyweights like Gemini 1.5 Pro. This is a big deal for anyone building a RAG pipeline or an agentic workflow. It means we can stop burning cash on massive context windows and start architecting smarter pipelines.

The Decomposition Strategy

The core innovation here is decomposition. When you ask a generic AI model to look at a 5-minute user session and "tell me what they want," it tends to hallucinate. The context window gets noisy. Google’s solution splits the job into two specific mechanical steps:

  1. Step One (Screen Summarization): The model looks at a single screen interaction. It records exactly what is visible and what the user did (e.g., "User scrolled past the shoe ad and clicked the size chart"). Crucially, it makes a tentative guess about why.
  2. Step Two (Fact Aggregation): A second small model reviews the factual parts of those summaries. It explicitly ignores the previous guesses and synthesizes a final "intent statement" based only on the confirmed actions.

By stripping out the speculative guesses before the final synthesis, the system avoids the "cascade of errors" we often see in large agentic chains.
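The two-step split above can be sketched in plain Python. The `summarize_screen` template stands in for the small on-device model call, and `aggregate_intent` for the second-pass synthesis; the `ScreenSummary` shape and field names are illustrative, not Google's actual interface.

```python
from dataclasses import dataclass

@dataclass
class ScreenSummary:
    facts: str       # what verifiably happened on screen
    hypothesis: str  # the model's tentative "why" guess (discarded later)

def summarize_screen(event: dict) -> ScreenSummary:
    """Step one: turn a raw interaction event into a factual sentence
    plus a speculative guess. A real system would prompt a small
    on-device model here; this template is a stand-in."""
    facts = f"User {event['action']} the {event['target']} on the {event['screen']} screen."
    hypothesis = f"Possibly interested in {event['target']}."
    return ScreenSummary(facts=facts, hypothesis=hypothesis)

def aggregate_intent(summaries: list) -> str:
    """Step two: synthesize a final intent statement from the confirmed
    facts only, explicitly dropping the per-screen guesses."""
    confirmed = " ".join(s.facts for s in summaries)
    # A second small-model call would go here; we simply join the facts.
    return f"Session facts: {confirmed}"

session = [
    {"action": "scrolled past", "target": "shoe ad", "screen": "home"},
    {"action": "clicked", "target": "size chart", "screen": "product"},
]
print(aggregate_intent([summarize_screen(e) for e in session]))
```

Note that the hypotheses never reach `aggregate_intent` at all; the filtering is structural, not prompt-based.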

Why On-Device Decomposition Beats Cloud Inference

For engineers running data platforms or building tools like ours at SocketStore, the implications of moving this logic on-device are practical, not just theoretical. Here is how the decomposition approach compares to the traditional monolithic cloud approach:

| Feature | Monolithic Cloud Model | Decomposed On-Device Model |
| --- | --- | --- |
| Latency | High (round trip to API) | Low (local processing) |
| Cost per run | High ($0.01–$0.05 per session) | Near zero (compute is on the user's device) |
| Privacy risk | High (data leaves the device) | Low (data stays local) |
| Hallucination rate | Moderate (confused by noise) | Low (filtered via decomposition) |

Building the Pipeline: RAG and Embeddings

If you are building a content-factory or an analytics tool, you can apply these principles immediately. You don't need Google's proprietary tech to use decomposition. I have seen teams implement similar logic using open-source tools.

1. Ingesting the Signals

First, you need a way to capture the raw behavioral data. This is where tools like SocketStore come in handy. Our platform unifies data streams from social/web interactions into a standardized JSON format. Instead of building custom scrapers for every touchpoint, you pipe the raw event stream into your processing layer.
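A minimal sketch of that normalization step might look like this. The field names and schema here are illustrative, not SocketStore's actual format; the point is simply that every touchpoint is mapped onto one shared shape before the processing layer sees it.

```python
import json
from datetime import datetime, timezone

def normalize_event(raw: dict, source: str) -> dict:
    """Map a source-specific raw event onto one shared schema so the
    processing layer sees a single shape regardless of touchpoint.
    Field names are illustrative, not a real SocketStore schema."""
    return {
        "source": source,
        "user_id": raw.get("uid") or raw.get("user"),
        "timestamp": raw.get("ts") or datetime.now(timezone.utc).isoformat(),
        "action": raw.get("event_type", "unknown"),
        "target": raw.get("element") or raw.get("url", ""),
    }

web_click = {"uid": "u42", "ts": "2025-11-01T12:00:00Z",
             "event_type": "click", "element": "size-chart"}
print(json.dumps(normalize_event(web_click, source="web")))
```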

2. The Vector Store (Postgres)

Once you have the decomposed "facts" from step one (the screen summaries), you shouldn't just discard them. Store these factual summaries as an embedding in a vector database. I personally prefer using Postgres with the pgvector extension. It keeps your relational data (user ID, session time) sitting right next to your vector data.

When you run your RAG pipeline later, you aren't retrieving a vague "user session." You are retrieving specific, fact-checked interaction summaries. This makes the retrieval context much cleaner for the LLM.
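As a sketch of the storage side, here is a schema that keeps the relational columns next to the embedding, plus a helper that serializes a Python list into pgvector's `[1,2,3]` literal form. It assumes the pgvector extension is installed; the embedding dimension (384) and the `embed()` call in the comment are placeholders for whatever model you use.

```python
# Schema for storing factual screen summaries next to relational
# session data. Assumes the pgvector extension is available.
SCHEMA = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS screen_summaries (
    id         BIGSERIAL PRIMARY KEY,
    user_id    TEXT NOT NULL,
    session_ts TIMESTAMPTZ NOT NULL,
    summary    TEXT NOT NULL,
    embedding  vector(384)  -- dimension depends on your embedding model
);
"""

def to_pgvector(values: list) -> str:
    """Serialize a Python list into pgvector's '[1,2,3]' literal form."""
    return "[" + ",".join(str(v) for v in values) + "]"

# With a Postgres driver you would pass the literal as a parameter, e.g.:
# cur.execute(
#     "INSERT INTO screen_summaries (user_id, session_ts, summary, embedding) "
#     "VALUES (%s, %s, %s, %s::vector)",
#     ("u42", "2025-11-01T12:00:00Z", summary, to_pgvector(embed(summary))),
# )
print(to_pgvector([0.1, 0.2, 0.3]))
```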

3. Observability Evals with Bi-Fact

One of the smartest parts of the Google research was their metric, "Bi-Fact." In my experience, evaluating intent is notoriously difficult. Usually, we just vibe-check the output: "Does this look right?" That is not engineering; that is guessing.

Bi-Fact uses an F1 score to measure two specific things:

  • Missing Facts: Did the model miss a critical click?
  • Hallucinations: Did the model invent a user action that never happened?

If you are running observability evals on your AI features, stop checking for "tone" and start checking for factual retention. If your model says the user "wanted red shoes" but the user never clicked a color filter, your pipeline is broken.
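You can approximate this kind of check yourself. The function below is a simplified, set-based version of a Bi-Fact-style score, not the paper's exact metric: precision penalizes hallucinated facts, recall penalizes missing ones, and the F1 combines both.

```python
def fact_f1(predicted: set, reference: set) -> float:
    """Set-based approximation of a Bi-Fact-style score.
    Precision penalizes hallucinated facts (predicted but not real);
    recall penalizes missing facts (real but not predicted)."""
    if not predicted or not reference:
        return 0.0
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

reference = {"clicked size chart", "scrolled past shoe ad"}
predicted = {"clicked size chart", "clicked red color filter"}  # one hallucination
print(round(fact_f1(predicted, reference), 2))
```

With one correct fact, one hallucination, and one miss, both precision and recall are 0.5, so the score is 0.5, and the invented "red color filter" click is exactly the kind of error the metric surfaces.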

User Journey Optimization & SEO

This research signals a massive shift in how we approach user journey optimization. For the last decade, SEO has been about matching keywords. If a user types "best fishing rod for bass," we serve a page with those words.

But in a world where intent is extracted before the query, keywords matter less. The device (or the browser agent) will infer that I am looking for a fishing rod because I spent 3 minutes scrolling through a lake map and checked a weather app for rain. It will prompt me with a suggestion before I type.

To rank in this environment, your content needs to align with logical user behaviors, not just text strings. You need to structure your site or app so that the "facts" of the user interaction (clicks, scrolls) map clearly to a solution. If your user journey is messy, the on-device model won't be able to summarize the intent, and you will lose visibility.

Practical Implementation: A SocketStore Perspective

At SocketStore, we handle millions of data points for clients who need 99.9% uptime on their analytics. When we look at integrating AI into our own workflows, cost is the killer. Running a GPT-4 class model on every data packet would bankrupt us.

That is why decomposition is the only viable path forward for SaaS scaling. By breaking tasks down, we can use smaller, cheaper models for 90% of the work and save the heavy lifting for the final synthesis.

If you are an engineer looking to implement this:

  • Don't feed raw HTML or JSON logs into a large context window.
  • Do write a small script (Python/Node) to parse interactions into natural language sentences first.
  • Do follow the SocketStore API approach: standardize your inputs before they hit the model. Garbage in, hallucination out.
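The middle recommendation is the one teams skip most often, so here is a minimal sketch of it: rendering raw log records as plain-English sentences before any model sees them. The event types and field names are illustrative; adapt the templates to your own logs.

```python
def event_to_sentence(event: dict) -> str:
    """Render one raw log event as a plain-English sentence before it
    hits the model. Field names are illustrative; adapt to your logs."""
    t = event.get("type")
    if t == "click":
        return f"User clicked '{event['label']}' on {event['page']}."
    if t == "scroll":
        return f"User scrolled {event['depth_pct']}% down {event['page']}."
    return f"User performed {t} on {event.get('page', 'an unknown page')}."

log = [
    {"type": "scroll", "page": "/lures", "depth_pct": 80},
    {"type": "click", "page": "/lures", "label": "Bass Special"},
]
for e in log:
    print(event_to_sentence(e))
```

A few template lines like these typically shrink the prompt dramatically compared to shipping raw JSON into the context window, and they give the model clean, unambiguous facts to work from.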

Who Needs This Architecture?

The Soft Sell

If you are manually cobbling together scrapers to feed your data pipeline, you are likely spending more time fixing broken APIs than building your actual product. SocketStore provides the unified social and web data layer you need to feed these advanced AI models. Whether you are building a sophisticated intent extraction engine or just need clean metrics for a dashboard, our API ensures you get structured, reliable data with a guaranteed uptime. Check out our documentation to see how easy it is to integrate.

For those interested in the economics of data, view our pricing to see how we compare to building it in-house.

Frequently Asked Questions

What is the main advantage of intent extraction on-device?

The primary advantage is privacy and latency. By processing user behavior locally on the device, sensitive data regarding clicks and scrolls never hits the cloud. Additionally, it eliminates network lag, allowing for real-time UI adaptations based on user intent.

How does task decomposition reduce AI hallucinations?

Decomposition forces the AI to separate "observation" from "inference." By having one step focused solely on recording facts (what happened on screen) and a second step focused on synthesis, the model is less likely to conflate a guess with a fact. It effectively filters out noise before the final conclusion is drawn.

Can small models really match Gemini 1.5 Pro?

In specific, bounded tasks—yes. While a small 8B-parameter model generally lacks the broad knowledge of a massive model, when the task is decomposed into simple steps (e.g., "summarize this screen"), the small model performs just as well as the large one, often with higher consistency because the scope is narrower.

What is the Bi-Fact metric?

Bi-Fact is an evaluation methodology that measures accuracy by comparing the generated summary against a ground-truth set of facts. It specifically calculates an F1 score based on precision (how many generated facts were true) and recall (how many true facts were captured), penalizing the model heavily for inventing actions.

How does this impact RAG pipelines?

For Retrieval-Augmented Generation (RAG), this approach suggests that we should index "summarized intents" rather than raw logs. By storing clean, decomposed summaries in your vector database (like Postgres), your retrieval system can find relevant user contexts much more accurately than if it were searching through raw, noisy clickstream data.

Do I need a Google Pixel phone to use this?

Currently, Google is researching this for their ecosystem, but the architecture of decomposition is platform-agnostic. You can implement similar logic in your own web apps using local browser-based models (like WebLLM) or server-side small language models.

What is the cost benefit of this approach?

It significantly lowers the cost per run. Instead of sending a massive prompt with thousands of tokens of history to a paid API (like GPT-4), you run small, optimized inferences. If done on-device, the marginal cost of compute is effectively zero for the service provider.