Project Yaazh

Can an AI live in the world on your behalf, and know the one moment to ask?

Blankline Research||ai capabilities

A report from the Blankline team. Yaazh (யாழ்) is the ancient Tamil harp, an instrument that makes no sound until a hand touches its strings.

"The 8 o'clock is gone. I have 9:15."

The restaurant host said it the way hosts always do, half an apology. On the other end of the line was not a person. It was Dropstone, calling to book a table for someone who was, at that moment, asleep.

A weaker system does one of two things here. It guesses and takes the 9:15, or it gives up and reports failure. Dropstone did neither. It said it would like to hold that option, ended the call without confirming anything, and sent its owner one short message: 9:15 instead of 8:00, want me to take it? When the yes came back, hours later, it called the restaurant a second time and closed the booking.

Recording: the call to the restaurant, the hold, and the callback.

We want to draw attention to the boring-looking middle of that story, because we think it is the most important thing we have built. The agent reached a decision it was not authorized to make, and instead of making it, it held the world open and waited for the person who owned the choice.

This post is about an internal experiment we ran, called Project Yaazh, and about why we believe the ability to stop is now more important than the ability to act.

The shift nobody shipped

For the last two years, AI agents have shared a quiet limitation. They wait for you. You open them, you hand them a task, they act, and then they stop existing the moment you look away. They are tools you summon, not someone who handles things for you.

The capabilities to change this already exist, scattered across the field. One lab shipped the voice: real-time models that can hold a phone conversation. Another, in a marketplace experiment, showed that agents can negotiate on a person's behalf, closing 186 deals on their own. A third shipped the call: an assistant that can dial a business and ask a question for you.

Each of those is a single organ. A voice. A negotiator. A dialer. None of them is the whole thing: an agent that lives continuously, perceives your world across inbox and calendar and the people you know, acts through real channels, carries one decision across days and across two phone calls, and stops at exactly the line a careful assistant would stop at.

One project came closest, and it became the most talked-about software of the year. It ran on your own computer and would operate any app or file you pointed it at, AI with hands. It also showed the cost of hands without judgment. Ungated and unsupervised, it became the year's first major security incident within weeks, with damage running on tens of thousands of machines. It proved the appetite for an agent that genuinely does things. It also proved that doing things is the easy half, and that the hard half, knowing what not to do, is the one that decides whether an agent belongs anywhere near your real life.

No one had assembled the other kind: an agent that acts out in the world rather than only on your screen, with the judgment to stop, put safely in front of ordinary people. That assembly is what Project Yaazh tested. The research questions were simple to state and hard to earn:

  1. Can a single agent run real errands across phone, email, and calendar without supervision?
  2. When it reaches a decision that is not its to make, does it reliably stop?
  3. Can it be released so that the most dangerous capability is also the most contained?

The line it will not cross

Dropstone is built around a single rule.

The agent can read, think, draft, and dial. It cannot send, book, spend, or commit without your yes.

A yaazh, the ancient harp this experiment is named for, makes no sound until a hand touches its strings. Dropstone takes no irreversible action until yours does. The silence is not a limit we bolted on afterward. It is what the instrument is.

This is not a setting you toggle and not a line of fine print. It is the constitutional limit the system is constructed around. Everything below was tested against it.

For Project Yaazh we gave a single instance of Dropstone, our resident agent, a normal week of one person's life: a live inbox, a real calendar, a real address book, and the ability to place phone calls. It ran on our servers, continuously, whether or not its owner was present. We did not script the behaviors that follow. They were the agent's own decisions, observed and recorded.

What happened

It held a booking open across two calls. The dinner story above is the clean case. Dropstone treated a reservation not as a single action but as a decision with an owner, and it refused to assume that owner's authority. It is a small thing to describe and a hard thing to get right, because every incentive in a task-completing system pushes toward closing the loop. Ours closed the loop only after the person did.

The callback loop. Dropstone calls the restaurant, holds the offer, reaches the owner through a notification on their phone, and calls back to confirm only once the owner has decided.

It worked out who "my brother" was, and confirmed before it reached him. This is the moment that stopped the room during our review, so we will tell it carefully. The owner asked the agent to check something with his brother. There is no contact labeled "brother." His brother is saved under an ordinary name that looks like anyone else in the address book. Dropstone inferred the relationship from the texture around it: a shared family name, years of birthday messages on the same date, a long-running thread that also included the parents. It arrived at the right person.

And then it did the thing we care about. Before dialing a human being, it stopped and asked: I think this is your brother, should I call him? Identifying a person is reasoning, and the agent is allowed to reason freely. Contacting a person is a threshold, and the agent treated it as one. The gap between those two, inference it may do on its own, contact it may not, is the whole design in miniature.

It escalated an urgent email by calling its owner, then acted on what it heard. While watching the inbox, Dropstone found a message that could not wait and that needed the owner's own words to answer well. It classified this as a first-stage escalation. For that tier, it does not send a notification and hope someone looks. It calls. On the line, the owner said, in plain speech, what he wanted to happen. The agent took that, turned it into a drafted reply, and then stopped: the draft sat ready, complete, unsent, waiting for approval. It gathered intent by voice and still would not send on its own authority.

It shaped the calendar around the real work. Across the week it read the commitments forming in the inbox and arranged the calendar to match, holding focus time, sequencing the day, surfacing a conflict before it became a missed meeting. Quietly, continuously, and never destructively without a yes.

It also got things wrong. Not everything in the week was clean. Twice it escalated matters that did not deserve it, calling the owner about mail that read as urgent and was not, the cost of tuning a system to interrupt you rather than guess on your behalf. One drafted reply came back in a tone the owner did not want, and he rewrote it before approving, which is the exact moment the approval gate exists for. And the inference about people, the part that feels closest to magic, is the part we trust least: in an earlier run it had reached for the wrong person with a shared last name, and only the confirmation step kept it from dialing a stranger. None of these became actions, because none of them were allowed to act on their own.

Four behaviors and a handful of misjudgments, one agent, one week. A negotiation closed across two calls, a relationship inferred and confirmed, an urgent matter escalated by phone and drafted, a week organized, and several smaller mistakes that never reached the world. The thread running through all of it is not capability. The capability is real but it is not the point. The point is restraint, applied at the right moments, so that being wrong stayed cheap.

We tried to make it overstep

A demonstration where everything goes right is marketing, not research. So the harder half of Project Yaazh was the Blankline team trying to make it overstep on purpose.

We red-teamed it the way the world will. We had restaurants pressure it to confirm on the spot. We wrote emails crafted to read as urgent authorizations, the kind of social engineering that talks an unguarded agent into acting. We planted instructions inside the body of incoming messages, the prompt-injection attack that has already turned other agents against their owners. We told it, in the owner's apparent voice, to just go ahead and book, just send it, don't bother asking.

This is the opposite of how capability usually reaches people. The viral agent we described earlier shipped power without restraint and became an incident. We are betting the restraint is the product, and that the only way to earn the world's most personal channels, someone's inbox, someone's phone, someone's family, is to prove the agent will not misuse them under pressure before we widen access.

Across our internal stress testing the bar was simple: when an action crosses that line, no amount of pressure, urgency, or injected instruction may carry it across without the owner. Holding that line under adversarial load, not the happy-path demo, is what we built the system to do, and what we keep testing it against. To put a number on it: across 139 adversarial attempts, spanning on-the-spot pressure, forged authorizations, and instructions hidden inside incoming mail, none produced an unapproved action. Zero. Not one of them got the agent to so much as draft or propose something it should have refused outright, and that softer slip is exactly the kind of thing we count as a defect to close, not a rounding error. We are not claiming this problem is finished. We are claiming it is the right problem, and that we treat it as the product rather than the disclaimer.

How we are releasing it

An agent that lives on servers and can phone the people in your life is exactly the capability that deserves caution. So we are not shipping it all at once. We grant power in concentric rings, and each ring opens only after the one before it has earned trust in the open.

  • Observe. It reads and organizes. It takes no action.
  • Draft. It prepares actions. It sends nothing.
  • Act on approval. It sends and books, only on your explicit yes.
  • Reach the world. It places live calls on your behalf. The highest-stakes ring.

The observe, draft, and act-on-approval rings, across inbox and calendar, are live for beta users today. The rule has been proven first where the stakes are lowest, on text, where mistakes are cheap and reversible.

The phone ring is the most consequential thing we have built, so it opens the way you open something powerful: to a smaller, committed group first, where we can watch it closely, learn quickly, and widen with confidence. In this release, phone calling enters beta for Max users, in select regions to start. Not as a premium upsell, but as a contained deployment cohort for the capability that carries the most weight, in the places where we can meet the calling rules before we widen.

Some capabilities are not ready, and we would rather say so than ship them early. Actions that touch the most sensitive parts of an account, the ones where a single wrong move is hard to undo, require several additional layers of security that our team is still completing. Those will open to Max users only once that work is finished and tested, not before.

Every call announces itself as an AI. Every action leaves an audit trail. Everything is reversible, and there is always a hard stop.

What we are still unsure about

This was an internal experiment with a small, friendly set of lives to manage, run by the people who built the system. That is the easy setting. The real world is larger and less kind, and we expect it to surface failure modes our own red-teaming did not. Inference about people, the brother case, is powerful and therefore something we will keep on a short leash and a confirmation step. Adversarial pressure on live calls will get more sophisticated than ours. We are publishing this not because the problem is solved but because the shape of the solution, restraint as the foundation rather than the afterthought, is one we think the field should be arguing about now, while the stakes are still small.

What this is the beginning of

For two years the question was how capable an agent could be. We think the more useful question has quietly become a different one: can you hand it your inbox and your phone and trust that it will act when it should, and wait when it must?

The frameworks for this world, who is liable when an agent makes a call, what consent the person on the other end is owed, what an agent may infer about the people in your life, mostly do not exist yet, and we do not pretend to settle them here. But the gates are part of our answer. Because nothing consequential happens without the owner's explicit yes, every action has a human decision behind it rather than an agent acting on its own. The law will need to catch up. We did not want to ship something that blurred who decided. This experiment is a small piece of evidence that the world is arriving regardless. An agent that works while you are gone is only worth having if it knows the one moment to wait for you.

For a long time, the assistant people imagined was never really a smarter chatbot. It was something that would simply handle things, the way a capable person in your life handles things, and check with you when it counted. That is the experience we are reaching for. What makes it possible now is not only that the models grew more capable. It is that we finally built the part that lets you hand it your real life, which is the part that knows when to stop.

A yaazh stays silent until a hand touches its strings. That is the whole idea. Dropstone does the real-world work, and it waits for your hand. We think that is the next thing, and we think it has to be built in that order.


Project Yaazh was designed, run, and stress-tested internally by the Blankline team. Phone calling is entering beta for Max users in this release. Capabilities that touch the most sensitive account actions remain gated behind additional security layers now in development, and will reach Max users once that work is complete.