Show HN: Magnitude – open-source, AI-native test framework for web apps
Hey HN, Anders and Tom here - we’ve been building an end-to-end testing framework powered by visual LLM agents to replace traditional web testing.
We know there's a lot of noise about different browser agents. If you've tried any of them, you know they're slow, expensive, and inconsistent. That's why we built an agent specifically for running test cases and optimized it just for that:
- Pure vision instead of the error-prone "set-of-marks" system (the colorful boxes you see in browser-use, for example)
- Use a tiny VLM (Moondream) instead of OpenAI/Anthropic computer use for dramatically faster and cheaper execution
- Use two agents: one for planning and adapting test cases and one for executing them quickly and consistently.
The idea is the planner builds up a general plan which the executor runs. We can save this plan and re-run it with only the executor for quick, cheap, and consistent runs. When something goes wrong, it can kick back out to the planner agent and re-adjust the test.
It’s completely open source. Would love to have more people try it out and tell us how we can make it great.
> The idea is the planner builds up a general plan which the executor runs. We can save this plan and re-run it with only the executor for quick, cheap, and consistent runs. When something goes wrong, it can kick back out to the planner agent and re-adjust the test.
I've recently been thinking about testing/QA with VLMs + LLMs. One area I haven't seen explored (but should be 100% feasible) is to have the first run be LLM + VLM, and then have the LLM(s?) write repeatable "cheap" tests with traditional libraries (Playwright, Puppeteer, etc.). On every run you do the "cheap" traditional checks; if any fail, go with the LLM + VLM again and see what broke, and only fail the test if both fail. Does that make sense?
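Roughly what I have in mind, as a sketch (the helper functions here are made up, not any real library's API):

```typescript
import { chromium, Page } from 'playwright';

type CheckResult = { passed: boolean; reason?: string };

// Sketch of the two-tier idea: cheap deterministic checks first, agent second.
// runCheapCheck and runAgentCheck are hypothetical stand-ins.
async function runHybridTest(
  url: string,
  runCheapCheck: (page: Page) => Promise<CheckResult>,  // generated Playwright assertions
  runAgentCheck: (url: string) => Promise<CheckResult>   // full LLM + VLM pass
): Promise<CheckResult> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // Fast path: the traditional checks written on a previous agent run.
  const cheap = await runCheapCheck(page);
  await browser.close();
  if (cheap.passed) return cheap;

  // Slow path: only re-run the LLM + VLM when the cheap checks fail,
  // and only fail the test if this pass fails too.
  return runAgentCheck(url);
}
```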
So this is a path that we definitely considered. However, we think it's a half-measure to generate actual Playwright code and just run that. If you do that, you still have a brittle test at the end of the day, and once it breaks you would need to pull in an LLM to try to adapt it anyway.
Instead of caching actual code, we cache a "plan" of specific web actions that are still described in natural language.
For example, a cached "typing" action might look like: { variant: 'type'; target: string; content: string; }
The target is a natural language description. The content is what to type. Moondream's job is simply to find the target, and then we will click into that target and type the content. This means it can be full vision and not rely on the DOM at all, while still being very consistent. Moondream is also trivially cheap to run since it's only a 2B model. If it can't find the target, or its confidence changed significantly (using token probabilities), that's an indication that the action/plan requires adjustment, and we can dynamically swap in the planner LLM to decide how to adjust the test from there.
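To make that concrete, here's a rough sketch of executing one cached action with a confidence gate (the locator call and threshold are illustrative, not our actual code):

```typescript
import { Page } from 'playwright';

// Shape of a cached action, as described above.
type TypeAction = { variant: 'type'; target: string; content: string };

// Stand-in for a Moondream "point at this target" call, assumed to return
// pixel coordinates plus a confidence derived from token probabilities.
type Locator = (screenshot: Buffer, target: string) =>
  Promise<{ x: number; y: number; confidence: number } | null>;

const CONFIDENCE_FLOOR = 0.5; // illustrative threshold, not a real default

// Returns true if the cached action ran cleanly, false if the planner
// should be pulled back in to adjust the plan.
async function runCachedType(page: Page, action: TypeAction, locate: Locator): Promise<boolean> {
  const screenshot = await page.screenshot();
  const hit = await locate(screenshot, action.target);

  // Target not found, or the model is much less sure than before:
  // signal that this step needs re-planning.
  if (!hit || hit.confidence < CONFIDENCE_FLOOR) return false;

  await page.mouse.click(hit.x, hit.y);
  await page.keyboard.type(action.content);
  return true;
}
```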
Did you consider also caching the coordinates returned by Moondream? I understand that it is cheap, but it could be useful to detect if an element has changed position, as that may be a regression.
So the problem is that if we cache the coordinates and click blindly at the saved positions, there's no way to tell if the interface changed or if we are actually clicking the wrong things (unless we try to do something hacky like listen for events on the DOM). Detecting whether elements have changed position, though, would definitely be feasible when re-running a test with Moondream; we could compare against the coordinates of the last run.
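Something like this (purely illustrative) could flag position drift without ever clicking blindly:

```typescript
// Compare where Moondream finds a target now against where it found it on
// the previous run, and surface large moves as a possible layout regression.
type Point = { x: number; y: number };

function driftWarning(target: string, previous: Point, current: Point, tolerancePx = 24): string | null {
  const distance = Math.hypot(current.x - previous.x, current.y - previous.y);
  if (distance <= tolerancePx) return null;
  return `"${target}" moved ${Math.round(distance)}px since the last run ` +
         `(from ${previous.x},${previous.y} to ${current.x},${current.y})`;
}
```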
sounds a lot like snapshot testing
This is exactly our workflow, though we defined our own YAML spec [1] for reasons mentioned in previous comments.
We have multiple fallbacks to prevent flakes: the "cheap" command, a description of the intended step, and the original prompt.
If any step fails, we fall back to the next source.
1. https://docs.testdriver.ai/reference/test-steps
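In pseudocode, the fallback chain is roughly this (generic stand-ins, not our actual implementation):

```typescript
// Each step has three sources, tried in order until one succeeds.
type StepOutcome = { ok: boolean };

interface StepSources {
  runCheapCommand: () => Promise<StepOutcome>;    // e.g. a cached selector/coordinate action
  runFromDescription: () => Promise<StepOutcome>; // re-resolve the step from its description
  runFromPrompt: () => Promise<StepOutcome>;      // last resort: the original prompt + agent
}

async function runStepWithFallbacks(sources: StepSources): Promise<StepOutcome> {
  for (const attempt of [sources.runCheapCommand, sources.runFromDescription, sources.runFromPrompt]) {
    const result = await attempt();
    if (result.ok) return result;
  }
  return { ok: false };
}
```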
Thanks for sharing, this looks interesting.
However, I do not see a big advantage over Cypress tests.
The article mentions shortcomings of Cypress (and Playwright):
> They start a dev server with bootstrapping code to load the component and/or setup code you want, which limits their ability to handle complex enterprise applications that might have OAuth or a complex build pipeline.
The simple solution is to containerise the whole application (including whatever OAuth provider is used), which then allows you to simply launch the whole thing and run the tests. Most apps (especially in enterprise) should already be containerised anyway, so most of the time we can just go ahead and run any tests against them.
How is SafeTest better than that when my goal is to test my application in a real world scenario?
This looks pretty cool, at least at first glance. I think "traditional web testing" means different things to different people. Last year, the Netflix engineering team published "SafeTest" [1], an interesting hybrid/superset of unit and e2e testing. Have you guys (Magnitude devs) considered incorporating any of their ideas?
1. https://netflixtechblog.com/introducing-safetest-a-novel-app...
Looks cool! Thanks for sharing! The idea of having a hybrid framework for component unit testing + end-to-end testing is neat. Will definitely consider how this might be applicable to Magnitude.
Any advice about using AI to write test cases? For example, recording a video while using an app and converting that to test cases. Seems like it should work.
It looks pretty cool. One thing that has bothered me a bit with Playwright is audio input. With modern AI applications, speech recognition is often integrated, but with Playwright, using voice as an input does not seem straightforward. Given that Magnitude has an AI focus, adding a feature like that would be great.
This is pretty much exactly what I was going to build. It's missing a few things, so I'll either be contributing or forking this in the future.
I'll need a way to extract data as part of the tests, like screenshots and page content. This will allow supplementing the tests with non-Magnitude features, as well as adding things that are a bit more deterministic: assert that the added todo item exactly matches what was used as input data, screenshot diffs when the planner fallback came into play, execution log data, etc.
This isn't currently possible from what I can see in the docs, but maybe I'm wrong?
It'd also be ideal if it had an LLM-free executor mode to reduce costs and increase speed (caching outputs, or maybe using the accessibility tree instead of a VLM), and also to fit requirements where the planner should not automatically kick in.
Hey, awesome to hear! We are definitely open to contributions :)
We plan to (very soon) enable mixing standard Playwright or other code in between Magnitude steps, which should enable doing exact assertions or anything else you want to do.
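As a purely hypothetical sketch of what that mixing could look like (not our actual API):

```typescript
// Hypothetical shape only -- the idea is that natural-language steps and
// exact, deterministic Playwright assertions could interleave in one test.
import { test, expect } from '@playwright/test';

test('add a todo item', async ({ page }) => {
  await page.goto('https://example.com/todos'); // placeholder URL

  // Agent-driven step described in natural language (hypothetical helper):
  // await magnitude.step(page, 'add a todo called "Buy milk"');

  // Exact assertion in plain Playwright for the parts you want deterministic.
  await expect(page.getByText('Buy milk')).toBeVisible();
});
```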
Definitely understand the need to reduce costs / increase speed, which we think will mainly be addressed by our plan-caching system, where cached plans get executed by Moondream (a 2B model). Moondream is very fast and also has self-hosted options. However, there's no reason we couldn't potentially have an option to generate pure Playwright for people who would prefer that instead.
We have a discord as well if you'd like to easily stay in touch about contributing: https://discord.gg/VcdpMh9tTy
Interesting! My first concern is - isn’t this the ultimate non-deterministic test? In practice, does it seem flaky?
So the architecture is built with determinism in mind. The plan-caching system is still a work in progress, but especially once fully implemented it should be very consistent. As long as your interface doesn't change (or changes in trivial ways), Moondream alone can execute the same exact web actions as previous test runs without relying on any DOM selectors. When the interface does eventually change, that's where it becomes non-deterministic again by necessity, since the planner will need to generatively update the test and continue building the new cache from there. However once it's been adapted, it can once again be executed that way every time until the interface changes again.
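The control flow is roughly this (illustrative pseudologic, not the actual implementation):

```typescript
// Replay the cached plan with the small executor; only call the big planner
// when a cached step stops working, then cache the adjusted steps.
type CachedStep = { description: string };

interface Agents {
  executeStep: (step: CachedStep) => Promise<boolean>; // Moondream-only execution
  replanFrom: (failed: CachedStep, remaining: CachedStep[]) => Promise<CachedStep[]>; // planner LLM
}

async function runWithPlanCache(plan: CachedStep[], agents: Agents): Promise<CachedStep[]> {
  const updatedPlan: CachedStep[] = [];
  let queue = [...plan];

  while (queue.length > 0) {
    const step = queue.shift()!;
    if (await agents.executeStep(step)) {
      updatedPlan.push(step); // deterministic path: the cache still holds
      continue;
    }
    // Interface changed: the planner rewrites this step (and possibly the
    // rest), and the new steps become the cache for future runs.
    queue = await agents.replanFrom(step, queue);
  }
  return updatedPlan;
}
```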
In a way, nondeterminism could be an advantage. Instead of using these as unit tests, use them as usability tests. Especially if you want your site to be accessible by AI agents, it would be good to have a way of knowing what tweaks increase the success rate.
Of course that would be even more valuable for testing your MCP or A2A services, but could be useful for UI as well. Or it could be useless. It would be interesting to see if the same UI changes affect both human and AI success rate in the same way.
And if not, could an AI be trained to correlate more closely with human behavior? That could be a good selling point if possible.
Originally we were actually thinking about doing exactly this and building agents for usability testing. However, we think that LLMs are much better suited for tackling well defined tasks rather than trying to emulate human nuance, so we pivoted to end-to-end testing and figuring out how to make LLM browser agents act deterministically.
I know moondream is cheap / fast and can run locally, but is it good enough? In my experience testing things like Computer Use, anything but the large LLMs has been so unreliable as to be unworkable. But maybe you guys are doing something special to make it work well in concert?
So it's key to still have a big model that is devising the overall strategy for executing the test case. Moondream on its own is pretty limited and can't handle complex queries. The planner gives very specific instructions to Moondream, which is just responsible for locating different targets on the screen. It's basically just the layer between the big LLM doing the actual "thinking" and grounding that to specific UI interactions.
Where it gets interesting is that we can save the execution plan that the big model comes up with and run with ONLY Moondream if the plan is specific enough, then switch back out to the big model if some action path requires adjustment. This means we can run repeated tests much more efficiently and consistently.
Ooh, I really like the idea about deciding whether to use the big or small model based on task specificity.
You might like https://pypi.org/project/llm-predictive-router/
Oh this is interesting. In our case we are being very specific about which types of prompts go where, so the planner essentially creates prompts that will be executed by Moondream, instead of trying to route prompts generally to the appropriate model. The types of requests that our planner agent vs Moondream can handle are fundamentally different for our use case.
Interesting, will check out yours. I'm mostly interested in these dynamic routers so I can mix local and API-based models depending on needs; I can't run some models locally, but most of the tasks don't even require such power (I'm building AI agentic systems).
there's also https://github.com/lm-sys/RouteLLM
and other similar
I guess your system is not as oriented toward open-ended tasks, so you can just build workflows deciding which model to use at each step; these routing mechanisms are more useful for open-ended tasks that don't fit into a workflow so well (maybe?).
> Pure vision instead of the error-prone "set-of-marks" system (the colorful boxes you see in browser-use, for example)
One benefit of not using pure vision is that it's a strong signal to developers to make pages accessible. This would let them off the hook.
Perhaps testing both paths separately would be more appropriate. I could imagine a different AI agent attempting to navigate the page through accessibility landmarks. Or even different agents that simulate different types of disabilities.
Yeah, good criticism for sure. We definitely want to keep this in mind as we continue to build. Some kind of accessibility tests that run in parallel with each visual test and are only allowed to use the accessibility tree could make it much easier for developers to identify how to address different accessibility concerns.
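For example, a rough sketch of an accessibility-tree-only check built on Playwright's accessibility snapshot (illustrative, not something we ship today):

```typescript
import { chromium } from 'playwright';

// Walk Playwright's accessibility snapshot and check that a control with the
// expected role and name is reachable without using any vision at all.
async function hasAccessibleControl(url: string, role: string, name: string): Promise<boolean> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  const tree = await page.accessibility.snapshot();
  await browser.close();

  const matches = (node: any): boolean =>
    (node?.role === role && node?.name === name) ||
    (node?.children ?? []).some(matches);

  return tree ? matches(tree) : false;
}
```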
Why not make the strong model compile a non-ai-driven test execution plan using selectors / events? Is Moondream that good?
Definitely a good question. Using an actual LLM as the execution layer allows us to more easily swap to the planner agent in the case that the test needs to be adapted. We don’t want to store just a selector-based test because it’s difficult to determine when it requires adaptation, and it's inherently more brittle to subtle UI changes. We think using a tiny model like Moondream makes this cheap enough that these benefits outweigh an approach where we cache actual Playwright code.
Does it only work for node projects? Can I run it against a Staging environment without mixing it with my project?
You can run it against any URL, not just node projects! You'll still need a skeleton node project for the actual Magnitude tests, but you could configure some other public or staging URL as the target site.
How does Magnitude differentiate between the planner and executor LLM roles, and how customizable are these components for specific test flows?
So the prompts that are sent to the planner vs. the executor are completely distinct. We allow complete customization of the planner LLM with all major providers (Anthropic, OpenAI, Google AI Studio, Google Vertex AI, AWS Bedrock, OpenAI-compatible). The executor LLM, on the other hand, has to fit very specific criteria, so we only support the Moondream model right now. For a model to act as the executor, it needs to be able to output specific pixel coordinates (only a few models support this, for example OpenAI/Anthropic computer use, Molmo, Moondream, and some others). We like Moondream because it's super tiny and fast (2B). This means that as long as we still have a "smart" planner LLM, we can have very fast/cheap execution and precise UI interaction.
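To illustrate the split, here's a hypothetical config shape (not our actual configuration format):

```typescript
// Hypothetical config shape to illustrate the planner/executor split --
// not Magnitude's real configuration format.
interface AgentConfig {
  planner: {
    provider: 'anthropic' | 'openai' | 'google-ai-studio' | 'vertex-ai' | 'bedrock' | 'openai-compatible';
    model: string;        // any capable chat model, fully swappable
  };
  executor: {
    model: 'moondream';   // must be able to output pixel coordinates, so choices are narrow
    baseUrl?: string;     // e.g. a self-hosted Moondream endpoint
  };
}

const exampleConfig: AgentConfig = {
  planner: { provider: 'google-ai-studio', model: 'gemini-2.5-pro' },
  executor: { model: 'moondream' },
};
```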
does Moondream handle multi-step UI tasks reliably (like opening a menu, waiting for render, then clicking), or do you have to scaffold that logic separately in the planner?
The planner can plan out multiple web actions at once, which Moondream can then execute in sequence on its own. So Moondream is never deciding how to execute more than one web action in a single prompt.
What this really means for developers writing the tests is you don't really have to worry about it. A "step" in Magnitude can map to any number of web actions dynamically based on the description, and the agents will figure out how to do it repeatably.
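For example, a single step description might get cached as several low-level actions (the contents here are just illustrative):

```typescript
// One natural-language step expanding into several low-level web actions
// that the executor can replay one at a time.
type WebAction =
  | { variant: 'click'; target: string }
  | { variant: 'type'; target: string; content: string }
  | { variant: 'press'; key: string };

const step = 'create a todo called "Water the plants"';

const cachedActions: WebAction[] = [
  { variant: 'click', target: 'new todo input field at the top of the list' },
  { variant: 'type', target: 'new todo input field', content: 'Water the plants' },
  { variant: 'press', key: 'Enter' },
];
```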
Bang me sideways, "AI-native" is a thing now? What does that even mean?
It definitely means something; on first hearing it, probably an app designed around being interacted with by an LLM. Browser interaction is one of those things that is a great killer app for LLMs, IMO.
For instance, I just discovered there are a ton of high quality scans of film and slides available at the Library of Congress website, but I don't really enjoy their interface. I could build a scraping tool and get too much info, or suffer through just clicking around their search UI. Or I could ask my browser-tool-wielding LLM agent to automate the boring stuff and provide a map of the subjects I would be interested in, giving me a different way to discover things. I've just discovered the entire browser automation thing, and I'm having fun having my LLM go "research" for a few minutes while I go do something else.
Had to look it up too! https://www.studioglobal.ai/blog/what-is-ai-native
Well yeah it's kind of ambiguous, it's just our way of saying that we're trying to use AI to make testing easier!
Hi, this looks great! Any plans to support Azure OpenAI as a backend?
Hey! We can add this pretty easily! We find that Gemini 2.5 Pro works the best as the planner model by a good margin, but we definitely want to support a variety of providers. I'll keep this in mind and implement it soon!
edit: tracking here https://github.com/magnitudedev/magnitude/issues/6