

(Adam Flaherty/Shutterstock)
AI’s black box problem has been building ever since deep learning models started gaining traction about 10 years ago. But now that we’re in the post-ChatGPT era, the black box fears of 2022 seem quaint to Shayan Mohanty, co-founder and CEO at Watchful, a San Francisco startup hoping to deliver more transparency into how large language models work.
“It’s almost hilarious in hindsight,” Mohanty says. “Because when people were talking about black box AI before, they were just talking about big, complicated models, but they were still writing that code. They were still running it within their four walls. They owned all the data they were training it on.
“But now we’re in this world where it’s like OpenAI is the only one who can touch and feel that model. Anthropic is the only one who can touch and feel their model,” he continues. “As the user of those models, I only have access to an API, and that API allows me to send a prompt, get a response, or send some text and get an embedding. And that’s all I have access to. I can’t actually interpret what the model itself is doing, why it’s doing it.”
That lack of transparency is a problem, not only from a regulatory perspective but also from a practical one. If users have no way to measure whether their prompts to GPT-4 are eliciting good responses, they have no way to improve them.
One existing method for eliciting this kind of feedback from an LLM, called integrated gradients, allows users to determine how the input to an LLM impacts its output. “It’s almost like you have a bunch of little knobs,” Mohanty says. “These knobs might represent words in your prompt, for instance…As I tune things up, I see how that changes the response.”
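To make the "knobs" intuition concrete, here is a minimal sketch of integrated gradients on a toy differentiable model (a single tanh neuron with made-up weights standing in for a real network); the path integral is approximated with a midpoint Riemann sum over numerical gradients. This is an illustration of the general technique, not Watchful's or any vendor's implementation:

```python
import math

def model(x, w):
    # Toy differentiable "model": a single tanh neuron standing in
    # for a network we could compute gradients through.
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def grad(x, w, eps=1e-6):
    # Central-difference numerical gradient w.r.t. each input feature.
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        g.append((model(xp, w) - model(xm, w)) / (2 * eps))
    return g

def integrated_gradients(x, w, steps=50):
    # Approximate the path integral from an all-zeros baseline to the
    # input with a midpoint Riemann sum of gradients along the path.
    baseline = [0.0] * len(x)
    total = [0.0] * len(x)
    for k in range(1, steps + 1):
        alpha = (k - 0.5) / steps
        point = [b + alpha * (xi - b) for b, xi in zip(baseline, x)]
        total = [t + gi for t, gi in zip(total, grad(point, w))]
    # Scale by (input - baseline) / steps to finish the approximation.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [0.8, 0.1, 0.5]   # hypothetical input features (the "knobs")
w = [1.5, -0.3, 0.7]  # hypothetical fixed model weights
attributions = integrated_gradients(x, w)
# Completeness property: attributions sum to model(x) - model(baseline).
delta = model(x, w) - model([0.0] * 3, w)
```

The completeness property, that the per-feature attributions sum to the total change in output, is what makes the knob interpretation meaningful. Note that the method needs gradient access to the model, which is exactly what API-only users of hosted LLMs lack.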
The problem with integrated gradients is that it’s prohibitively expensive to run. While it might be feasible for large companies to use it on a model they host themselves, such as Llama-2 from Meta AI, it’s not a practical solution for the many users of vendor-hosted models, such as OpenAI’s.
“The problem is that there aren’t just well-defined methods to infer” how an LLM is running, he says. “There aren’t well-defined metrics that you can just look at. There’s no canned solution to any of this. So all of this is going to have to be basically greenfield.”
Greenfielding Blackbox Metrics
Mohanty and his colleagues at Watchful have taken a stab at creating performance metrics for LLMs. After a period of research, they hit upon a new technique that delivers results that are similar to the integrated gradients technique, but without the huge expense and without needing direct access to the model.
“You can apply this approach to GPT-3, GPT-4, GPT-5, Claude–it doesn’t really matter,” he says. “You can plug in any model to this process, and it’s computationally efficient and it predicts really well.”
The company today unveiled two LLM metrics based on that research: Token Importance Estimation and Model Uncertainty Scoring. Both metrics are free and open source.
Token Importance Estimation gives AI developers an estimate of token importance within prompts using advanced text embeddings. You can read more about it here. Model Uncertainty Scoring, meanwhile, evaluates the uncertainty of LLM responses, along the lines of conceptual and structural uncertainty. You can read more about it at this link.
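The article doesn’t spell out how Model Uncertainty Scoring is computed, but one plausible, purely illustrative proxy in the same embedding-based spirit is the dispersion of embeddings of several responses sampled for the same prompt: if the model keeps saying essentially the same thing, uncertainty is low; if the sampled responses scatter across the embedding space, uncertainty is high. The function names and vectors below are hypothetical:

```python
from itertools import combinations
from math import sqrt

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def uncertainty_score(response_embeddings):
    # Hypothetical uncertainty proxy: mean pairwise dissimilarity of
    # embeddings of several responses sampled for one prompt. High
    # dispersion suggests the model is unsure what to say.
    pairs = list(combinations(response_embeddings, 2))
    dissim = [1.0 - cosine_similarity(a, b) for a, b in pairs]
    return sum(dissim) / len(dissim)

# Toy embeddings: tightly clustered responses vs. scattered ones.
consistent = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.1], [1.0, 0.1, 0.0]]
scattered = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
low = uncertainty_score(consistent)
high = uncertainty_score(scattered)
```

In a real pipeline the toy vectors would come from an embeddings API applied to multiple sampled completions; the point is only that uncertainty can be estimated from the outside, without model internals.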
Both of the new metrics are based on Watchful’s research into how LLMs interact with the embedding space: the multi-dimensional space where text inputs are translated into numerical vectors, or embeddings, and where the relative proximity of those vectors can be calculated, a mechanism central to how LLMs work.

Watchful’s new Token Importance Estimator tells you which words in your prompt have the biggest impact (Image source: Watchful)
LLMs like GPT-4 are estimated to have around 1,500 dimensions in their embedding space, which is simply beyond human comprehension. But Watchful has come up with a way to programmatically poke and prod at that mammoth embedding space through prompts sent via API, in effect gradually exploring how it works.
“What’s happening is that we take the prompt and we just keep changing it in known ways,” Mohanty says. “So for instance, you could drop each token one by one, and you could see, okay, if I drop this word, here’s how it changes the model’s interpretation of the prompt.”
While the embedding space is very large, it’s finite. “You’re just given a prompt, and you can change it in various ways that again, are finite,” Mohanty says. “You just keep re-embedding that, and you see how those numbers change. Then we can calculate statistically, what the model is likely doing based on seeing how changing the prompt affects the model’s interpretation in the embedding space.”
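The drop-one-token loop Mohanty describes can be sketched in a few lines. Here the `embed` function is a toy bag-of-words stand-in for a real embedding API call, and the vocabulary and prompt are illustrative; Watchful’s actual statistical machinery is more involved than this:

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Stand-in for a vendor embeddings API call; here, a toy
    # bag-of-words vector over a small fixed vocabulary.
    vocab = ["summarize", "the", "quarterly", "report",
             "in", "two", "sentences", "please"]
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def token_importance(prompt):
    # Drop each token in turn, re-embed the perturbed prompt, and
    # measure how far it moves in embedding space. Bigger moves
    # suggest more important tokens.
    tokens = prompt.split()
    base = embed(prompt)
    scores = {}
    for i, tok in enumerate(tokens):
        perturbed = " ".join(tokens[:i] + tokens[i + 1:])
        scores[tok] = cosine_distance(base, embed(perturbed))
    return scores

scores = token_importance(
    "Summarize the quarterly report in two sentences please")
```

Against a real model, each `embed` call would hit the provider’s embeddings endpoint, so the cost scales with prompt length in tokens rather than with model size, which is what makes the approach tractable for API-only users.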
The result of this work is a tool that might show that the very large prompts a customer is sending GPT-4 are not having the desired impact. Perhaps the model is simply ignoring two of the three examples that are included in the prompt, Mohanty says. That could allow the user to immediately reduce the size of the prompt, saving money and providing a timelier response.
Better Feedback for Better AI
It’s all about providing a feedback mechanism that has been missing up to this point, Mohanty says.
“Once someone wrote a prompt, they didn’t really know what they needed to do differently to get a better result,” Mohanty says. “Our goal with all this research is just to peel back the layers of the model, allow people to understand what it’s doing, and do it in a model-agnostic way.”
The company is releasing the tools as open source as a way to kickstart the movement toward better understanding of LLMs and toward fewer black box question marks. Mohanty would expect other members of the community to take the tools and build on them, such as integrating them with LangChain and other components of the GenAI stack.
“We think it’s the right thing to do,” he says about open sourcing the tools. “We’re not going to arrive at a point very quickly where everyone converges, where these are the metrics that everyone cares about. The only way we get there is by everyone sharing how you’re thinking about this. So we took the first couple of steps, we did this research, we discovered these things. Instead of gating that and only allowing it to be seen by our customers, we think it’s really important that we just put it out there so that other people can build on top of it.”
Eventually, these metrics could form the basis for an enterprise dashboard that would inform customers how their GenAI applications are functioning, sort of like TensorBoard does for TensorFlow. That product would be sold by Watchful. In the meantime, the company is content to share its knowledge and help the community move toward a place where more light can shine on black box AI models.
Related Items:
Opening Up Black Boxes with Explainable AI
In Automation We Trust: How to Build an Explainable AI Model
It’s Time to Implement Fair and Ethical AI