How Large Language Models and Humans Can Make Strategic Decisions Together
Over the last year, Large Language Models (LLMs) have taken the world by storm. Much of the public was first exposed to the revolutionary abilities of LLMs when they were introduced to OpenAI’s ChatGPT in late 2022.
Suddenly we were seeing people who had little to no knowledge of LLMs using ChatGPT to complete all sorts of tasks. Queries like “Explain a supernova to me like I’m 10 years old” take a complex concept and attempt to describe it more clearly. Users can also use ChatGPT to compose everything from an article to poetry, sometimes with incredibly comical results when specific styles and forms are requested. A funny limerick about Valentine’s Day? No problem. A sonnet about Star Wars? You got it. In a more practical realm, we’re seeing ChatGPT used to create and debug code, translate language, write emails, and more.
Whether it’s for work or for play, users now have even more options to choose from. Shortly after OpenAI released ChatGPT, other competitor LLMs made their own debut. Google released Bard, and Meta released LLaMA under a license that enabled academics to study, adjust and extend the internal mechanisms of LLMs. Since then, there has been a palpable rush in the tech industry as companies of all sizes are either developing their own LLM or trying to figure out how to derive value for their customers by leveraging the capabilities of a third-party LLM.
Given all this, it is only prudent for businesses to consider how LLMs can be integrated into their business processes in a responsible and ethical manner. Organizations should begin by understanding the risks LLMs bring with them and how these risks can be managed and mitigated.
Understanding the Risks of LLMs
As many users of LLMs have discovered over the last several months, LLMs have several failure modes that often come up.
First, LLMs will often hallucinate facts about the world that are not true. For example, when a journalist asked ChatGPT “When did The New York Times first report on ‘artificial intelligence’?”, the response was July 10, 1956, in an article titled “Machines Will Be Capable of Learning, Solving Problems, Scientists Predict” that was about a conference at Dartmouth College.
As the Times notes, “The conference in 1956 was real. The article is not.” Such an error is possible because when you ask an LLM a question, it can fabricate a plausible-sounding answer based on the data that it was trained on. These hallucinations are often embedded within enough information, and sometimes even correct facts, that they can fool us more often than we’d like to admit.
Second, query results may reflect biases encoded in an LLM’s training data. That’s because models based on historical data are subject to the biases of the people who originally created that data. Research has shown that LLMs may draw connections between phrases appearing in their training data that reflect stereotypes such as which professions or emotions are “masculine” or “feminine.”
Moreover, bias isn’t only perpetuated in LLMs and AI processes; sometimes it’s massively amplified. CNBC reported that AI algorithms based on historical data from Chicago amplified the discriminatory practice of “redlining” and automatically denied loan applications from African Americans.
Third, LLMs often run into difficulty applying logical thinking and working with numbers. While simple mathematical questions are often solved correctly, the more complex the reasoning required to solve a question becomes, the more risk there is that an LLM will arrive at the wrong answer.
As a blog post from Google observes, typical LLMs can be thought of as employing System 1 thinking, which is “fast, intuitive, and effortless”, but lacking the ability to tap into System 2 thinking, which is “slow, deliberate, and effortful.” System 2 type thinking is a critical component of the step-by-step reasoning required to solve many mathematical questions. To Google’s credit, their blog post outlines a new method they are developing to augment their LLM, Bard, with a degree of System 2 thinking.
In every one of these cases, it is likely that an LLM will formulate a confident, definitive, and well-written response to the query. That is perhaps the most dangerous part of an LLM: An answer is always delivered, even if it’s fictional, biased, or incorrect.
These failure modes not only impact the accuracy of an AI model grounded in an LLM (e.g. a summary of an article riddled with fake citations or broken logic isn’t helpful!) but also have ethical implications. Ultimately, your clients (and regulators as well) are going to hold your business responsible if the outputs of your AI model are inaccurate.
Guarding Against the Shortcomings of LLMs
Of course, the AI engineers developing LLMs are working hard to minimize the occurrences of these failure modes and install guardrails; indeed, the progress GPT-4 has made in lessening the occurrence of these failure modes is remarkable. However, many businesses are wary, for good reasons, of building their AI solution on top of a model hosted by another company.
Corporations are rightfully hesitant to let their proprietary data leave their own IT infrastructure, especially when that data has sensitive information about their clients. The solution to that security problem may be to construct an internal LLM, but that requires a significant investment of time and resources.
Furthermore, without owning the LLM, users are at the mercy of third-party developers. There is no guarantee that a third party will not update their LLM with little or no warning and thereby introduce new instances of the aforementioned failure modes. In a production environment, one wants strict control over when models are updated, and time is required to assess the downstream impact any changes may have.
Finally, depending on the use case, there may be concerns over scalability to support client demand, network latency, and costs.
For all these reasons, many businesses are designing their AI solutions so they aren’t reliant on a specific LLM—ideally, LLMs can be treated as plug-and-play so that businesses can switch between different third-party vendors or use their own internally developed LLMs, depending on their business needs.
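One way to picture the plug-and-play idea is a thin interface that application code depends on, with each LLM backend registered behind it. The following is only a minimal sketch: the names `TextModel`, `EchoModel`, `BACKENDS`, and `run_query` are hypothetical, and `EchoModel` is a stand-in that a real deployment would replace with a wrapper around a vendor API or an internally hosted model.

```python
from typing import Callable, Dict, Protocol


class TextModel(Protocol):
    """Hypothetical minimal interface that every LLM backend must satisfy."""

    def complete(self, prompt: str) -> str:
        ...


class EchoModel:
    """Stand-in backend for local testing; it only echoes the prompt."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# Registry mapping a configuration value to a backend factory.
# The keys are illustrative and do not refer to real products.
BACKENDS: Dict[str, Callable[[], TextModel]] = {
    "vendor_a": lambda: EchoModel("vendor_a"),
    "in_house": lambda: EchoModel("in_house"),
}


def run_query(model: TextModel, question: str) -> str:
    """Application code sees only the TextModel interface, never a vendor SDK."""
    return model.complete(question)


if __name__ == "__main__":
    model = BACKENDS["vendor_a"]()  # switching vendors is a one-line change
    print(run_query(model, "Summarize this report."))
```

Because `run_query` never names a specific vendor, swapping the backend is a configuration change rather than a rewrite of the application.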
As a result, anyone seriously considering the integration of LLMs into business processes should develop a plan for methodically characterizing the behavior patterns of candidate LLMs, in particular their accuracy and instances of failure modes, so that they can make an informed decision about which LLM to use and whether to switch to another LLM.
Characterizing and Validating LLMs
One approach to characterizing the behavior patterns of an AI solution grounded in an LLM is to use other forms of AI to analyze an LLM’s outputs. Intelligent Exploration is a methodology for data exploration that is grounded in using AI routines tightly coupled with multidimensional visualizations to discover insight and illustrate it clearly. Let’s consider some ways in which Intelligent Exploration can help us mitigate several of an LLM’s failure modes.
For example, suppose we would like to build a web application that lets clients ask an LLM some questions about traveling in another city, and of course, we do not want the LLM to recommend that our clients visit museums or other points of interest that do not exist due to hallucination within the LLM (e.g. if the question pertains to a fictional city). In developing the application responsibly, we may decide to characterize whether the presence of particular words in the query can increase the likelihood of the LLM hallucinating (instead of alerting the user that the city does not exist). One approach, driven by Intelligent Exploration, could be to:
- Develop a test set of queries, some of which involve fictional cities and some of which involve real cities;
- Train a supervised learning model (e.g. a Random Forests model) to predict whether an LLM will hallucinate in its response given the words appearing in the prompt fed to the LLM;
- Identify the three words that have the most predictive power (per the trained model);
- Create a multi-dimensional plot in which the X, Y, and Z dimensions of a data point correspond to the counts (within the query) of the three words that have the most predictive power, and with the color of each point designating whether that query triggered the LLM to hallucinate.
Such an AI-driven visualization can help rapidly identify specific combinations of words that tend to either trigger the LLM into hallucinating or steer it away from hallucinating.
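The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the article's implementation: the test set and its hallucination labels are invented, and to keep the sketch dependency-free it ranks words by information gain rather than by the feature importances of a trained Random Forests model. In a real analysis the ranking step would be replaced by the supervised model described in the list.

```python
import math
from collections import Counter
from typing import List, Tuple

# Hypothetical labeled test set: (query, did_the_llm_hallucinate).
# Real labels would come from reviewing actual LLM responses.
TEST_SET: List[Tuple[str, bool]] = [
    ("best museums in paris", False),
    ("best museums in gondor", True),
    ("hidden gems in tokyo", False),
    ("hidden gems in gondor", True),
    ("famous landmarks in rome", False),
    ("famous landmarks in atlantis", True),
    ("best restaurants in rome", False),
    ("famous museums in atlantis", True),
]


def entropy(pos: int, total: int) -> float:
    """Binary entropy (in bits) of a group with `pos` positives out of `total`."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(word: str, data: List[Tuple[str, bool]]) -> float:
    """Reduction in label uncertainty from knowing whether `word` is present."""
    present = [label for q, label in data if word in q.split()]
    absent = [label for q, label in data if word not in q.split()]
    n = len(data)
    base = entropy(sum(label for _, label in data), n)
    conditional = (len(present) / n) * entropy(sum(present), len(present))
    conditional += (len(absent) / n) * entropy(sum(absent), len(absent))
    return base - conditional


def top_words(data: List[Tuple[str, bool]], k: int = 3) -> List[str]:
    """The k words whose presence is most informative about hallucination."""
    vocab = sorted({w for q, _ in data for w in q.split()})
    ranked = sorted(vocab, key=lambda w: information_gain(w, data), reverse=True)
    return ranked[:k]


def plot_points(data: List[Tuple[str, bool]], words: List[str]) -> List[tuple]:
    """One (x, y, z, hallucinated) tuple per query; x, y, z are word counts."""
    points = []
    for query, label in data:
        counts = Counter(query.split())
        points.append((*(counts[w] for w in words), label))
    return points


if __name__ == "__main__":
    words = top_words(TEST_SET)
    print(words)
    for point in plot_points(TEST_SET, words):
        print(point)
```

The tuples returned by `plot_points` can be handed to any 3D scatter plotting tool, using the first three values as coordinates and the final flag as the point color.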
To take another example, suppose we want to use an LLM to decide when to approve a home loan based on a document summarizing a loan applicant, and we are concerned that the LLM may be inappropriately biased in which loans it suggests granting. We can use Intelligent Exploration to investigate this possible bias via the following process:
- Create a network graph in which each node in the graph is a loan application document and the strength of the connection between two documents is grounded in the degree to which those two documents are related (e.g. the number of words or phrases that co-occur in the two documents);
- Run a network community detection method (e.g. the Louvain method) to segment the network into disjoint communities;
- Run a statistical test to identify which (if any) of the communities have a proportion of rejected loan applications that is significantly different from that of the population as a whole;
- Read through a subset of the documents in a flagged community to identify whether the LLM is rejecting applicants in that community for illegitimate reasons. Alternatively, if the loan application documents are augmented with other features (e.g. income, zip code, ethnicity, race, or gender), you can use further statistical tests to identify whether a flagged community is disproportionately associated with a particular feature value.
Notably, visualizing the network graph and its communities can help ground this analysis by showing which communities are closely related to one another, which in turn can help drive further analysis.
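Two of the steps in this process, weighting the edges and testing each community, are simple enough to sketch with the Python standard library. The sketch below makes several assumptions that are not in the process as described: edge weights are the number of distinct words two documents share, community membership is taken as given (it would come from a separate community detection run such as the Louvain method), the test is a normal-approximation z-test for a single proportion, and a Bonferroni correction accounts for running one test per community. All function names and sample figures are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def edge_weights(docs: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Weight each document pair by the number of distinct words they share."""
    words = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}
    weights = {}
    for a, b in combinations(sorted(docs), 2):
        shared = len(words[a] & words[b])
        if shared:  # leave unrelated documents unconnected
            weights[(a, b)] = shared
    return weights


def proportion_p_value(rejected: int, size: int, population_rate: float) -> float:
    """Two-sided p-value of a normal-approximation z-test for one proportion."""
    std_error = math.sqrt(population_rate * (1 - population_rate) / size)
    z = (rejected / size - population_rate) / std_error
    return math.erfc(abs(z) / math.sqrt(2))


def flag_communities(
    stats: Dict[str, Tuple[int, int]], alpha: float = 0.05
) -> List[str]:
    """stats maps community id -> (rejected, size); returns the outlier ids."""
    total_rejected = sum(r for r, _ in stats.values())
    total = sum(n for _, n in stats.values())
    rate = total_rejected / total
    if rate in (0.0, 1.0):
        return []  # every application had the same outcome; nothing to compare
    threshold = alpha / len(stats)  # Bonferroni: one test per community
    return [
        community
        for community, (rejected, size) in stats.items()
        if proportion_p_value(rejected, size, rate) < threshold
    ]


if __name__ == "__main__":
    # Invented counts of (rejected applications, community size).
    communities = {"A": (20, 200), "B": (22, 200), "C": (18, 200), "D": (30, 60)}
    print(flag_communities(communities))
```

Since each community is itself part of the population it is compared against, the test is approximate; for small communities an exact test would be a more cautious choice.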
These two examples illustrate how traditional AI routines (e.g. Random Forests or the Louvain method), when combined with multi-dimensional visualization capabilities, can help identify and investigate an LLM’s behavioral patterns and biases. Moreover, these processes can be run periodically to understand how the behavior and biases of a third-party LLM may be changing over time, or to compare an LLM you are considering switching to against the one you are using now.
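A periodic check of this kind can be as simple as rerunning a fixed test suite and comparing outcomes query by query. The sketch below assumes each run is stored as a mapping from query to a boolean "failed" flag (for instance, whether the response contained a hallucination); the function names and sample data are hypothetical.

```python
from typing import Dict, List


def failure_rate(results: Dict[str, bool]) -> float:
    """Fraction of test queries whose response was judged a failure."""
    return sum(results.values()) / len(results)


def flipped_queries(
    baseline: Dict[str, bool], candidate: Dict[str, bool]
) -> Dict[str, List[str]]:
    """Queries whose outcome differs between two runs of the same test suite."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "regressed": [q for q in shared if candidate[q] and not baseline[q]],
        "fixed": [q for q in shared if baseline[q] and not candidate[q]],
    }


if __name__ == "__main__":
    # Invented outcomes: True means the response was judged a failure.
    last_month = {"q1": False, "q2": True, "q3": False, "q4": False}
    this_month = {"q1": False, "q2": False, "q3": True, "q4": False}
    print(failure_rate(last_month), failure_rate(this_month))
    print(flipped_queries(last_month, this_month))
```

Looking at the flipped queries, and not only the aggregate rate, matters because two runs can have the same overall failure rate while failing on different inputs.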
LLMs can bring significant benefits when used correctly, but they can also invite large amounts of risk. It’s up to organizations to find ways, such as developing and maintaining a suite of analytical routines grounded by Intelligent Exploration, that allow them to confidently leverage LLMs to solve business problems in a responsible, informed, and ethical manner.
About the author: Dr. Sagar Indurkhya heads the NLP group at Virtualitics, Inc. He has over eight years of experience with natural language processing (NLP) and publications in top journals and conferences in the field of computational linguistics, as well as experience consulting with a number of companies that contract for the DoD. His research work has focused on high-precision semantic parsing, the development of computational models of language acquisition grounded in linguistic theory, and black-box analysis of deep neural network-based NLP systems. He holds a Ph.D. in Computer Science from the Massachusetts Institute of Technology (MIT) with a focus on Computational Linguistics, a Master of Engineering in Electrical Engineering & Computer Science from MIT, and a B.S. in Computer Science and Engineering from MIT.