October 5, 2023

How Large Language Models and Humans Can Make Strategic Decisions Together

Sagar Indurkhya


Over the last year, Large Language Models (LLMs) have taken the world by storm. Much of the public was first exposed to the revolutionary abilities of LLMs when they were introduced to OpenAI’s ChatGPT in late 2022.

Suddenly we were seeing people who had little to no knowledge of LLMs using ChatGPT to complete all sorts of tasks. Queries like “Explain a supernova to me like I’m 10 years old” take a complex concept and attempt to describe it more clearly. Users can also use ChatGPT to compose everything from an article to poetry, sometimes with incredibly comical results when specific styles and forms are requested. A funny limerick about Valentine’s Day? No problem. A sonnet about Star Wars? You got it. In a more practical realm, we’re seeing ChatGPT used to create and debug code, translate language, write emails, and more.

Whether it’s for work or for play, users now have even more options to choose from. Shortly after OpenAI released ChatGPT, other competitor LLMs made their own debut. Google released Bard, and Meta released LLaMA under a license that enabled academics to study, adjust and extend the internal mechanisms of LLMs. Since then, there has been a palpable rush in the tech industry as companies of all sizes are either developing their own LLM or trying to figure out how to derive value for their customers by leveraging the capabilities of a third-party LLM.

Given all this, it is only prudent for businesses to consider how LLMs can be integrated into their business processes in a responsible and ethical manner. Organizations should begin by understanding the risks LLMs bring with them and how these risks can be managed and mitigated.

Understanding the Risks of LLMs

As many users have discovered over the last several months, LLMs exhibit several recurring failure modes.


First, LLMs will often hallucinate facts about the world that are not true. For example, when a journalist asked ChatGPT "When did The New York Times first report on 'artificial intelligence'?", the response cited July 10, 1956, and an article titled "Machines Will Be Capable of Learning, Solving Problems, Scientists Predict" about a conference at Dartmouth College.

As the Times notes, “The conference in 1956 was real. The article is not.” Such an error is possible because when you ask an LLM a question, it can fabricate a plausible-sounding answer based on the data that it was trained on. These hallucinations are often embedded within enough information, and sometimes even correct facts, that they can fool us more often than we’d like to admit.

Second, query results may reflect biases encoded in an LLM’s training data. That’s because models based on historical data are subject to the biases of the people who originally created that data. Research has shown that LLMs may draw connections between phrases appearing in their training data that reflect stereotypes such as which professions or emotions are “masculine” or “feminine.”

Moreover, bias isn’t only perpetuated in LLMs and AI processes; sometimes it’s massively amplified. CNBC reported that historical data from Chicago meant that AI algorithms based on that data amplified the discriminatory process of “redlining” and automatically denied loan applications from African Americans.

Third, LLMs often run into difficulty applying logical thinking and working with numbers. While simple mathematical questions are often solved correctly, the more complex the reasoning required to solve a question becomes, the more risk there is that an LLM will arrive at the wrong answer.

As a blog post from Google observes, typical LLMs can be thought of as employing System 1 thinking, which is “fast, intuitive, and effortless”, but lacking the ability to tap into System 2 thinking, which is “slow, deliberate, and effortful.” System 2 type thinking is a critical component of the step-by-step reasoning required to solve many mathematical questions. To Google’s credit, their blog post outlines a new method they are developing to augment their LLM, Bard, with a degree of System 2 thinking.

In every one of these cases, it is likely that an LLM will formulate a confident, definitive, and well-written response to the query. That is perhaps the most dangerous part of an LLM: An answer is always delivered, even if it’s fictional, biased, or incorrect.

These failure modes not only impact the accuracy of an AI model grounded in an LLM (e.g. a summary of an article riddled with fake citations or broken logic isn’t helpful!) but also have ethical implications. Ultimately, your clients (and regulators as well) are going to hold your business responsible if the outputs of your AI model are inaccurate.

Guarding Against the Shortcomings of LLMs

Of course, the AI engineers developing LLMs are working hard to minimize the occurrences of these failure modes and install guardrails—indeed, the progress GPT-4 has made in lessening the occurrence of these failure modes is remarkable. However, many businesses are wary of building their AI solution on top of a model hosted by another company, for good reason.


Corporations are rightfully hesitant to let their proprietary data leave their own IT infrastructure, especially when that data has sensitive information about their clients. The solution to that security problem may be to construct an internal LLM, but that requires a significant investment of time and resources.

Furthermore, without owning the LLM, users are at the mercy of third-party developers. There is no guarantee that a third party will not update their LLM model with little or no warning, and thereby introduce new examples of the aforementioned failure modes; indeed, in a production environment one wants to have strict control over when models are updated, and time is required to assess the downstream impact any changes may have.

Finally, depending on the use case, there may be concerns over scalability to support client demand, network latency, and costs.

For all these reasons, many businesses are designing their AI solutions so they aren’t reliant on a specific LLM—ideally, LLMs can be treated as plug-and-play so that businesses can switch between different third-party vendors or use their own internally developed LLMs, depending on their business needs.

As a result, anyone seriously considering the integration of LLMs into business processes should develop a plan for methodically characterizing the behavior patterns — in particular accuracy and instances of failure modes — so that they can make an informed decision about which LLM to use and whether to switch to another LLM.

Characterizing and Validating LLMs

One approach to characterizing the behavior patterns of an AI solution grounded in an LLM is to use other forms of AI to analyze an LLM's outputs. Intelligent Exploration is a methodology for data exploration that is grounded in using AI routines tightly coupled with multidimensional visualizations to discover insight and illustrate it clearly. Let's consider some ways in which Intelligent Exploration can help us mitigate several of these failure modes.

For example, suppose we would like to build a web application that lets clients ask an LLM questions about traveling in another city. Naturally, we do not want the LLM to hallucinate and recommend that our clients visit museums or other points of interest that do not exist (e.g. if the question pertains to a fictional city). In developing the application responsibly, we may decide to characterize whether the presence of particular words in the query increases the likelihood of the LLM hallucinating (instead of alerting the user that the city does not exist). One approach, driven by Intelligent Exploration, could be to:


  • Develop a test set of queries, some of which involve fictional cities and some of which involve real cities;
  • Train a supervised learning model (e.g. a Random Forests model) to predict whether an LLM will hallucinate in its response given the words appearing in the prompt fed to the LLM;
  • Identify the three words that have the most predictive power (per the trained model);
  • Create a multi-dimensional plot in which the X, Y, and Z dimensions of a data point correspond to the counts (within the query) of the three words that have the most predictive power, and with the color of each point designating whether that query triggered the LLM to hallucinate.

Such an AI-driven visualization can help rapidly identify specific combinations of words that tend to either trigger the LLM into hallucinating or steer it away from hallucinating.

To take another example, suppose we want to use an LLM to decide when to approve a home loan based on a document summarizing a loan applicant, and we are concerned that the LLM may be inappropriately biased in which loans it suggests granting. We can use Intelligent Exploration to investigate this possible bias via the following process:

  • Create a network graph in which each node in the graph is a loan application document and the strength of the connection between two documents is grounded in the degree to which those two documents are related (e.g. the number of words or phrases that co-occur in the two documents)
  • Run a network community detection method (e.g. the Louvain method) to segment the network into disjoint communities
  • Run a statistical test to identify which (if any) of the communities have a proportion of rejected loan applications that is significantly different from that of the population as a whole
  • Read through a subset of the documents in a flagged community to identify whether the LLM is rejecting applicants in that community for illegitimate reasons. Or alternatively, if the loan application documents are augmented with other features – e.g. income, zip code, ethnicity, race or gender – then you can use further statistical tests to identify if a flagged community is disproportionately associated with a particular feature value.
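The statistical-test step in the process above can be illustrated with a minimal two-proportion z-test using only the Python standard library; the rejection counts here are hypothetical, and a production analysis might instead use a dedicated statistics package.

```python
import math

def rejection_z_test(comm_rejected, comm_total, pop_rejected, pop_total):
    """Two-proportion z-test: is a community's rejection rate
    significantly different from the overall population's?
    Returns the z statistic and a two-sided p-value."""
    p1 = comm_rejected / comm_total
    p2 = pop_rejected / pop_total
    # Pooled proportion under the null hypothesis of no difference.
    p = (comm_rejected + pop_rejected) / (comm_total + pop_total)
    se = math.sqrt(p * (1 - p) * (1 / comm_total + 1 / pop_total))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical: one detected community vs. the full applicant pool.
z, p = rejection_z_test(comm_rejected=45, comm_total=60,
                        pop_rejected=300, pop_total=1000)
print(f"z = {z:.2f}, p = {p:.4f}")  # a small p-value flags the community
```

Communities whose p-value falls below a chosen significance threshold would then be flagged for the manual review described in the final step.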

Notably, visualizing the network graph and its communities can help ground this analysis by showing which communities are closely related to one another, which in turn can help drive further analysis.

These two examples illustrate how traditional AI routines (e.g. Random Forests or the Louvain method), when combined with multi-dimensional visualization capabilities, can help identify and investigate an LLM's behavioral patterns and biases. Moreover, these processes can be run periodically to understand how the behavior and biases of a third-party LLM may be changing over time, or to compare how an LLM you are considering switching to fares against the one you use now.

LLMs can bring significant benefits when used correctly, but they can also invite large amounts of risk. It’s up to organizations to find ways, such as developing and maintaining a suite of analytical routines grounded by Intelligent Exploration, that allow them to confidently leverage LLMs to solve business problems in a responsible, informed, and ethical manner.

About the author: Dr. Sagar Indurkhya heads the NLP group at Virtualitics, Inc. He has over eight years of experience with natural language processing (NLP) and publications in top journals and conferences in the field of computational linguistics, as well as experience consulting with a number of companies that contract for the DoD. His research work has focused on high-precision semantic parsing, the development of computational models of language acquisition grounded in linguistic theory, and black-box analysis of deep neural network-based NLP systems. He holds a Ph.D. in Computer Science from the Massachusetts Institute of Technology (MIT) with a focus on Computational Linguistics, a Masters of Engineering in Electrical Engineering & Computer Science from MIT, as well as a B.S. in Computer Science and Engineering from MIT.

Related Items:

A New Era of Natural Language Search Emerges for the Enterprise

Virtualitics Takes Data Viz Tech from Stars to Wall Street

10 NLP Predictions for 2022