Data and Algorithms: The Building Blocks of Artificial Intelligence
At its core, Artificial Intelligence (AI) is the product of two components: data and algorithms. Algorithms come in many varieties, each with its own level of complexity. Among them, neural networks stand out as multi-layered constructs that mimic the human approach to problem-solving.
Data, meanwhile, is the fuel of Artificial Intelligence. AI needs vast amounts of it before it can generate usable insights. Large Language Models (LLMs) are a subset of AI in which the algorithm learns from tremendous amounts of diverse data to generate new multimodal content, including text, images, audio, video, code, and 3D assets; hence the name generative AI. Without the algorithm, big data is just noise, and without data, the algorithm is irrelevant.
What we experience from generative AI feels like magic, but in reality it is a sophisticated system that reflects the data it was given. In the early days of AI, everyone built their own models from scratch, an approach that was expensive and time-consuming. Now, large companies produce Foundation Models that other companies can use as a base, supplying their own data to supplement the model and tailor its responses from generic output to proprietary results. For organizations looking to employ AI, this makes data management more vital than algorithm development.
AI Model Consideration
The following is a brief guide to identifying, selecting, and acquiring an LLM. You do not have to settle on only one, but this article aims to simplify the search process and add objectivity to the decision.
- Model Selection: As mentioned above, big companies are developing base models available for public use. These include the well-known ChatGPT by OpenAI, OPT by Meta, AlexaTM by Amazon, and CodeGen by Salesforce, to name a few. Many others are open source, free to download, and can be hosted in local environments; Hugging Face is a good repository for finding open-source models. Organizations should select a couple of models to test and validate against their needs and data availability.
- Model Validation: Now that some models have been identified, an important next step is to ensure they meet business and regulatory policies. Part of validation is understanding the licensing agreements: licensing terms differ depending on whether the model is intended for commercial or private use.
- Model Size: When models are released, they typically come in several sizes (the number of parameters used to train the model), and there are tradeoffs to consider. Smaller models (7 billion parameters and below) take less space on disk and can often provide faster response times, but lack the accuracy a larger model can provide. Large models (typically 60 billion parameters and above) provide the most accurate answers but require much larger compute capability, which affects hardware decisions: the larger the model, the more GPUs you will typically need to run it. Medium-sized models provide a nice middle ground for many enterprises.
- Model Training: After the model is selected and validated, the fun part starts: it is time to train the model with your own data. Data should be split into training, validation, and testing sets. The training set builds the initial knowledge base. The validation set is for fine-tuning and optimizing performance via hyper-parameters, the settings that control how the model learns. The testing set contains data the model was never trained on; it ensures the model generalizes well to unseen data and avoids overfitting, where a model performs well on training data but fails to generalize to new inputs because it has become too specific. To train a model on an organization's own proprietary data, libraries like LlamaIndex and LangChain can help. Both are good for ingesting, indexing, and querying data. LangChain offers a few additional features: chains, agents, and tools. Chains allow the response from one prompt to be used as input to the next, which makes the LLM experience conversational. Agents are chains that can decide on next steps autonomously, and tools are the interfaces agents use to carry out those autonomous queries.
- Model Evaluation: Model performance should be tracked against pre-set, measurable results and metrics designed to reflect business value and benefits. Common evaluation metrics include perplexity, which measures the model’s ability to predict the next word in a sequence, and BLEU score, which evaluates the quality of generated text compared to human references. Human evaluation, through expert reviewers or crowdsourcing, is also vital to gauge the model’s overall language understanding and coherence. Additionally, examining the model’s behavior on various prompts and assessing potential biases are essential steps in comprehensive LLM evaluation.
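The size tradeoff above can be made concrete with a back-of-the-envelope estimate of how parameter count translates into GPU memory. This is an illustrative sketch only: it counts just the weights and ignores activations, caches, and runtime overhead, and the byte counts are typical conventions rather than vendor figures.

```python
# Rough, illustrative estimate of the memory needed just to hold a model's
# weights (excludes activations, KV cache, and framework overhead).
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """bytes_per_param: 2 for fp16/bf16, 4 for fp32, 1 for 8-bit quantized."""
    return params_billion * bytes_per_param  # (params_billion * 1e9 * bytes) / 1e9

for size in (7, 13, 70):
    print(f"{size}B params at fp16 needs roughly {weight_memory_gb(size):.0f} GB of weights")
```

A 7-billion-parameter model at 16-bit precision needs roughly 14 GB for weights alone, which is why small models can fit on a single GPU while the largest models require a multi-GPU cluster.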
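The training/validation/testing split described under Model Training can be sketched in a few lines. The 80/10/10 ratio below is a common convention, not a requirement:

```python
import random

def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle and partition records into training, validation, and testing sets."""
    shuffled = records[:]                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (shuffled[:n_train],                 # training: builds the knowledge base
            shuffled[n_train:n_train + n_val],  # validation: tunes hyper-parameters
            shuffled[n_train + n_val:])         # testing: checks generalization

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```

Because the test set is held out entirely, a large gap between training and test performance is the overfitting signal the article describes.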
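Of the evaluation metrics mentioned, perplexity is the easiest to compute directly: it is the exponential of the average negative log-probability the model assigns to each next token. A minimal sketch, assuming you already have per-token probabilities from a model:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of each predicted token).
    Lower is better: 1.0 means the model predicted every token with certainty."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([1.0, 1.0, 1.0]))      # perfect prediction
print(perplexity([0.25, 0.25, 0.25]))   # as uncertain as a 4-way guess
```

BLEU and human evaluation require reference texts and reviewers, so they are harder to automate; perplexity is often the first number tracked during fine-tuning.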
AI Data Configuration
With the AI model in place, expanded understanding, topic coverage, and overall intelligence come from the data it consumes. The model may ship with public, generic data, which yields generic decisions and results. To tailor an AI’s knowledge repository, and in turn its decisioning, to an organization’s unique capabilities and positioning, the AI must be trained on proprietary, home-grown data. The following is a brief guide to preparing data for AI consumption and training:
- Business Direction and Buy-in: This is not a technical step, and it may sound cliché, but it is a must-have first step. The abundance of data can be overwhelming, and consuming data for the sake of consuming data is not a solution. There must be a clear, shared opportunity to seize or problem to solve. From there, a focus and collaboration on what to measure, capture, and collect can be established.
- Data Collection: In most cases, the data already exists within the organization. Some may read this as a task to gather new data from customers or operations, but odds are there is sufficient data already in contracts, orders, plans, memos, products, databases, and more. It may not all be sitting in one place, though; it may have to be gathered from different departments, systems, or partners. Not all available data is needed; a focused, representative set of records is enough to get started.
- Data Preprocessing: AI models are very forgiving regarding the type and format of data: it can be text or non-text, structured or unstructured. Cleaning the data and removing inconsistencies are part of preprocessing. For supervised machine learning, labeling and codifying are necessary; some models support unsupervised learning, which is less demanding at this stage. The most important aspect of this step is removing personally identifiable information (PII). The data must be anonymized and must follow privacy policies and protocols.
- Feature Engineering: This includes creating features based on things such as patient demographics, medical history, clinical notes, lab results, or other pertinent information. Extracting relevant features from the preprocessed data helps improve predictive performance. Effective feature engineering can significantly improve a model’s ability to learn and generalize, leading to better predictions and insights. However, it requires domain expertise, creativity, and careful consideration of the problem and dataset to achieve desired results.
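The PII-removal step under Data Preprocessing can begin with simple pattern matching. The patterns below are illustrative and nowhere near exhaustive; real anonymization should use a vetted tool and a privacy review rather than a handful of regular expressions:

```python
import re

# Illustrative patterns only: real PII detection needs far broader coverage
# (names, addresses, dates of birth) and a vetted anonymization tool.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace matches of each pattern with a bracketed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_pii("Contact jane.doe@example.com or 555-867-5309, SSN 123-45-6789."))
```

Running redaction before any data leaves a department keeps raw identifiers out of the training corpus entirely, which is safer than trying to scrub a trained model afterward.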
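Feature engineering can be as simple as deriving new columns from raw records. A sketch using hypothetical patient-style fields; the field names, buckets, and thresholds are invented for illustration:

```python
from datetime import date

def engineer_features(record: dict, today: date = date(2024, 1, 1)) -> dict:
    """Derive model-ready features from a raw record.
    Field names and thresholds here are hypothetical examples."""
    age = today.year - record["birth_year"]
    return {
        "age": age,
        "age_bucket": "senior" if age >= 65 else "adult" if age >= 18 else "minor",
        "note_length": len(record.get("clinical_note", "")),  # crude text feature
        "num_prior_visits": len(record.get("visits", [])),
    }

features = engineer_features(
    {"birth_year": 1950, "clinical_note": "Stable, no complaints.", "visits": [1, 2, 3]}
)
print(features)
```

Which features matter is exactly where the domain expertise mentioned above comes in: a clinician would know whether an age bucket or a visit count is actually predictive for the problem at hand.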
We’ve all been collecting data for decades – data science as a practice grew from the need to create insight out of raw data. The progress of machine learning and artificial intelligence brings new meaning, role, and utility for data. Keeping data governed, observable, and discoverable is more important now than ever. It is valuable to think of data as a product to grow, nurture, and evolve. Its marriage with the right model or algorithm will bring true intelligence to organizations.
About the authors: Raghid El-Yafouri is a digital transformation strategist and technical consultant at Bottle Rocket Studios, assisting brands with their digital transformation, expansion strategies, and infrastructure optimization. You can find him at the intersection of technology and organizational efficiency. He strives for constant innovation that amplifies human intelligence rather than replacing it. As a MarTech veteran and HealthTech enthusiast, he believes that physicians and patients are underserved by what technology can offer in health informatics and medical records.
David Lance is a Solutions Architect at Bottle Rocket. With a career spanning over 20 years in the computer software industry, David is an expert in Enterprise Architecture, Software Architecture, Requirements Analysis, Strategic Planning, Agile Methodologies, and Databases. Most recently, he has been leading Bottle Rocket’s AI offerings.