This article marks the start of a series of in-depth pieces by Giuseppe Ciuni on the functional and technical aspects – explained in simple terms – of how artificial intelligence works. The author is an expert in digital technologies, a software developer and systems analyst; his organisation is based in southern Sicily and works with clients from all over the world, and here on the pages of Startupbusiness he shares his knowledge of the systems we commonly refer to as artificial intelligence.
What exactly are LLMs?
Over the past eighteen months, the adoption of AI in companies has moved from a trial phase to full-scale operational use.
Those working in the sector have gone through some very familiar stages. First came amazement: “What on earth is this? It’s like magic!” Then came the practical question: “Is it possible to harness all this power and integrate it into my business or my product?”
Since then, start-ups and businesses in general have begun to integrate AI – specifically large language models (LLMs) – into their products and processes.
Some integrations deliver tangible, measurable results; others do not. There is often just one deciding factor: a clear understanding of what an LLM actually does.
Neither a database nor a search engine
An LLM does not retrieve information from a repository, index documents like Elasticsearch, or query a database like PostgreSQL. What it does is rather more unusual, and understanding this completely changes the way we design AI systems. Let’s take a closer look.
An LLM is a statistical text generator
During pre-training, the model is exposed to trillions of tokens extracted from the web (web pages, books, code, forums, etc.) and trained to do just one thing: predict which token comes next given a context. Nothing else.
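To make "predict which token comes next given a context" concrete, here is a minimal sketch. It is a toy bigram counter, not how a Transformer is trained: a real LLM learns the distribution with a neural network and conditions on a much longer context. The corpus and the function names are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for every token, which tokens follow it in the training text."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts

def next_token_probs(counts, context_token):
    """Turn the raw follow-up counts into a probability distribution."""
    followers = counts[context_token]
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()}

# Invented toy corpus, split on whitespace instead of using a real tokeniser.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
probs = next_token_probs(model, "the")
print(probs)  # "cat" is twice as likely as "mat" after "the"
```

The model never looks anything up: it only reports which continuation was most frequent in the text it saw, which is the behaviour the rest of this article builds on.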
Andrej Karpathy, former head of AI at Tesla and one of the most influential researchers in the field, describes the resulting base model as a system that does not answer questions but completes text sequences in a way that is statistically consistent with what it observed during training.
This distinction has direct implications for what one should expect from an LLM, and in particular from an LLM applied to a business context.
Microgpt, a GPT in 200 lines
In February 2026, Karpathy released Microgpt, a 200-line Python script with zero dependencies that implements the entire training and inference cycle of a GPT.
The project includes datasets, a tokeniser, a custom-built autograd engine, a simplified GPT-2 architecture, an Adam optimiser and a training loop.
The model is trained on 32,000 proper names and learns to generate new ones that are statistically plausible.
Everything you need to understand an LLM can be read in an afternoon and consists of the following parts:
Dataset + tokenisation + Transformer neural network + backpropagation + optimisation.
It should be noted that Microgpt and GPT-4 differ by many orders of magnitude in parameters, data and computational operations, but the underlying algorithm is identical.
As you can see, an LLM isn’t magic: it’s applied mathematics on an industrial scale.
LLM: the tokeniser
“Raw” text cannot be fed directly into a neural network, as neural networks only process numbers; it needs to be converted. This conversion is carried out by the tokeniser.
The most obvious approach to conversion would be to process it character by character (“hello” becomes [h, e, l, l, o]) or word by word.
However, both options have their drawbacks:
- characters produce sequences that are too long and lose the structure of the words;
- words create vast vocabularies and still handle rare words or unfamiliar technical terms poorly.
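The trade-off between the two options can be seen on a single sentence. This is a minimal sketch: the sentence and the unseen word are arbitrary examples, and splitting on spaces stands in for a real word-level tokeniser.

```python
sentence = "the cat sat on the mat"

# Character-level: tiny vocabulary, but a long sequence.
char_tokens = list(sentence)
char_vocab = sorted(set(char_tokens))

# Word-level: short sequence, but the vocabulary grows with every new word.
word_tokens = sentence.split()
word_vocab = {word: idx for idx, word in enumerate(sorted(set(word_tokens)))}

# A word that was not in the text used to build the vocabulary has no id.
unknown_id = word_vocab.get("pgvector")

print(len(char_tokens), len(char_vocab))  # sequence length vs vocabulary size
print(len(word_tokens), len(word_vocab))
print(unknown_id)
```

The character version needs several times more tokens for the same sentence, while the word version cannot represent a term it has never seen.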
We need a better solution that preserves the structure and compresses words intelligently, whilst keeping the size of the vocabulary under control. The byte-pair encoding (BPE) algorithm comes to our aid.
Here is an example of how the BPE algorithm works:
- it starts with the individual characters in the text;
- it iteratively finds the pair of adjacent symbols that appears most frequently in the text and merges it into a single token, for example:
“t” followed by “h” is very common in English → “th” becomes a token;
“th” followed by “e” is also very common → “the” becomes a token;
- the process is repeated until the desired vocabulary size is reached.
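The steps above can be sketched in a few lines of Python. This is a simplified, character-based illustration written for this article: production tokenisers work on bytes, handle whitespace and special tokens differently, and are far more optimised. The function names and the sample text are assumptions.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(text, num_merges):
    """Start from characters and apply `num_merges` merges of the top pair."""
    words = Counter(tuple(word) for word in text.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

merges, vocab = learn_bpe("the the the then this", num_merges=2)
print(merges)  # first ("t", "h"), then ("th", "e")
print(vocab)   # "the" is now one token; "this" is still split into pieces
```

After two merges the frequent word "the" has become a single token, while the rarer "this" remains split, which is the behaviour described next.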
Very common words (such as “the”, “home” and “cat”) end up as single tokens. Rare words, on the other hand, are split into several tokens.
This token-based representation has practical implications: the model does not ‘see’ words in the same way we do; certain seemingly trivial errors (such as miscounting the letters in a word) stem directly from this level of representation, not from the model’s capabilities.
Note: The most well-known LLMs (ChatGPT, Claude, Gemini, etc.) are trained on predominantly English-language datasets; for the same meaning and text length, Italian uses significantly more tokens than English (between 30% and 50% more).
It is therefore best to write the prompts in English.
A map of an LLM’s actual capabilities (what it does well)
LLMs are capable of generating text, processing data and performing many other tasks. However, we need to understand how to assign a quality metric (a form of ROI) to the results obtained.
Here are three examples where LLMs perform well in a business context:
Converting unstructured text into a structured format
Given a document, a contract, an email, etc., you want to extract structured data (such as JSON or a table). The model works well because it doesn’t need to ‘know’ anything beyond what is already in the text you provide: it reads what’s there, recognises patterns, and organises them into the format you’ve requested. This is by far the most reliable use case.
Draft generation
The model generates a first draft of a document (such as a technical specification or a business email) based on the structured input provided.
The result is not the final output but a draft that a human will need to review. The value of generating drafts lies in reducing the time required for human review from hours to minutes.
The essential requirement is that the process explicitly provides for human review, not as an option but as a mandatory step.
RAG
When you query an LLM about proprietary company data, for example, you are highly likely to receive an incorrect or ‘hallucinated’ response. The cause is not the quality of the model being used, but the fact that the requested data was never part of what the model saw during pre-training.
RAG tackles this problem at source: rather than asking the model to remember something – which it cannot possibly know – it is provided with the information at the very moment it is needed.
The mechanism is simple: the system searches the company’s knowledge base for documents relevant to the query and places them within the context of the conversation, and the model responds based solely on that material. The practical result is that hallucinations regarding proprietary domains are drastically reduced.
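The retrieve-then-answer mechanism can be sketched as follows. Everything here is a stand-in: the bag-of-words `embed` function replaces a real embedding model, the in-memory list replaces a vector database, the knowledge-base sentences are invented, and the final call to the model is left out so that only the retrieval and prompt assembly are shown.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, passages):
    """Place the retrieved passages in the context and restrict the answer to them."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

# Invented knowledge base for the example.
knowledge_base = [
    "Refunds are processed within 14 days of receiving the returned item.",
    "The warehouse ships orders Monday to Friday.",
    "Support is available by email from 9:00 to 18:00 CET.",
]
question = "How many days does a refund take?"
passages = retrieve(question, knowledge_base)
prompt = build_prompt(question, passages)
print(prompt)
```

The resulting prompt is what would be sent to the model: the answer is already on the page, so the model only has to read it and phrase it.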
Potential mistakes that could blow the budget
The model used as an oracle
If you ask a model for information about internal company data (for example: up-to-date prices, proprietary technical specifications, stock availability), the model does not have access to this information.
During pre-training, it read billions of documents from the web: everything it learnt was compressed into its parameters, which remain fixed after training. It doesn’t know what happened yesterday, it isn’t familiar with the company’s data, and it doesn’t have access to your customer database.
The response it produces will be linguistically fluent and confident in tone (because the model has learnt to complete text plausibly, as Karpathy describes), but the content will be generated statistically, not retrieved from a real source.
This is how hallucinations work: it is not a bug, but the expected behaviour of a system that produces the statistically most plausible outcome even in the absence of actual information.
The solution is not to look for an ever more accurate model, but to design a sound architecture, as mentioned earlier.
Two established and commonly used approaches are as follows:
- RAG: rather than asking the model to recall information, the model is provided with the information at the moment it needs it. The system retrieves the relevant documents from a knowledge base and places them within the context of the conversation, and the model responds based solely on that material;
- tool use: the model is equipped with tools it can call upon during generation: an API call to the back-end system, a database query, or a search. When a query about stock availability arrives, the model doesn’t ‘think’ about it; it calls the tool, receives the actual response and incorporates it.
In both cases, the model does what it does best: reasoning about the text provided, structuring a response and extracting relevant information without being forced to invent what it doesn’t know.
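The tool-calling side can be sketched as a small dispatcher. The JSON shape of the tool call, the `get_stock` tool and the stock table are all assumptions made for this example; each model provider defines its own format for tool calls, and a real tool would query the actual back-end system.

```python
import json

# Hypothetical in-memory stock table standing in for a real back-end system.
STOCK = {"SKU-1001": 12, "SKU-2002": 0}

def get_stock(sku):
    """Tool: return the real quantity on hand, or an error for unknown SKUs."""
    if sku not in STOCK:
        return {"error": f"unknown sku {sku}"}
    return {"sku": sku, "quantity": STOCK[sku]}

# Only tools listed here can ever be executed.
TOOLS = {"get_stock": get_stock}

def run_tool_call(raw_call):
    """Parse a tool call emitted by the model and execute it if it is allowed."""
    call = json.loads(raw_call)
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        return {"error": f"unknown tool {name}"}
    return TOOLS[name](**args)

# Hard-coded string standing in for what the model would emit.
model_output = '{"name": "get_stock", "arguments": {"sku": "SKU-1001"}}'
result = run_tool_call(model_output)
print(result)
```

The quantity in the reply comes from the stock table, not from the model's parameters; the result is then handed back to the model so it can phrase the final answer.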
Prompt without architecture
A prompt that works when tried by hand isn’t an integration; it’s just an experiment.
Deploying that prompt into production requires the same approach as with any software system: input validation, parsing and output validation, error handling, logging, and testing. Without a structured pipeline, the results remain potentially inconsistent and difficult to measure.
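A minimal sketch of such a pipeline is shown below. The field names, the retry policy and the `call_model` parameter are assumptions: `call_model` is a hypothetical function that sends text to whichever model is in use and returns its raw reply as a string.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_pipeline")

# Hypothetical output contract: field name -> accepted Python types.
REQUIRED_FIELDS = {"customer": str, "amount": (int, float), "currency": str}

class PipelineError(Exception):
    """Raised when the input or the model output fails validation."""

def validate_input(text, max_chars=20_000):
    """Reject empty or oversized input before spending money on a model call."""
    if not text or not text.strip():
        raise PipelineError("empty input")
    if len(text) > max_chars:
        raise PipelineError("input too long")
    return text.strip()

def parse_output(raw):
    """Parse the model reply as JSON and check it against the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise PipelineError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise PipelineError("output is not a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            raise PipelineError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise PipelineError(f"wrong type for field: {field}")
    return data

def run(text, call_model, retries=2):
    """Validate input, call the model, validate output, retry on bad output."""
    clean = validate_input(text)
    for attempt in range(1, retries + 2):
        raw = call_model(clean)
        try:
            result = parse_output(raw)
            log.info("attempt %d succeeded", attempt)
            return result
        except PipelineError as exc:
            log.warning("attempt %d failed: %s", attempt, exc)
    raise PipelineError("model output failed validation on every attempt")
```

Because `call_model` is passed in, the same pipeline can be tested with a fake function that returns canned replies, which makes failures reproducible and measurable.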
Choosing the wrong model for the task
Not all tasks require the most powerful and expensive models. For classification and structured extraction from short text, for example, there is no need to use the best LLM available; instead, locally run models can produce results comparable to the current top models at a fraction of the cost.
This approach has three advantages:
- lower costs
- privacy
- GDPR compliance
Three practical and replicable implementations using an LLM
Use case 1: automated helpdesk based on a proprietary knowledge base
What you’ll need:
- a small, low-cost cloud model, e.g. Claude Haiku or GPT-4o mini;
- a RAG layer built on a vector database (Qdrant or pgvector on an existing Postgres instance), integrated into the existing support channel.
The measurable outcome is a reduction in the number of enquiries received by human operators regarding questions that have already been answered in the documentation. The ROI is calculated in man-hours saved per ticket. Realistic time to production: three to four weeks with a team that is already familiar with the infrastructure.
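The ROI calculation mentioned above is simple arithmetic, sketched here. Every number in the example call is hypothetical and chosen only to make the calculation easy to follow; none of them comes from a real deployment.

```python
def helpdesk_roi(tickets_per_month, deflection_rate, minutes_per_ticket,
                 hourly_cost, monthly_running_cost):
    """Estimate monthly hours and money saved by tickets the assistant resolves."""
    deflected = tickets_per_month * deflection_rate
    hours_saved = deflected * minutes_per_ticket / 60
    gross_saving = hours_saved * hourly_cost
    return {
        "deflected_tickets": deflected,
        "hours_saved": hours_saved,
        "net_saving": gross_saving - monthly_running_cost,
    }

# Hypothetical inputs: replace them with figures measured on your own helpdesk.
estimate = helpdesk_roi(
    tickets_per_month=1000,
    deflection_rate=0.25,       # share of tickets answered without an operator
    minutes_per_ticket=12,      # operator time an average ticket takes
    hourly_cost=30.0,           # fully loaded cost of one operator hour
    monthly_running_cost=400.0, # inference, hosting and maintenance
)
print(estimate)
```

The useful part is not the result but the inputs: if the deflection rate or the time per ticket cannot be measured, the ROI cannot be either.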
Use case 2: contract analysis pipeline
An in-house legal department or a law firm receives contracts to review.
What the pipeline needs to do:
- extract specific clauses (penalty clauses, automatic renewal, limitations of liability, jurisdiction) and structure them into validated JSON;
- produce a summary report.
This is a task where current models perform reliably provided the prompt is well-crafted and the output is validated against a schema. The cost of inference per document is in the region of a few pence. The time saved per document is equivalent to around half an hour of skilled labour.
Use case 3: automatic classification and routing of incoming documents
A procurement department or any business function that receives large volumes of diverse documents (e.g. invoices, orders, etc.) can set up a workflow that classifies each incoming document, extracts the relevant fields and routes it to the correct process without human intervention.
What you’ll need:
- a cost-effective LLM for classification and extraction (e.g. GPT-4o mini, Claude Haiku)
- an output validation layer that checks each result against a JSON schema
- integration with the existing business management or document management system
The cost per document is in the region of a few pence.
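The routing step can be sketched as a lookup on the label returned by the model. The queue names, the reply format and the manual-review fallback are assumptions made for this example: the fallback is a design choice for replies that cannot be trusted, so that a malformed or unexpected answer never reaches a business process.

```python
import json

# Hypothetical queue names; a real system would map labels to its own processes.
ROUTES = {"invoice": "accounts_payable", "order": "order_management"}
FALLBACK_QUEUE = "manual_review"

def route_document(model_reply):
    """Pick a destination queue from the model's JSON classification."""
    try:
        data = json.loads(model_reply)
        label = str(data["type"])
    except (json.JSONDecodeError, KeyError, TypeError):
        # Not JSON, not an object, or no "type" field: do not guess.
        return FALLBACK_QUEUE
    return ROUTES.get(label, FALLBACK_QUEUE)

print(route_document('{"type": "invoice"}'))
print(route_document('{"type": "poem"}'))
print(route_document("sorry, I cannot classify this"))
```

Only labels that appear in `ROUTES` trigger an automatic process, so adding a new document type is a one-line change rather than a new prompt.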
Before making any investment, ask yourself three questions
- What is the specific task you wish to tackle, with precisely defined inputs and outputs?
- Can the output be verified automatically, or does it require human review?
- Does the value generated justify the cost of the model, as well as the costs of integration and maintenance over time?
Anyone who has clear answers to these three questions has already identified a potentially sound use case.
The next instalment of ‘Inside the machine’ will cover autograd, i.e. the mechanism a neural network uses to learn from its errors, the concept of the gradient, the base model and the instruct model. (Photo by Igor Omilaev on Unsplash)
ALL RIGHTS RESERVED ©
