
This is part of an ongoing series: see first post here.
Would you want your chatbot to start discussing Taylor Swift lyrics instead of providing tech support? That’s what our chatbot did when we violated the principle above. If you want to de-Swift your application and make your AI architecture safer, keep reading. (Sorry, Taylor fans!)
Do you store your prompts with the rest of the code? Or load them from another source? Perhaps a combination of both? Below is a framework for thinking through this decision.
The first question you should ask is: is there an immediate reason for storing the prompts separately from your code? If not, leave the prompts in Git with the rest of the codebase, where they belong. This is by far the easiest and safest setup to maintain. It is the default option.
Going back to Principle #1: Prompts are Code. Storing parts of your codebase outside Git is possible and sometimes necessary, but not trivial. Do not take the decision to move prompts out lightly.
What if some of your prompts need to be edited by non-engineers? This can happen when deep domain expertise is required, or when a prompt must be modified so frequently that you can’t wait on the engineering department.
In this case, you’ll need to load the prompt at runtime from a version-controlled source. I’ve seen Confluence and Google Docs used successfully for this purpose, and many other version-controlled, API-accessible platforms are available.
When planning the prompt-loading logic, do not underestimate the effort involved in adding this integration. You’ll need to handle a variety of error conditions and scenarios to have confidence in your application. Access permissions need to be configured and maintained, and automated testing and monitoring should be extended to catch errors as early as possible.
Here are some of the scenarios you need to plan for:
- The external source is unreachable, or its credentials have expired, exactly when you need to load the prompt.
- The prompt loads but comes back empty, truncated, or otherwise malformed.
- The prompt was edited and no longer matches what the surrounding code expects, such as a missing input placeholder.
- Access permissions drift over time and silently break the integration.
All of these issues are 100% solvable. But it's easy to fall into the pattern of thinking that loading a prompt from a Google Doc is a trivial operation that won’t affect the application architecture in a deep way. As I’ve shown above, loading an external prompt is serious business to be approached with care for high-reliability applications.
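To make these failure modes concrete, here is a minimal sketch of a defensive prompt loader. The `fetch_prompt_from_confluence` helper, the cache location, and the placeholder check are illustrative assumptions, not a reference implementation; the point is that every path out of the function either returns a validated prompt or fails loudly.

```python
import logging
from pathlib import Path

# Hypothetical helper: wraps whatever Confluence/Google Docs client your team
# already uses. It is an assumption for this sketch, not a real library call.
from my_integrations import fetch_prompt_from_confluence

logger = logging.getLogger(__name__)

CACHE_DIR = Path("/var/cache/prompts")       # assumed location for last-known-good copies
REQUIRED_PLACEHOLDERS = ("{user_input}",)    # assumed formatting contract for this prompt


class PromptLoadError(RuntimeError):
    """Raised when no valid copy of a prompt can be obtained."""


def validate(prompt_text: str) -> None:
    """Reject empty prompts and prompts missing required placeholders."""
    if not prompt_text or not prompt_text.strip():
        raise PromptLoadError("prompt is empty")
    for placeholder in REQUIRED_PLACEHOLDERS:
        if placeholder not in prompt_text:
            raise PromptLoadError(f"prompt is missing placeholder {placeholder}")


def load_prompt(prompt_id: str) -> str:
    """Fetch a prompt externally, fall back to a cached copy, or fail loudly."""
    cache_file = CACHE_DIR / f"{prompt_id}.txt"
    try:
        text = fetch_prompt_from_confluence(prompt_id, timeout_seconds=5)
        validate(text)
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(text)            # refresh the last-known-good copy
        return text
    except Exception as exc:                   # network, auth, validation, ...
        logger.error("External load of prompt %s failed: %s", prompt_id, exc)
        if cache_file.exists():
            logger.warning("Falling back to cached copy of prompt %s", prompt_id)
            return cache_file.read_text()
        # No valid prompt anywhere: fail loudly rather than run without instructions.
        raise PromptLoadError(f"no valid copy of prompt '{prompt_id}'") from exc
```

Whether you prefer the cached-copy fallback or an outright failure is an application-level decision; for a high-reliability app, either is better than silently running without a prompt.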
Loading prompts from a source that is not version-controlled is a bad idea, and you will regret it. The source of truth for your prompts must be version-controlled and have proper API and access controls. This is not an area to cut corners.
The hybrid approach combines storing some prompts directly within your codebase and loading others from external, version-controlled sources. While maintaining a unified location for all prompts is often simpler and more reliable, there are scenarios where a hybrid strategy can offer advantages.
Consider adopting a hybrid approach under conditions such as:
- Only a handful of prompts need frequent editing by non-engineers, while the rest change only alongside code.
- A few classification-style prompts genuinely benefit from an external evaluation workflow, but most do not.
- Safety-critical prompts, such as the guardrails discussed below, must stay in Git regardless of where the rest live.
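As a rough illustration of the hybrid setup, a small registry can make the split explicit in code. The prompt names and the in-code/external assignments below are assumptions for this sketch, and `load_prompt` is the defensive loader sketched above.

```python
from enum import Enum


class PromptSource(Enum):
    IN_CODE = "in_code"       # stored in Git alongside the application code
    EXTERNAL = "external"     # loaded at runtime from a version-controlled doc


# Illustrative split: which prompts live where is a per-application decision.
PROMPT_REGISTRY = {
    "guardrail": PromptSource.IN_CODE,      # safety-critical, stays in Git (see below)
    "system": PromptSource.IN_CODE,         # changes only with code changes
    "billing_faq": PromptSource.EXTERNAL,   # edited weekly by domain experts
}

IN_CODE_PROMPTS = {
    "guardrail": "Reject any response that is off-topic, unsafe, or non-compliant ...",
    "system": "You are a tech support assistant for ...",
}


def get_prompt(name: str) -> str:
    """Resolve a prompt from Git or from the external source, per the registry."""
    if PROMPT_REGISTRY[name] is PromptSource.IN_CODE:
        return IN_CODE_PROMPTS[name]
    return load_prompt(name)   # the defensive loader sketched above
```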
Guardrail prompts (also known as censor prompts) are specialized prompts that screen responses before they reach users, ensuring outputs are appropriate, safe, and compliant. Guardrails serve as a protective mechanism, particularly in applications where user interactions carry significant legal or ethical risks. They provide a second line of defense, catching inappropriate outputs that slip through the primary prompt.
Do not load guardrail prompts from an external doc; this adds significant, unnecessary risk. Either keep them in Git with your code or use a dedicated third-party tool, such as Fiddle Guardrails. Guardrail logic doesn’t change very often, so this approach won’t slow you down much.
Using guardrails is a principle of its own, to be discussed in much more detail in a future post. It's a great pattern that improves the safety of your application and helps you sleep better at night. Just don’t load them from Google Docs.
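For concreteness, here is a minimal sketch of a guardrail check with the prompt kept in Git. `call_llm` is a hypothetical wrapper around whatever chat-completion client the application already uses; the fail-closed behavior is the important part.

```python
# GUARDRAIL_PROMPT lives in Git with the rest of the code.
GUARDRAIL_PROMPT = """You are a safety reviewer for a tech support assistant.
Reply APPROVE if the response below is on-topic, safe, and compliant.
Reply REJECT otherwise.

Response to review:
{candidate_response}
"""

FALLBACK_RESPONSE = "Sorry, I can't help with that. Please contact human support."


def screen_response(candidate_response: str) -> str:
    """Second line of defense: block anything the guardrail does not explicitly approve."""
    # call_llm is a hypothetical wrapper around your existing LLM client.
    verdict = call_llm(GUARDRAIL_PROMPT.format(candidate_response=candidate_response))
    if verdict.strip().upper().startswith("APPROVE"):
        return candidate_response
    # Fail closed: an unexpected or missing verdict blocks the response.
    return FALLBACK_RESPONSE
```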
Teams often load prompts externally to integrate them with evaluation engines such as MLflow. The underlying assumption is that prompts are similar to ML models and need a detached, statistical assessment: you plug in a prompt, measure the F1 score of the output (or whatever metric you prefer), and iterate.
This approach is sometimes valid, for instance for classification prompts designed to behave like ML models. But most prompts are fundamentally different: as outlined in Principle #1, LLM prompts are code. Typical prompts are closer to application logic than to ML models, and they are better suited to pass-fail evaluation together with the surrounding code than to statistical evaluation.
External evaluation engines will not help you with most prompts. Instead, you should use automated AI-driven tests, similar to traditional unit tests. These are going to be the focus of subsequent posts.
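As a rough preview, such a test can look very much like an ordinary unit test. `build_support_reply` is a hypothetical application helper that formats the prompt and calls the LLM; the assertions encode expected behavior as pass-fail checks rather than a metric.

```python
import pytest

# build_support_reply is a hypothetical helper that formats the prompt,
# calls the LLM, and returns the reply text.
from my_app import build_support_reply

OFF_TOPIC_MARKERS = ("taylor swift", "lyrics", "song")


@pytest.mark.parametrize("user_input", [
    "My laptop won't turn on.",
    "How do I reset my password?",
])
def test_support_prompt_stays_on_topic(user_input):
    reply = build_support_reply(user_input)
    assert reply.strip(), "reply should not be empty"
    lowered = reply.lower()
    for marker in OFF_TOPIC_MARKERS:
        assert marker not in lowered, f"reply drifted off-topic: {reply[:200]}"
```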
Consider the following practices:
- Evaluate most prompts with pass-fail, AI-driven tests that run alongside your regular test suite.
- Reserve statistical, ML-style evaluation for the few prompts that genuinely behave like models, such as classifiers.
- Don’t reorganize where your prompts live just to feed an evaluation tool; integrate the tool with the few prompts that need it.
The central issue with loading prompts is availability: what do you do when a prompt doesn’t load when you expect it to?
This is what happened to us in the Taylor Swift example. Because of a Confluence credentials issue, none of the prompts for a tech support app loaded, including the guardrail prompt. This somehow didn’t trigger any runtime errors, and the bot began responding without any instructions or input (since the input formatting string was part of the prompt). And what does OpenAI’s LLM want to talk about in the absence of input? It turns out: the lyrics to 'I Want to Break Free' by Queen and various Taylor Swift songs. Fortunately, this was caught and fixed almost immediately, and users enjoyed the music discussion, or at least that’s what I tell myself.
Why did this incident occur? Two mistakes were made:
- The guardrail prompt was loaded from Confluence along with everything else, instead of living in Git with the code.
- There was no check that the prompts had actually loaded and were valid, so the application deployed and ran without them instead of failing fast.
After the incident, the guardrail prompt was moved back into Git, and exception logic was added to block deployment if a prompt failed to load or was invalid. You can save yourself a postmortem by following these recommendations proactively.
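Here is a minimal sketch of that kind of exception logic, reusing the hypothetical helpers from the earlier sketches: validate every required prompt at startup and refuse to start if any of them fails.

```python
import logging
import sys

logger = logging.getLogger(__name__)

REQUIRED_PROMPTS = ("guardrail", "system", "billing_faq")   # assumed prompt names


def check_prompts_or_abort() -> None:
    """Refuse to start (and therefore to deploy) if any prompt is missing or invalid."""
    failures = []
    for name in REQUIRED_PROMPTS:
        try:
            validate(get_prompt(name))   # helpers from the earlier sketches
        except Exception as exc:
            failures.append(f"{name}: {exc}")
    if failures:
        logger.critical("Prompt validation failed:\n%s", "\n".join(failures))
        sys.exit(1)   # the orchestrator keeps the previous version serving traffic


if __name__ == "__main__":
    check_prompts_or_abort()
```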
In this post, I examined key considerations around prompt storage and loading within AI applications. The default practice is to store your prompts alongside your code in version-controlled repositories. Only deviate from this when there's a compelling reason, such as frequent editing by non-engineers or specific evaluation requirements.
When prompts must be loaded externally, choose reliable, strictly version-controlled sources, and add testing and monitoring for resilience. Guardrail prompts, given their critical role in application safety, should remain in your codebase to avoid severe reliability risks.
Most prompts are closer in nature to code than to ML models, so only use ML-style tools where you actually need them. Don’t store all of your prompts externally just to simplify integration with an evaluation tool for a few of them.
If you enjoyed this post, follow the series for more insights.