
The NonBioS LLM Observability Pick

4 days ago

4 min read


When you start running inferencing at scale (and at NonBioS, we do a lot of it), the immediate problem is analyzing individual inferencing calls. The first solution is to grep your log files for whatever you are looking for. This quickly becomes a chore, and longer inferencing sessions are harder to trace. The next obvious solution is to put all calls in a database and comb through them with SQL. Though viable, this lacks specialization for inference-specific data and workflows. The SaaS option that solves this problem is what is called LLM Observability.


LLM Observability software allows you to understand how your AI application interacts with the underlying inferencing system, which can include one or more large language models. The primary objective is to map application behavior to specific LLM calls for analysis and optimization over time. LLM Observability software additionally helps you track and understand response times (how quickly the model generates answers), token usage (how much computational effort/cost each response requires), and error rates (when and why the model might fail).
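Even before adopting a dedicated tool, the core idea can be illustrated with a thin wrapper around an inference call. The sketch below is a hypothetical helper of our own, not how any particular observability product works internally; it assumes an OpenAI-compatible Python client and simply logs latency, token usage, and errors.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm-observability")

def observed_call(client, messages, model="gpt-4o"):
    """Run one inference call and record latency, token usage, and errors.

    `client` is assumed to be an OpenAI-compatible SDK client; the usage
    field names below follow that convention and may differ elsewhere.
    """
    start = time.monotonic()
    try:
        response = client.chat.completions.create(model=model, messages=messages)
    except Exception as exc:
        log.error("llm_call failed model=%s error=%s", model, exc)
        raise
    latency = time.monotonic() - start
    usage = response.usage
    log.info(
        "llm_call model=%s latency=%.2fs prompt_tokens=%d completion_tokens=%d",
        model, latency, usage.prompt_tokens, usage.completion_tokens,
    )
    return response
```

An observability platform does essentially this for every call, but stores the records centrally and lets you slice them by model, prompt, session, and error rate.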


Advanced LLM Observability software features include tracking user feedback for LLM call sequences or sessions, and replaying sessions with updated models or prompts to identify improvements. In LLM observability, sessions refer to logical groupings of related inference calls: for example, a user conversation with multiple turns or a coding assistant helping through multiple iterations.


Prompt management is also now offered by various LLM observability SaaS products. This is useful because prompts evolve rapidly in AI applications: they can be iterated on and optimized through a user interface while you analyze historical sessions.


At NonBioS, we conducted an exhaustive study of various LLM Observability software providers. There are a good dozen offerings out there, including Langsmith, Lunary, Phoenix Arize, Portkey, Datadog, Langfuse, and Helicone.


Langsmith appeared popular, but we had encountered challenges with Langchain from the same company, finding it overly complex for previous NonBioS tooling. We rewrote our systems to remove dependencies on Langchain and chose not to proceed with Langsmith as it seemed strongly coupled with Langchain.


We looked at Phoenix Arize very briefly, but the naming was a bit confusing. Is it Phoenix, or is it Arize? The website also seemed challenging from a usability perspective, and we were left wondering if that would translate to the tool as well.


We narrowed down our choices to Langfuse and Helicone. Both appeared solid, but we ultimately chose Langfuse. The key considerations that drove us to choose Langfuse are as follows:


Firstly, we were looking for mature, stable software. A 5-minute run of the NonBioS AI Engineer can consume over a million tokens across a hundred inferencing calls. We needed stable software to handle this volume of data. We prioritized stability over the latest features.


Secondly, we required on-premises deployment for data security. We handle sensitive customer data that requires strict data governance, so this was non-negotiable.


Thirdly, we were looking for an open-source solution. Our needs are very specific and are already beginning to exceed even what the latest LLM Observability tools have to offer. Open source allows us to extend this software for our needs and hopefully contribute back to the community in the future.


Both Langfuse and Helicone provide a competitive stack from a feature perspective. One significant factor in our decision to choose Langfuse was that it currently has 5.6k stars on GitHub, while Helicone has 1.7k stars.


Session tracking capabilities significantly impacted our decision. Langfuse offers a tracing feature that connects sequences of LLM calls to single sessions. This aligned well with our needs and appeared straightforward in initial testing.
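As a rough illustration, here is how we would expect wiring a session together to look with the Langfuse Python SDK (a v2-style sketch; the host, keys, and names are made up, and method names may differ across SDK versions, so treat it as a sketch rather than a drop-in snippet):

```python
from langfuse import Langfuse

# Point the client at a self-hosted Langfuse instance (hypothetical host/keys).
langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://langfuse.internal.example.com",
)

# One trace per logical unit of work, grouped into a session across turns.
trace = langfuse.trace(
    name="nonbios-engineer-run",       # hypothetical trace name
    session_id="session-2024-review",  # ties multiple traces/turns together
)

# Record a single LLM call (a "generation") inside the trace.
generation = trace.generation(
    name="plan-step",
    model="gpt-4o",
    input=[{"role": "user", "content": "Refactor the billing module"}],
)
# ... perform the actual inference call here ...
generation.end(
    output="Proposed refactoring plan ...",
    usage={"input": 1200, "output": 300},  # token counts
)

langfuse.flush()  # ensure buffered events are sent before exit
```

Every trace tagged with the same session_id then shows up under one session view in the UI, which is exactly the grouping we wanted for multi-turn runs.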


For prompt management, the two tools take different approaches. Langfuse offers a robust version control system for prompts. Helicone's prompt feature is still in beta, and the free version limits the number of prompts that can be managed.
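To give a flavor of how this looks in practice, the Langfuse SDK lets you fetch a managed prompt by name and fill in its template variables. The snippet below is a sketch using the v2-style Python client; the prompt name and variables are made up.

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads keys and host from environment variables

# Fetch the current production-labeled version of a managed prompt.
prompt = langfuse.get_prompt("code-review-summary")  # hypothetical prompt name

# Fill in the template variables defined when the prompt was created in the UI.
compiled = prompt.compile(language="python", file_name="billing.py")

# `compiled` is the resolved prompt text, ready to send to the model,
# while Langfuse tracks which prompt version produced it.
print(compiled)
```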


In terms of evaluation features, Langfuse includes a scoring capability in its free version. We did not find a similar feature in Helicone. The scoring capability allows user feedback on individual LLM calls. This feedback can then be analyzed, such as mapping user experience to specific prompts and models.
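In practice, attaching feedback is close to a one-liner against the trace of the call being rated. Again, a sketch using the v2-style Python SDK with made-up identifiers:

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Attach a user-feedback score to a previously recorded trace.
langfuse.score(
    trace_id="trace-abc123",   # hypothetical id of the call sequence being rated
    name="user-feedback",
    value=1,                   # e.g. 1 = thumbs up, 0 = thumbs down
    comment="The suggested fix compiled on the first try.",
)

langfuse.flush()
```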


Another notable feature of Langfuse is LLM-as-a-judge evaluation. This feature allows you to assign an LLM to judge the responses of two different sets of LLM calls, which might differ in the prompt or the underlying model used. This can be very helpful in optimizing your inferencing setup. However, it is not enabled in the free, self-hosted version.
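Conceptually, LLM-as-a-judge boils down to asking a third model to compare two candidate outputs. The sketch below is our own minimal illustration of the idea, not Langfuse's implementation, and assumes an OpenAI-compatible client.

```python
JUDGE_PROMPT = """You are an impartial judge. Given a task and two candidate
responses (A and B), answer with exactly one word: "A", "B", or "tie",
choosing the response that better completes the task."""

def judge(client, task, response_a, response_b, judge_model="gpt-4o"):
    """Ask a judge model which of two responses better solves `task`."""
    result = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": (
                f"Task:\n{task}\n\n"
                f"Response A:\n{response_a}\n\n"
                f"Response B:\n{response_b}"
            )},
        ],
    )
    return result.choices[0].message.content.strip()
```

Run over a batch of paired responses (say, an old prompt versus a new one), the judge's win rate gives a quick signal on whether a change actually helped.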


Both Helicone and Langfuse have a playground feature, which allows you to invoke and iterate on specific LLM calls directly through the UI, but this was restricted to paid tiers in both tools.


Overall, Langfuse appeared more developed and suited to our requirements. It uses PostgreSQL in the backend, which aligned with our existing infrastructure. The default dark theme was visually challenging but can be easily switched to a light one.


Note: Our evaluation was limited and may not cover all features exhaustively. Both platforms are actively developing, and features may have changed since our assessment.
