Scores & Evaluation

Scores serve as a metric for evaluating individual executions or traces.

Many kinds of scores can be used. The most common ones assess quality, tonality, factual accuracy, completeness, and relevance.

If a score relates to a specific step of a trace, for example a single LLM call, a message in a chat conversation, or a step in an agent, you can attach the score directly to that observation. This enables targeted evaluation of that particular component.

The Score object in Langfuse:

| Attribute | Type | Description |
| --- | --- | --- |
| name | string | Name of the score, e.g. user_feedback, hallucination_eval |
| value | number | Value of the score |
| traceId | string | Id of the trace the score relates to |
| observationId | string | Optional: Observation (e.g. LLM call) the score relates to |
| comment | string | Optional: Evaluation comment, commonly used for user feedback, eval output or internal notes |
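The attributes listed above can be pictured as a small record. The following is only a minimal sketch, not the Langfuse SDK class: the `Score` dataclass and its `to_payload` helper are hypothetical names introduced for illustration, and the field names simply mirror the attribute names above rather than Python naming conventions. It shows one score attached to a whole trace and one attached to a single observation.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Score:
    """Illustrative model of the score attributes (hypothetical, not the SDK class)."""

    name: str                             # e.g. "user_feedback", "hallucination_eval"
    value: float                          # numeric value of the score
    traceId: str                          # trace the score relates to
    observationId: Optional[str] = None   # optional: observation within the trace
    comment: Optional[str] = None         # optional: free-text evaluation comment

    def to_payload(self) -> dict:
        # Drop optional attributes that were not set.
        return {k: v for k, v in asdict(self).items() if v is not None}


# Score for the trace as a whole (ids are made-up example values).
trace_level = Score(name="user_feedback", value=1, traceId="trace-123")

# Score attached to one observation of that trace, e.g. a single LLM call.
observation_level = Score(
    name="hallucination_eval",
    value=0.2,
    traceId="trace-123",
    observationId="obs-456",
    comment="Claim not supported by retrieved context",
)
```

Keeping the optional attributes out of the payload when they are unset makes the difference between a trace-level and an observation-level score visible at a glance.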

Kinds of scores

Scores in Langfuse are adaptable and designed to cater to the unique requirements of specific LLM applications. They typically serve to measure the following aspects:

  • Quality
    • Factual accuracy
    • Completeness of the information provided
    • Verification against hallucinations
  • Style
    • Sentiment portrayed
    • Tonality of the content
    • Potential toxicity
  • Security
    • Similarity to prevalent prompt injections
    • Instances of model refusals (e.g., as a language model, ...)

This flexibility lets you evaluate whichever of these aspects matter for the function and performance of your LLM application.

Ingesting scores

We are currently running a private beta of our newest evaluation service on Langfuse Cloud. Learn more here and ping us via the chat widget if you are interested in joining the beta.

Most users of Langfuse ingest scores programmatically. These are common sources of scores:

| Source | Examples |
| --- | --- |
| Manual evaluation (UI) | Review traces/generations and add scores manually in the UI |
| User feedback | Explicit (e.g., thumbs up/down, 1-5 star rating) or implicit (e.g., time spent on a page, click-through rate, accepting/rejecting a model-generated output) |
| Model-based evaluation | OpenAI Evals, Whylabs Langkit, Langchain Evaluators (cookbook), RAGAS for RAG pipelines (cookbook), custom model outputs |
| Custom via SDKs/API | Run-time quality checks (e.g. valid structured output format), custom workflow tool for human evaluation |
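As a rough illustration of two of these sources, the sketch below turns a run-time quality check (does the model output parse as JSON?) and an explicit thumbs up/down into score records with the attributes described earlier. The function names, the score names `valid_json` and `user_feedback`, and the 0/1 value convention are assumptions made for this example. The sketch only builds plain dictionaries; sending them to Langfuse through an SDK or the API is deliberately left out.

```python
import json


def structured_output_score(trace_id: str, model_output: str) -> dict:
    """Run-time quality check: 1 if the output parses as JSON, else 0."""
    score = {"name": "valid_json", "traceId": trace_id}
    try:
        json.loads(model_output)
        score["value"] = 1
    except json.JSONDecodeError as err:
        score["value"] = 0
        # Keep the parser message as an evaluation comment.
        score["comment"] = f"Invalid JSON: {err.msg}"
    return score


def thumbs_feedback_score(trace_id: str, thumbs_up: bool) -> dict:
    """Explicit user feedback: thumbs up maps to 1, thumbs down to 0."""
    return {
        "name": "user_feedback",
        "value": 1 if thumbs_up else 0,
        "traceId": trace_id,
    }


ok = structured_output_score("trace-123", '{"answer": 42}')
bad = structured_output_score("trace-124", "not json")
feedback = thumbs_feedback_score("trace-123", thumbs_up=True)
```

Mapping both sources onto the same record shape means they can be compared and filtered in the same way later on.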

Using scores across Langfuse

Scores can be used in multiple ways across Langfuse:

  1. Display: scores are shown on each trace to provide a quick overview
  2. Segmentation: filter all execution traces by score, e.g. to find all traces with a low quality score
  3. Analytics: detailed score reporting with drill-downs into use cases and user segments
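The segmentation use case can be sketched outside of Langfuse as a simple filter over score records. This is not how Langfuse implements it; the `traces_below` helper, the `quality` score name, and the 0.5 threshold are all hypothetical choices for the example, and the records reuse the attribute names of the Score object.

```python
from typing import Iterable, List


def traces_below(scores: Iterable[dict], name: str, threshold: float) -> List[str]:
    """Return ids of traces that have at least one `name` score below `threshold`."""
    trace_ids: List[str] = []
    for score in scores:
        is_match = score["name"] == name and score["value"] < threshold
        # Preserve first-seen order and avoid listing a trace twice.
        if is_match and score["traceId"] not in trace_ids:
            trace_ids.append(score["traceId"])
    return trace_ids


# Made-up example data.
scores = [
    {"name": "quality", "value": 0.9, "traceId": "trace-a"},
    {"name": "quality", "value": 0.3, "traceId": "trace-b"},
    {"name": "user_feedback", "value": 0, "traceId": "trace-a"},
    {"name": "quality", "value": 0.1, "traceId": "trace-b"},
    {"name": "quality", "value": 0.4, "traceId": "trace-c"},
]

low_quality = traces_below(scores, "quality", 0.5)
```

Here `low_quality` contains `trace-b` and `trace-c`; `trace-a` is excluded because its only score under the threshold belongs to a different score name.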
