Ingestion and Retrieval Flows

Knowledge lets you configure how data is ingested and how it is retrieved at query time using flows. A flow is a series of steps configured via a simple YAML file, a so-called Flow File or Flow Config.

Ingestion Flows

An Ingestion Flow consists of three main parts (all of them are optional and have basic defaults):

Documentloader: The loader defines how input data/files are parsed and transformed to "LLM-readable" text.
Textsplitter: The textsplitter defines how the text coming from the loader is split into smaller parts (documents).
Transformers: Transformers can be used to modify the documents coming out of the textsplitter. For example, they can add metadata, remove irrelevant documents, or generate summaries for every document.

All documents yielded by the flow are sent to the Embeddings Model (globally defined, not yet configurable via the flows file) and then stored in the vector database.
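The three stages above can be pictured as a small pipeline. The following Python sketch is only an illustration of the data flow, not the tool's implementation: `Document`, `load_plaintext`, `split_by_paragraph`, `drop_short`, `add_metadata`, and `run_ingestion` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    content: str
    metadata: dict = field(default_factory=dict)


def load_plaintext(raw: bytes) -> str:
    """Documentloader stand-in: turn raw file bytes into text."""
    return raw.decode("utf-8")


def split_by_paragraph(text: str) -> list[Document]:
    """Textsplitter stand-in: one document per non-empty paragraph."""
    return [Document(p.strip()) for p in text.split("\n\n") if p.strip()]


def drop_short(min_chars: int) -> Callable[[list[Document]], list[Document]]:
    """Transformer stand-in: remove documents with too little content."""
    def transform(docs: list[Document]) -> list[Document]:
        return [d for d in docs if len(d.content) >= min_chars]
    return transform


def add_metadata(extra: dict) -> Callable[[list[Document]], list[Document]]:
    """Transformer stand-in: attach fixed metadata to every document."""
    def transform(docs: list[Document]) -> list[Document]:
        for d in docs:
            d.metadata.update(extra)
        return docs
    return transform


def run_ingestion(raw, loader, splitter, transformers) -> list[Document]:
    """Loader -> splitter -> transformers, applied in order."""
    docs = splitter(loader(raw))
    for transform in transformers:
        docs = transform(docs)
    return docs  # the next step would be embedding and storing these


raw = b"# Title\n\nFirst paragraph with some content.\n\nok"
docs = run_ingestion(
    raw,
    load_plaintext,
    split_by_paragraph,
    [drop_short(5), add_metadata({"foo": "bar"})],
)
```

Here the two-character paragraph is filtered out and the remaining documents carry the extra metadata, which mirrors what the filter and metadata transformers described above are meant to do.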

Retrieval Flows

A Retrieval Flow consists of three main parts (all of them are optional and have basic defaults):

QueryModifiers: QueryModifiers can be used to modify the input query before it is used for retrieval. They can even generate additional subqueries.
Retriever: The retriever defines how documents are retrieved from the vector database. For example, a routing retriever automatically selects the most suitable dataset for the search.
Postprocessors: Postprocessors can be used to modify the retrieved documents. For example, they can filter out irrelevant documents or sort the documents by relevance.
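The retrieval stages above chain in the same way. This is a minimal Python sketch under stated assumptions: an in-memory list stands in for the vector database, keyword overlap stands in for vector similarity, and every function name (`lowercase`, `split_on_and`, `keyword_retriever`, `min_score`, `sort_by_score`, `run_retrieval`) is hypothetical rather than part of the tool.

```python
def lowercase(queries: list[str]) -> list[str]:
    """Query modifier stand-in: normalize the query text."""
    return [q.lower() for q in queries]


def split_on_and(queries: list[str]) -> list[str]:
    """Query modifier stand-in: add one subquery per ' and ' clause."""
    out = []
    for q in queries:
        out.append(q)
        parts = [p.strip() for p in q.split(" and ")]
        if len(parts) > 1:
            out.extend(parts)
    return out


def keyword_retriever(store: list[str], top_k: int):
    """Retriever stand-in: score by shared words instead of embeddings."""
    def retrieve(query: str) -> list[dict]:
        words = set(query.split())
        scored = [
            {"content": text, "score": len(words & set(text.lower().split()))}
            for text in store
        ]
        scored.sort(key=lambda d: d["score"], reverse=True)
        return scored[:top_k]
    return retrieve


def min_score(threshold: int):
    """Postprocessor stand-in: filter out low-scoring documents."""
    def post(docs: list[dict]) -> list[dict]:
        return [d for d in docs if d["score"] >= threshold]
    return post


def sort_by_score(docs: list[dict]) -> list[dict]:
    """Postprocessor stand-in: order documents by relevance."""
    return sorted(docs, key=lambda d: d["score"], reverse=True)


def run_retrieval(query, modifiers, retriever, postprocessors) -> list[dict]:
    queries = [query]
    for modify in modifiers:
        queries = modify(queries)
    seen, docs = set(), []
    for q in queries:  # retrieve for the query and every subquery
        for d in retriever(q):
            if d["content"] not in seen:  # keep the first hit per document
                seen.add(d["content"])
                docs.append(d)
    for post in postprocessors:
        docs = post(docs)
    return docs


store = [
    "Flows are configured in YAML",
    "The retriever queries the vector database",
    "Unrelated note about lunch",
]
results = run_retrieval(
    "YAML and Retriever",
    [lowercase, split_on_and],
    keyword_retriever(store, top_k=2),
    [min_score(1), sort_by_score],
)
```

The point of the sketch is the ordering: modifiers run before retrieval and may fan one query out into several, while postprocessors only ever see the retrieved documents.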

Retrieval Flow Architecture

Flow Config File - "Flows File"

A Flows File can define one or more named flows and can also assign flows to datasets, so you don't have to select them manually every time.

Using Flows Files

Use the --flows-file flag to point to your flows file and optionally the --flow flag to select a specific flow from that file for your operation. If you don't specify a flow, either the dataset-assigned flow or, if there is no such assignment, the default flow will be used.

The --flows-file and --flow flags are available for the following commands:

ingest
retrieve
askdir
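The selection order described above (explicit flow, then dataset assignment, then default) can be sketched in a few lines of Python. This is an illustration only: `resolve_flow` is a hypothetical helper, the config is a plain dict standing in for a parsed Flows File, and it assumes that "the default flow" means the flow marked `default: true`, with `None` standing in for the tool's built-in defaults.

```python
from typing import Optional


def resolve_flow(config: dict, dataset: str, flow_flag: Optional[str] = None) -> Optional[str]:
    """Return the name of the flow to use, following the documented precedence."""
    flows = config.get("flows", {})

    # 1. An explicitly selected flow (the --flow flag) wins.
    if flow_flag is not None:
        if flow_flag not in flows:
            raise KeyError(f"flow {flow_flag!r} is not defined in the flows file")
        return flow_flag

    # 2. Otherwise use the flow assigned to the dataset, if any.
    assigned = config.get("datasets", {}).get(dataset)
    if assigned is not None:
        return assigned

    # 3. Otherwise fall back to a flow marked as default (assumption).
    for name, flow in flows.items():
        if flow.get("default"):
            return name

    return None  # assumption: the caller then uses built-in defaults


config = {
    "flows": {"foo": {"default": True}, "bar": {"default": False}},
    "datasets": {"reports": "bar"},
}
explicit = resolve_flow(config, "reports", flow_flag="foo")
assigned = resolve_flow(config, "reports")
fallback = resolve_flow(config, "other")
```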

File Structure

A Flows File is a YAML file with the following structure:

flows: # top-level key (required) - Define your flows here
  foo: # name of the flow
    default: false # optional - Set this flow as the default flow
    ingestion: # define the ingestion flow
      # List of ingestion flows - Ingestion flows can be defined for different filetypes
      # Ingestion flows consist of a documentloader, textsplitter and one or more transformers
      # All components are optional and have basic defaults
      - filetypes: [".txt", ".md"]
        documentloader:
          name: plaintext
        textsplitter:
          name: markdown
        transformers:
          - name: filter_markdown_docs_no_content
          - name: extra_metadata
            options: # define advanced options for the extra_metadata transformer
              metadata:
                "foo": "bar"
      - filetypes: [".pdf"]
        documentloader:
          name: pdf
          options:
            maxPages: 5
            interpreterConfig:
              ignoreDefOfNonNameVals:
                - "CMapName"
    retrieval: # define the retrieval flow
      # There can be only one retrieval flow per top-level flow - there's no differentiation between filetypes
      # The retrieval flow consists of querymodifiers, a retriever and postprocessors
      # All components are optional and have basic defaults
      retriever:
        name: basic
        options:
          topK: 15
      postprocessors:
        - name: extra_metadata
          options:
            metadata:
              "spam": "eggs"
  bar: # another flow with only ingestion defined
    default: false
    ingestion:
      - filetypes: [ ".txt", ".md" ]
        documentloader:
          name: plaintext
        textsplitter:
          name: text
  baz: # another flow with only ingestion defined, which only differs from the `bar` flow in the chunkSize option
    default: false
    ingestion:
      - filetypes: [ ".txt", ".md" ]
        documentloader:
          name: plaintext
        textsplitter:
          name: text
          options:
            chunkSize: 4096

datasets: # top-level key (optional) - Assign flows to datasets
  foo: foo # dataset "foo" uses flow "foo"
  bar: bar
  baz: baz
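Because a flow can carry several ingestion entries keyed by `filetypes`, something has to pick the entry for a given file. The Python sketch below shows one plausible way to do that over a dict that mirrors part of the `foo` flow above. `select_ingestion_entry` is a hypothetical helper, and both the first-match rule and the case-insensitive extension comparison are assumptions, not documented behavior.

```python
from pathlib import Path
from typing import Optional

# A dict mirroring part of the `foo` flow from the example file above.
foo_flow = {
    "ingestion": [
        {
            "filetypes": [".txt", ".md"],
            "documentloader": {"name": "plaintext"},
            "textsplitter": {"name": "markdown"},
        },
        {
            "filetypes": [".pdf"],
            "documentloader": {"name": "pdf", "options": {"maxPages": 5}},
        },
    ]
}


def select_ingestion_entry(flow: dict, filename: str) -> Optional[dict]:
    """Return the first ingestion entry whose filetypes include the file's extension."""
    ext = Path(filename).suffix.lower()  # assumption: extensions compare case-insensitively
    for entry in flow.get("ingestion", []):
        if ext in entry.get("filetypes", []):
            return entry
    return None  # assumption: no match means built-in defaults apply


md_entry = select_ingestion_entry(foo_flow, "notes.md")
pdf_entry = select_ingestion_entry(foo_flow, "paper.PDF")
png_entry = select_ingestion_entry(foo_flow, "image.png")
```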

Example Flows Files

You can find some example flow files in our GitHub repository.
