Data Designer SDK

πŸ‘‹ Hi! Welcome to the DataDesigner SDK documentation.

DataDesigner is a general-purpose framework for designing and generating synthetic datasets from scratch. It provides a simple interface that lets developers describe the attributes of the dataset they want and iterate on the generated data through fast previews and detailed evaluations.

Getting Started

The best way to learn how to use DataDesigner is by example.

To get started, here is a simple example of how to generate a product review dataset.

from gretel_client.navigator_client import Gretel

# We import AIDD column and parameter types using this shorthand for convenience.
import gretel_client.data_designer.params as P
import gretel_client.data_designer.columns as C

gretel = Gretel()

# Initialize a new Data Designer instance using the `data_designer` factory.
# We use the "apache-2.0" model suite to generate data with a permissive license.
aidd = gretel.data_designer.new(model_suite="apache-2.0")

##############################################################
# Add Sampler columns to our data design.
##############################################################

aidd.add_column(
    C.SamplerColumn(
        name="product_category",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books", "Home Office"],
        )
    )
)

# This column will sample synthetic person data based on the US Census.
aidd.add_column(
    C.SamplerColumn(
        name="customer",
        type=P.SamplerType.PERSON,
        params=P.PersonSamplerParams(age_range=[18, 70])
    )
)

aidd.add_column(
    C.SamplerColumn(
        name="number_of_stars",
        type=P.SamplerType.UNIFORM,
        params=P.UniformSamplerParams(low=1, high=5),
        convert_to="int"
    )
)

##############################################################
# Add LLM-generated columns to our data design.
##############################################################

aidd.add_column(
    C.LLMTextColumn(
        name="product_name",
        # All columns in the dataset are accessible in the prompt template.
        prompt=(
            "Come up with a creative product name for a product in the '{{ product_category }}' category. "
            "Respond with only the product name, no other text."
        ),
        # This is optional but can be useful for controlling the LLM's behavior. Do not include instructions
        # related to output formatting in the system prompt, as AIDD handles this based on the column type.
        system_prompt=(
            "You are a helpful assistant that generates product names. You respond with only the product name, "
            "no other text. You do NOT add quotes around the product name. "
        )
    )
)

aidd.add_column(
    C.LLMTextColumn(
        name="customer_review",
        # Note the nested JSON of the customer column is accessible using dot notation.
        prompt=(
            "You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. "
            "You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. "
            "Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. "
        ),
    )
)

##############################################################
# Generate your dataset!
##############################################################

# Generate 10 preview records in real time for fast iteration.
preview = aidd.preview()

# Create a new dataset with an evaluation report.
workflow_run = aidd.with_evaluation_report().create(
    num_records=100,
    name="aidd-sdk-101-product-reviews",
    wait_until_done=True
)

# Download the dataset to a pandas DataFrame.
df = workflow_run.dataset.df

πŸ—‚οΈ API Reference

The DataDesigner class serves as a high-level interface for building datasets from scratch.

class gretel_client.data_designer.data_designer.DataDesigner

High-level interface for building datasets with AI Data Designer.

Instances of this class should always be created through the DataDesignerFactory, which is available as the data_designer attribute of the Gretel object.

Example:

from gretel_client.navigator_client import Gretel

gretel = Gretel(api_key="prompt")

# Create a DataDesigner instance from scratch
aidd = gretel.data_designer.new()

# Create a DataDesigner instance from a configuration file
aidd = gretel.data_designer.from_config("path/to/config.yaml")

property allowed_references: list[str]

All referenceable variables allowed in prompt templates and expressions.

property config: AIDDConfig

The current configuration object of this Data Designer instance.

property model_suite: ModelSuite

The current model suite.

property model_configs: list[ModelConfig] | None

The current model configurations.

get_column(column_name: str) → AIDDColumnT | None

Returns the column object with the given name.

get_columns_of_type(column_type: Type) → list[AIDDColumnT]

Returns all columns of the given type.

delete_column(column_name: str) → Self

Deletes the column with the given name.
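
Example (a minimal sketch using the product-review design and shorthand imports from Getting Started):

# Fetch a single column object by name.
product_name_col = aidd.get_column("product_name")

# Fetch all sampler columns in the current design.
sampler_cols = aidd.get_columns_of_type(C.SamplerColumn)

# Remove a column from the design.
aidd.delete_column("number_of_stars")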

add_column(column: AIDDColumnT | None = None, *, name: str | None = None, type: ColumnProviderTypeT = ProviderType.LLM_TEXT, **kwargs) → Self

Add AIDD column to the current Data Designer configuration.

If no column object is provided, you must provide the name, type, and any additional keyword arguments required by the column constructor. The constructor parameter types for each column type can be imported from the params module: gretel_client.data_designer.params.

Parameters:
  • column – AIDD column object to add.

  • name – Name of the column to add. This is only used if column is not provided.

  • type – Column type to add. This is only used if column is not provided.

  • **kwargs – Additional keyword arguments to pass to the column constructor.

Returns:

The current Data Designer instance.
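
Example (a minimal sketch of the keyword-argument shorthand, using the product-review design from Getting Started; the review_summary column is illustrative only):

# Full form: pass a column object.
aidd.add_column(
    C.LLMTextColumn(
        name="review_summary",
        prompt="Summarize this review in one sentence: {{ customer_review }}",
    )
)

# Shorthand form: omit the column object. `type` defaults to an LLM text column,
# and the remaining kwargs are passed to the column constructor.
aidd.add_column(
    name="review_summary",
    prompt="Summarize this review in one sentence: {{ customer_review }}",
)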

get_constraints(target_column: str) → list[ColumnConstraint]

Returns the constraints for the given target column.

delete_constraints(target_column: str) → Self

Deletes the constraints for the given target column.

add_constraint(target_column: str, type: ConstraintType, params: dict[str, str | float] | ColumnConstraintParams) → Self

Add a constraint to the current Data Designer configuration.

Currently, constraints are only supported for numerical samplers. The type must be one of:

  • β€œscalar_inequality”: Constraint between a column and a scalar value.

  • β€œcolumn_inequality”: Constraint between two columns.

The params must be a dictionary or a ColumnConstraintParams object with the following keyword arguments:

  • β€œrhs”: The right-hand side of the inequality.

  • β€œoperator”: One of the following inequality operators:

    • β€œgt”: Greater than.

    • β€œge”: Greater than or equal to.

    • β€œlt”: Less than.

    • β€œle”: Less than or equal to.

Parameters:
  • target_column – The column that the constraint applies to.

  • type – Type of constraint to add.

  • params – Parameters for the constraint.

Returns:

The current Data Designer instance.
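
Example (a sketch of a column inequality using the shorthand imports from Getting Started; the price columns are illustrative only, and ConstraintType is imported from the tasks module listed in the Constraints section below):

from gretel_client.workflows.configs.tasks import ConstraintType

aidd.add_column(
    C.SamplerColumn(
        name="price",
        type=P.SamplerType.UNIFORM,
        params=P.UniformSamplerParams(low=10, high=500),
    )
)

aidd.add_column(
    C.SamplerColumn(
        name="discounted_price",
        type=P.SamplerType.UNIFORM,
        params=P.UniformSamplerParams(low=10, high=500),
    )
)

# Ensure the discounted price is always less than the original price.
aidd.add_constraint(
    target_column="discounted_price",
    type=ConstraintType.COLUMN_INEQUALITY,
    params={"operator": "lt", "rhs": "price"},
)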

get_evaluation_report() → GeneralDatasetEvaluation | None

Returns the dataset evaluation report configuration if one is set.

delete_evaluation_report() → Self

Deletes the current dataset evaluation report configuration.

with_evaluation_report(settings: EvaluateDataDesignerDatasetSettings | None = None) → Self

Add an evaluation report to the current Data Designer configuration.

Parameters:

settings – Evaluation report settings.

Returns:

The current Data Designer instance.

preview(verbose_logging: bool = False, validate: bool = True) → PreviewResults

Generate a preview of the dataset.

The preview consists of 10 records generated from the current configuration. This is a quick way to check that the configuration is working as expected before generating a larger dataset.

Parameters:
  • verbose_logging – Whether to enable verbose logging.

  • validate – If True, run semantic validation on the configuration before generating a preview. This is recommended to catch issues like invalid references in prompt templates, which otherwise would only be caught during at runtime.

Returns:

Preview results object.
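
Example (a sketch of a typical iteration loop, assuming the preview results expose the generated records as a pandas DataFrame via .dataset.df, mirroring the WorkflowRun example in Getting Started):

# Generate a quick preview and inspect the records.
preview = aidd.preview(verbose_logging=True)
preview.dataset.df.head()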

create(*, num_records: int, name: str | None = None, run_name: str | None = None, wait_until_done: bool = False) → WorkflowRun

Create a new dataset based on the current Data Designer configuration.

This method creates a persistent workflow and runs it as a batch job in a managed service. Unlike preview, this creates a permanent record of the workflow execution that can be referenced later.

Parameters:
  • num_records – Number of records to generate.

  • name – Name of the workflow.

  • run_name – Name of a specific workflow run. This name is useful as a label for a particular workflow run when you plan to run the same workflow multiple times (potentially with different configurations).

  • wait_until_done – Block until the workflow has completed running. If False, immediately returns the WorkflowRun object.

Returns:

WorkflowRun object.

with_person_samplers(person_samplers: dict[str, PersonSamplerParams], *, keep_person_columns: bool = False) → Self

Define latent person samplers that will be dropped at the end of the workflow.

Person samplers defined with this method are latent in the sense that they give you access to person objects with attributes that can be referenced by other columns, but the objects themselves are dropped from the final dataset. This is useful when you just need access to certain person attributes but don’t need the entire object in the final dataset.

If you want to keep the person sampler columns in the final dataset, you have two options. You can either set keep_person_columns to True or you can add person samplers as columns using the add_column method.

Parameters:
  • person_samplers – Dictionary of person sampler parameters. The keys are the names of the person samplers and the values are the parameters for each sampler.

  • keep_person_columns – If True, keep the person sampler columns in the final dataset.

Returns:

The current Data Designer instance.
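
Example (a minimal sketch using the shorthand imports from Getting Started; the "reviewer" sampler and its referencing column are illustrative only, and the reviewer column itself is dropped from the final dataset):

aidd.with_person_samplers(
    {"reviewer": P.PersonSamplerParams(locale="en_US", age_range=[25, 65])}
)

aidd.add_column(
    C.LLMTextColumn(
        name="reviewer_bio",
        prompt=(
            "Write a one-sentence bio for a {{ reviewer.age }}-year-old reviewer named "
            "{{ reviewer.first_name }} from {{ reviewer.city }}, {{ reviewer.state }}."
        ),
    )
)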

with_seed_dataset(dataset: DataFrame | Path | str | File, sampling_strategy: SamplingStrategy = SamplingStrategy.ORDERED, with_replacement: bool = False) → Self

Define a dataset to seed the synthetic data generation process.

Each row of the seed dataset is treated as a single example. The columns of the seed dataset can be referenced by other columns in prompt templates and/or expressions.

The seed data will be sampled using one of the following strategies:

  • β€œordered”: Maintains the order of the rows in the seed dataset.

  • β€œshuffle”: Randomly shuffles the rows of the seed dataset.

Parameters:
  • dataset – DataFrame, Path, or File object.

  • sampling_strategy – Sampling strategy to use.

  • with_replacement – If True, the same row can be sampled multiple times.

validate() → Self

Validate the current Data Designer configuration.

This method runs task-level validation on the current configuration, as well as β€œsemantic” validation, which performs a holistic check of the full schema for issues such as references to undefined columns or inconsistent settings between related columns.

Returns:

The current Data Designer instance.

property seed_columns: list[gretel_client.data_designer.types.DataSeedColumn]

Columns from the seed dataset, if one is defined.

property sampler_columns: list[SamplerColumn]

Columns that use a sampler to generate data.

property llm_gen_columns: list[gretel_client.data_designer.types.LLMGenColumn]

Columns that use an LLM to generate data.

property llm_text_columns: list[LLMTextColumn]

Columns that use an LLM to generate text data.

property llm_code_columns: list[LLMCodeColumn]

Columns that use an LLM to generate code.

property llm_structured_columns: list[LLMStructuredColumn]

Columns that use an LLM to generate structured data.

property llm_judge_columns: list[LLMJudgeColumn]

Columns that use an LLM to judge the quality of generated data.

property code_validation_columns: list[CodeValidationColumn]

Columns containing the results of validating code columns.

property expression_columns: list[ExpressionColumn]

Columns that generate data from a jinja2 expression.

property workflow_manager: WorkflowManager

Workflow manager for the current Data Designer instance.

πŸ›οΈ Column Types

To craft your data design, you can select from the following column types.

class gretel_client.data_designer.columns.SamplerColumn(*, name: str, type: SamplerType, params: SubcategorySamplerParams | CategorySamplerParams | DatetimeSamplerParams | PersonSamplerParams | TimeDeltaSamplerParams | UUIDSamplerParams | BernoulliSamplerParams | BernoulliMixtureSamplerParams | BinomialSamplerParams | GaussianSamplerParams | PoissonSamplerParams | UniformSamplerParams | ScipySamplerParams, conditional_params: Dict[str, SubcategorySamplerParams | CategorySamplerParams | DatetimeSamplerParams | PersonSamplerParams | TimeDeltaSamplerParams | UUIDSamplerParams | BernoulliSamplerParams | BernoulliMixtureSamplerParams | BinomialSamplerParams | GaussianSamplerParams | PoissonSamplerParams | UniformSamplerParams | ScipySamplerParams] | None = {}, convert_to: str | None = None, drop: bool = False, **extra_data: Any)

AIDD column that uses a sampler to generate data.

Sampler columns can be conditioned on other sampler columns using the conditional_params argument, which is a dictionary of conditions and parameters. Conditions are specified as strings involving the names of other sampler columns and the operators ==, !=, >, >=, <, <=.

Parameters:
  • name – Name of the column.

  • type – Type of sampler to use.

  • params – Parameters for the sampler. If conditional_params are provided, these parameters will be used as the default when no condition is met.

  • conditional_params – Conditional parameters for the sampler. The keys of the dict are conditions from other columns, and the values are the parameters for the sampler.

  • convert_to – Optional data conversion to apply to the generated data. For numerical columns this can be β€œint” or β€œfloat”, and for datetime columns, this can be a datetime format string (e.g. β€œ%Y/%m/%d”).

Example:

from gretel_client.navigator_client import Gretel
from gretel_client.data_designer.columns import SamplerColumn
from gretel_client.data_designer.params import (
    GaussianSamplerParams,
    CategorySamplerParams,
    SamplerType,
)

aidd = Gretel(api_key="prompt").data_designer.new()

aidd.add_column(
    SamplerColumn(
        name="age",
        type=SamplerType.GAUSSIAN,
        params=GaussianSamplerParams(mean=35, stddev=5),
        convert_to="int",
    )
)

aidd.add_column(
    SamplerColumn(
        name="pet_type",
        type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=["dog", "cat", "bird"]),
        conditional_params={
            "age < 20": CategorySamplerParams(values=["rabbit", "hamster"]),
        },
    )
)

classmethod unpack(column: SerializableConditionalDataColumn | dict) → Self

This can be used to unpack the true base type.

class gretel_client.data_designer.columns.LLMTextColumn(*, num_records: int | None = 100, model_suite: str | None = 'apache-2.0', error_rate: float | None = 0.2, model_configs: List[ModelConfig] | None = None, model_alias: str | ModelAlias = ModelAlias.TEXT, prompt: str, name: str | None = 'response', system_prompt: str | None = None, output_type: OutputType = OutputType.TEXT, output_format: str | Dict[str, Any] | None = None, description: str | None = '', drop: bool = False, **extra_data: Any)

AIDD column that uses an LLM to generate text.

Parameters:
  • name – Name of the column.

  • prompt – Prompt template to use for generation.

  • system_prompt – System prompt for the LLM. Useful for defining the LLM’s role, tone, and other instructions. However, do not provide any instructions related to the output format, as this is handled internally by AIDD.

  • model_alias – Model alias to use for the LLM. Defaults to ModelAlias.TEXT.

class gretel_client.data_designer.columns.LLMCodeColumn(*, num_records: int | None = 100, model_suite: str | None = 'apache-2.0', error_rate: float | None = 0.2, model_configs: List[ModelConfig] | None = None, model_alias: str | ModelAlias = ModelAlias.CODE, prompt: str, name: str | None = 'response', system_prompt: str | None = None, output_type: OutputType = OutputType.CODE, output_format: str | Dict[str, Any] | None = None, description: str | None = '', drop: bool = False, **extra_data: Any)

AIDD column that uses an LLM to generate code.

Parameters:
  • name – Name of the column.

  • prompt – Prompt template to use for generation.

  • system_prompt – System prompt for the LLM. Useful for defining the LLM’s role, tone, and other instructions. However, do not provide any instructions related to the output format, as this is handled internally by AIDD.

  • model_alias – Model alias to use for the LLM. Defaults to ModelAlias.CODE.

class gretel_client.data_designer.columns.LLMStructuredColumn(*, num_records: int | None = 100, model_suite: str | None = 'apache-2.0', error_rate: float | None = 0.2, model_configs: List[ModelConfig] | None = None, model_alias: str | ModelAlias = ModelAlias.STRUCTURED, prompt: str, name: str | None = 'response', system_prompt: str | None = None, output_type: OutputType = OutputType.STRUCTURED, output_format: str | Dict[str, Any] | None = None, description: str | None = '', drop: bool = False, **extra_data: Any)

AIDD column that uses an LLM to generate structured data.

Parameters:
  • name – Name of the column.

  • prompt – Prompt template to use for generation.

  • system_prompt – System prompt for the LLM. Useful for defining the LLM’s role, tone, and other instructions. However, do not provide any instructions related to the output format, as this is handled internally by AIDD.

  • model_alias – Model alias to use for the LLM. Defaults to ModelAlias.STRUCTURED.

class gretel_client.data_designer.columns.LLMJudgeColumn(*, model_suite: str | None = 'apache-2.0', error_rate: float | None = 0.2, model_configs: List[ModelConfig] | None = None, model_alias: str | ModelAlias | None = 'judge', prompt: str, num_samples_to_judge: int | None = 100, rubrics: List[Rubric], name: str, judge_random_seed: int | None = 2025, drop: bool = False, **extra_data: Any)

AIDD column for LLM-as-a-judge evaluation with custom rubrics.

Parameters:
  • name – Name of the column.

  • prompt – Prompt template to use for llm-as-a-judge.

  • rubrics – List of rubrics to use for evaluation.

  • num_samples_to_judge – Number of samples to judge. If None, the full dataset will be judged. If less than the total number of rows in the dataset, a random sample of the specified size will be judged.

  • model_alias – Model alias to use for the LLM. Defaults to ModelAlias.JUDGE.

class gretel_client.data_designer.columns.CodeValidationColumn(*, drop: bool = False, name: str, code_lang: CodeLang, target_column: str, **extra_data: Any)

AIDD column for validating code in another column.

Code validation is currently supported for Python and SQL.

Parameters:
  • name – Name of the column.

  • code_lang – Language of the code to validate.

  • target_column – Column with code to validate.

class gretel_client.data_designer.columns.ExpressionColumn(*, num_records: int | None = 100, name: str, expr: str, dtype: Dtype | None = 'str', drop: bool = False, **extra_data: Any)

AIDD column that generates data from jinja2 expressions.

Parameters:
  • name – Name of the column.

  • expr – Expression to use for generation.

  • dtype – Data type of the column. Can be β€œstr” (default), β€œint”, β€œfloat”, or β€œbool”.

🎲 Samplers

Samplers are DataDesigner's core mechanism for non-LLM-based synthetic data generation: they draw values from statistical distributions, categorical lists, and other generators without calling a model.

class gretel_client.workflows.configs.tasks.SamplerType
BERNOULLI = 'bernoulli'
BERNOULLI_MIXTURE = 'bernoulli_mixture'
BINOMIAL = 'binomial'
CATEGORY = 'category'
DATETIME = 'datetime'
GAUSSIAN = 'gaussian'
PERSON = 'person'
POISSON = 'poisson'
SCIPY = 'scipy'
SUBCATEGORY = 'subcategory'
TIMEDELTA = 'timedelta'
UNIFORM = 'uniform'
UUID = 'uuid'
class gretel_client.data_designer.params.BernoulliMixtureSamplerParams(*, p: float, dist_name: str, dist_params: Dict[str, Any], **extra_data: Any)
class gretel_client.data_designer.params.BernoulliSamplerParams(*, p: float, **extra_data: Any)
class gretel_client.data_designer.params.BinomialSamplerParams(*, n: int, p: float, **extra_data: Any)
class gretel_client.data_designer.params.CategorySamplerParams(*, values: List[str | int | float], weights: List[float] | None = None, **extra_data: Any)
class gretel_client.data_designer.params.DatetimeSamplerParams(*, start: str, end: str, unit: Unit | None = 'D', **extra_data: Any)
class gretel_client.data_designer.params.GaussianSamplerParams(*, mean: float, stddev: float, **extra_data: Any)
class gretel_client.data_designer.params.GenerationParameters(*, temperature: float | UniformDistribution | ManualDistribution | None = None, top_p: float | UniformDistribution | ManualDistribution | None = None, **extra_data: Any)
class gretel_client.data_designer.params.ManualDistribution(*, distribution_type: DistributionType | None = 'manual', params: ManualDistributionParams, **extra_data: Any)
class gretel_client.data_designer.params.ManualDistributionParams(*, values: List[float], weights: List[float] | None = None, **extra_data: Any)
class gretel_client.data_designer.params.ModelConfig(*, alias: str, model_name: str, generation_parameters: GenerationParameters, **extra_data: Any)
class gretel_client.data_designer.params.PersonSamplerParams(*, locale: str | None = 'en_US', sex: Sex | None = None, city: str | List[str] | None = None, age_range: List[int] | None = [18, 114], state: str | List[str] | None = None, **extra_data: Any)
class gretel_client.data_designer.params.PoissonSamplerParams(*, mean: float, **extra_data: Any)
class gretel_client.data_designer.params.ScipySamplerParams(*, dist_name: str, dist_params: Dict[str, Any], **extra_data: Any)
class gretel_client.data_designer.params.SubcategorySamplerParams(*, category: str, values: Dict[str, List[str | int | float]], **extra_data: Any)
class gretel_client.data_designer.params.TimeDeltaSamplerParams(*, dt_min: int, dt_max: int, reference_column_name: str, unit: Unit1 | None = 'D', **extra_data: Any)
class gretel_client.data_designer.params.UUIDSamplerParams(*, prefix: str | None = None, short_form: bool | None = False, uppercase: bool | None = False, **extra_data: Any)
class gretel_client.data_designer.params.UniformDistribution(*, distribution_type: DistributionType | None = 'uniform', params: UniformDistributionParams, **extra_data: Any)
class gretel_client.data_designer.params.UniformDistributionParams(*, low: float, high: float, **extra_data: Any)
class gretel_client.data_designer.params.UniformSamplerParams(*, low: float, high: float, **extra_data: Any)
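
Example (a sketch of the subcategory sampler using the shorthand imports and product_category column from Getting Started; this assumes the category parameter names the parent sampler column, per the SubcategorySamplerParams signature above):

aidd.add_column(
    C.SamplerColumn(
        name="product_subcategory",
        type=P.SamplerType.SUBCATEGORY,
        params=P.SubcategorySamplerParams(
            category="product_category",
            values={
                "Electronics": ["Headphones", "Smartwatches"],
                "Clothing": ["Jackets", "Sneakers"],
                "Home & Kitchen": ["Cookware", "Blenders"],
                "Books": ["Fiction", "Non-fiction"],
                "Home Office": ["Desks", "Monitors"],
            },
        ),
    )
)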

βœ‹ Constraints

Constraints ensure that data generated by samplers follows relevant business rules, such as one column always being less than another.

class gretel_client.workflows.configs.tasks.ConstraintType
COLUMN_INEQUALITY = 'column_inequality'
SCALAR_INEQUALITY = 'scalar_inequality'
class gretel_client.workflows.configs.tasks.InequalityOperator
GE = 'ge'
GT = 'gt'
LE = 'le'
LT = 'lt'
class gretel_client.workflows.configs.tasks.ColumnConstraint(*, target_column: str, type: ConstraintType, params: ColumnConstraintParams, **extra_data: Any)
class gretel_client.workflows.configs.tasks.ColumnConstraintParams(*, operator: InequalityOperator, rhs: float | str, **extra_data: Any)
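
Example (a sketch passing the constraint params as a ColumnConstraintParams object, using the classes listed above and the number_of_stars column from Getting Started):

from gretel_client.workflows.configs.tasks import (
    ColumnConstraintParams,
    ConstraintType,
    InequalityOperator,
)

# Keep the star rating at or above 2.
aidd.add_constraint(
    target_column="number_of_stars",
    type=ConstraintType.SCALAR_INEQUALITY,
    params=ColumnConstraintParams(operator=InequalityOperator.GE, rhs=2),
)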