Sampling Methods
================

This library provides several sampling methods for language models. Each method has its own characteristics and suits different use cases.

Temperature Sampling
--------------------

Temperature sampling adjusts the "sharpness" of the probability distribution:

- Low temperature (< 1.0): more deterministic outputs that concentrate on high-probability tokens
- High temperature (> 1.0): more random outputs, because the distribution is flattened

Temperature sampling works by dividing the logits by the temperature value before applying the softmax function:

.. math::

   p(x_i) = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}

where :math:`z_i` are the logits and :math:`T > 0` is the temperature. At :math:`T = 1` the distribution is unchanged.

**When to use:**

- For controlling the randomness/creativity of outputs
- As a base sampling method that can be combined with others

Top-K Sampling
--------------

Top-K sampling restricts sampling to the K most likely tokens at each step and filters out all others.

The algorithm:

1. Sort tokens by decreasing probability
2. Keep only the top K tokens
3. Renormalize the probabilities of these K tokens
4. Sample from this reduced set

**When to use:**

- When you want to eliminate low-probability tokens
- For more focused and coherent text generation
- When you need a simple method to reduce randomness

Top-P (Nucleus) Sampling
------------------------

Top-P sampling (also known as nucleus sampling) keeps the smallest set of tokens whose cumulative probability reaches or exceeds a threshold ``p``:

1. Sort tokens by decreasing probability
2. Keep adding tokens to the set until their cumulative probability reaches or exceeds ``p``
3. Renormalize the probabilities of tokens in this set
4. Sample from this dynamic set

**When to use:**

- For a more adaptive approach than Top-K
- To maintain diversity while removing very unlikely tokens
- In scenarios where the distribution varies significantly between steps

Min-P Sampling
--------------

Min-P sampling keeps all tokens whose probability is at least ``min_p`` times the probability of the most likely token:

1. Find the probability of the most likely token (``p_max``)
2. Keep all tokens whose probability is at least ``min_p * p_max``
3. Renormalize the probabilities of tokens in this set
4. Sample from this set

**When to use:**

- When a token's probability relative to the top candidate matters more than its rank
- For maintaining probability mass among relatively likely candidates
- As an alternative to Top-P when you want a threshold that scales with the model's confidence

Anti-Slop Sampling
------------------

Anti-Slop is a technique designed to improve the quality of generated text by detecting and preventing "slop" (low-quality, repetitive, or nonsensical content):

1. Apply backtracking at the word or phrase level when low-quality output is detected
2. Down-weight the probabilities of the problematic sequences
3. Retry with the adjusted probabilities

**When to use:**

- For higher-quality text generation
- To reduce repetition and nonsensical outputs
- In applications where output quality is critical

XTC (Exclude Top Choices) Sampling
----------------------------------

XTC sampling nudges the model away from its most predictable choices by excluding a percentage of the top-weighted tokens:

1. Sort tokens by decreasing probability
2. Exclude the top N% of tokens (by probability mass)
3. Renormalize the remaining tokens
4. Sample from this set

**When to use:**

- To enhance creativity and diversity
- When standard outputs are too predictable
- For applications requiring novel or surprising content

QAlign Sampling
---------------

QAlign is a test-time alignment method that uses Markov Chain Monte Carlo (MCMC) to improve model outputs based on a reward model.
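
As a rough illustration of the MCMC idea, the sketch below runs a Metropolis-Hastings chain over candidate sequences scored by a reward function. It is not this library's API: ``metropolis_hastings``, ``propose``, ``reward`` and ``beta`` are hypothetical names chosen for the example, and the acceptance test assumes the simplified rule ``min(1, exp(beta * (proposal_reward - current_reward)))``.

.. code-block:: python

   import math
   import random


   def metropolis_hastings(initial, propose, reward, beta=1.0, steps=100, rng=None):
       """Run a Metropolis-Hastings chain over sequences (illustrative sketch).

       `propose(sequence, rng)` returns a candidate sequence and
       `reward(sequence)` returns a number; both are hypothetical callbacks.
       """
       if rng is None:
           rng = random.Random()
       current = initial
       current_reward = reward(current)
       for _ in range(steps):
           proposal = propose(current, rng)
           proposal_reward = reward(proposal)
           # Acceptance probability: min(1, exp(beta * (r_proposal - r_current))).
           delta = beta * (proposal_reward - current_reward)
           accept_prob = 1.0 if delta >= 0 else math.exp(delta)
           if rng.random() < accept_prob:
               current, current_reward = proposal, proposal_reward
       return current

With a large ``beta`` the chain rarely accepts proposals that lower the reward; smaller values accept such proposals more often, which keeps the chain exploring.
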

This method is based on the research paper:

| **"Sample, Don't Search: Rethinking Test-Time Alignment for Language Models"**
| Gonçalo Faria, Noah A. Smith (2025)
| Paper: https://arxiv.org/abs/2504.03790

The algorithm works as follows:

1. Generate an initial sequence using the base language model
2. Perform MCMC steps with Metropolis-Hastings acceptance:

   a. Generate a proposal by resampling a portion of the sequence
   b. Compute rewards for the current and proposed sequences
   c. Accept the proposal with probability ``min(1, exp(β * (proposal_reward - current_reward)))``

3. Return the final sequence after the MCMC iterations

Unlike other test-time optimization methods that search for a single optimal output, QAlign converges to sampling from the optimal aligned distribution for each prompt as compute scales. This prevents over-optimization of imperfect reward models.

**When to use:**

- For aligning model outputs with specific objectives without fine-tuning
- When you have a reward model that can score text quality
- To improve model performance on specific tasks at inference time
- As an alternative to computationally expensive fine-tuning approaches

Beam Search
-----------

Beam search is a heuristic, pruned breadth-first search that keeps the top k most promising sequences at each step, where k is the beam width:

1. Start with the initial sequence
2. At each step:

   a. Expand each sequence with every possible next token
   b. Score each new sequence by its cumulative log probability
   c. Keep only the top k sequences

3. Return the best sequences after reaching ``max_length``

The beam width parameter controls how many sequences are maintained at each step. A larger beam width explores more possibilities but requires more computation.

**When to use:**

- For tasks requiring high-quality, deterministic outputs
- When you need several high-probability candidate sequences
- In scenarios where finding the most likely sequence is important
- For applications where you can afford the computational cost
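
The beam search procedure described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not this library's implementation: ``beam_search`` and the ``next_log_probs`` callback (which maps a sequence to a dictionary of next-token log probabilities) are hypothetical names, and ``max_length`` here counts generated steps rather than total sequence length.

.. code-block:: python

   def beam_search(next_log_probs, start, beam_width=3, max_length=5):
       """Return up to `beam_width` (sequence, score) pairs, best first.

       `next_log_probs(sequence)` is a hypothetical callback returning a dict
       {token: log_probability} for the next position.
       """
       beams = [(list(start), 0.0)]
       for _ in range(max_length):
           candidates = []
           for seq, score in beams:
               # Expand every kept sequence with every possible next token.
               for token, log_p in next_log_probs(seq).items():
                   candidates.append((seq + [token], score + log_p))
           if not candidates:
               break  # The model offered no continuation.
           # Keep only the `beam_width` highest-scoring sequences.
           candidates.sort(key=lambda item: item[1], reverse=True)
           beams = candidates[:beam_width]
       return beams

Scores are summed log probabilities, so longer sequences tend to score lower; practical implementations often add a length penalty, which this sketch leaves out.
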