The Art of AI Text: How LLMs Choose the Next Word Using Greedy, Beam Search, and Sampling

Methods LLMs use to select the next token given some input context
Gen AI
LLM
Author

Sushobhon Karmakar

Published

July 1, 2025

Introduction

Have you ever wondered how your phone magically suggests the next word as you type? Or how chatbots string together coherent sentences? The secret lies in something called next token generation.

Think of it this way: imagine telling a story one word at a time. After saying “The big,” what might come next? Perhaps “dog,” “house,” or “tree.” Next token generation teaches computers to do exactly this — predict the most likely word (or sometimes a piece of a word, called a “token”) that should follow the current sequence.

For instance, if you type “Thank you for your,” a next token generation model might suggest “help” as the most likely next word. It’s like having a super-smart autocomplete feature!

In this blog, we’ll pull back the curtain to explore the fascinating techniques behind this technology. We’ll dive into some code to see how it works and compare our results with the leading transformer libraries.

Get ready to explore the world of language prediction, where we’ll examine five popular methods that make next token generation possible: Greedy Search, Beam Search, Top-k Sampling, Top-p (Nucleus) Sampling, and Temperature Control. By the end, you’ll understand how machines learn to speak our language!

In this blog, we’ll use the gpt2-medium model to predict token probabilities, though you’re welcome to experiment with other models.
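
If you'd like to follow along, here is a minimal setup sketch (assuming the transformers and torch packages are installed) that loads gpt2-medium and defines the model, tokenizer, and device objects the later snippets rely on:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the gpt2-medium tokenizer and model (any causal LM from the Hub can be swapped in)
model_name = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move the model to a GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)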

Temperature

Introducing randomness alone isn’t always sufficient for generating desired text. When the goal is to extract and present information based on a specific document, prioritizing tokens with the highest probability is often preferred for accuracy and coherence. However, when crafting creative content like blog posts, encouraging the model to explore more diverse and unexpected word choices can lead to richer and more engaging outputs.

This balance between predictability and creativity can be effectively controlled using a parameter called temperature. The temperature value adjusts the probability distribution of the predicted tokens. A low temperature makes the distribution sharper, increasing the likelihood of selecting high-probability tokens and thus resulting in more focused and deterministic output. Conversely, a high temperature flattens the probability distribution, giving lower-probability tokens a greater chance of being selected, thereby injecting more randomness and creativity into the generated text.

Let’s define a custom softmax function that incorporates this temperature parameter:

import torch.nn.functional as F

# Defining updated softmax for PyTorch tensors
def softmax_tensor(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """
    Applies the softmax function to a PyTorch tensor along the last dimension,
    optionally with a temperature scaling factor.

    Args:
        logits: The input PyTorch tensor of logits.
        temperature: A scaling factor for the logits (default: 1.0).

    Returns:
        A PyTorch tensor of the same shape as the input, with probabilities
        along the last dimension.
    """
    return F.softmax(logits / temperature, dim=-1)
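
To see the sharpening and flattening effect directly, here is a small sanity check on made-up logits for three hypothetical tokens (the printed values are approximate):

# Toy logits for three hypothetical tokens
logits = torch.tensor([2.0, 1.0, 0.1])

print(softmax_tensor(logits, temperature=0.5))  # sharper: roughly [0.86, 0.12, 0.02]
print(softmax_tensor(logits, temperature=1.0))  # standard softmax: roughly [0.66, 0.24, 0.10]
print(softmax_tensor(logits, temperature=2.0))  # flatter: roughly [0.50, 0.30, 0.19]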

Next, let's update the Top-p Search function to include the temperature parameter:

def top_p_search(prompt, max_length=50, p=1, temperature=1, show_option=False):
    """
    Predict next tokens using Top-p (Nucleus) Sampling with temperature.

    Args:
        prompt (str): Input text sequence.
        max_length (int): Number of next tokens to predict.
        p (float): Cumulative probability threshold.
        temperature (float): Controls the randomness of sampling.
        show_option (bool): If True, shows the candidate tokens at each step.
    """
    model.eval()
    
    # Tokenize input and move to device
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

    # output_tokens
    output_tokens = input_ids

    for _ in range(max_length):
      if show_option:
        print(f"\nFor {_+1} Token:")
        print("-"*100)

      # Get model predictions
      with torch.no_grad():
          outputs = model(output_tokens)
      
      # Extract logits for the last token and apply temperature-scaled softmax
      logits = outputs.logits[:, -1, :]
      probs = softmax_tensor(logits=logits, temperature=temperature + 1e-6)  # small epsilon keeps the division safe if temperature is set to 0
      
      # Keep the smallest set of tokens whose cumulative probability reaches the threshold p
      probs, indices = torch.sort(probs, dim=1, descending=True)
      cumulative_prob = torch.cumsum(probs[0], dim=0)
      top_probs = probs[:, :torch.sum(cumulative_prob <= p).item() + 1]
      top_indices = indices[:, :torch.sum(cumulative_prob <= p).item() + 1]

      # Re-normalizing the retained probabilities so they sum to 1
      top_probs_norm = top_probs / torch.sum(top_probs)

      # Sampling one token from the normalized top-p distribution
      selected_token_id = torch.multinomial(top_probs_norm[0], num_samples=1, replacement=True)
      selected_token = top_indices[0][selected_token_id.item()]
      
      # Reshape selected_token to have shape (1, 1)
      selected_token = selected_token.unsqueeze(0).unsqueeze(0) 

      # Appending the selected token to the running sequence
      output_tokens = torch.cat((output_tokens, selected_token), dim=1)
      
      if show_option:
        # Printing the top-p candidate tokens and their normalized probabilities
        print(f"Selected Token: {tokenizer.decode(selected_token.item())}\nTop-p candidate tokens are:\n")
        for index, probability in zip(top_indices.squeeze(0), top_probs_norm.squeeze(0)):
            print(f"{tokenizer.decode(index.item())} ({round(probability.item() * 100, 2)}%)")
      
    return tokenizer.decode(output_tokens[0], skip_special_tokens=True) 

# Example usage with very low temperature
input_text = "i love"
generated_text = top_p_search(input_text, max_length=10, p=0.7, temperature=0.1)  
print("-"*20 + "\nFinal generated Text is:\n" + generated_text + "\n" + "-"*20)

Text generated with a very low temperature:

--------------------
Final generated Text is:
i love the idea of a "super-hero" who
--------------------

As you can observe, with such a low temperature the first few tokens closely mirror the results obtained with our earlier deterministic approach. Let's now explore the effect of increasing the temperature.

input_text = "i love"
generated_text = top_p_search(input_text, max_length=10, p=0.7, temperature=1)  
print("-"*20 + "\nFinal generated Text is:\n" + generated_text + "\n" + "-"*20)
--------------------
Final generated Text is:
i love you no matter what," Trump wrote in a January
--------------------

Upon increasing the temperature to 1, we observe that the model begins to select more varied and less predictable words. However, it’s important to note that employing excessively high temperature values can lead to generated text that lacks coherence and meaning. A generally recommended and effective range for the temperature parameter is between 0 and 1.

input_text = "i love"
generated_text = top_p_search(input_text, max_length=10, p=0.7, temperature=0.7)  
print("-"*20 + "\nFinal generated Text is:\n" + generated_text + "\n" + "-"*20)
--------------------
Final generated Text is:
i love you so much.

MORGAN:
--------------------

Text generated with a temperature of 0.7 feels more natural. (Try playing with different temperature values.)

You can also see the list of candidate words at each step by setting the show_option parameter to True at different temperature values. (Try playing with that as well, and see if you notice any pattern between the number of candidate words and the temperature value.)
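
For example, a call like the following (the prompt and settings are purely illustrative) prints the candidate set kept by the top-p filter for each generated token:

# Inspecting the candidate tokens kept by top-p at each step
input_text = "i love"
generated_text = top_p_search(input_text, max_length=5, p=0.7, temperature=1.2, show_option=True)
print(generated_text)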

A special thank you to Rohan-Paul-AI for the inspiration behind this post, and thanks also to Koushik Khan for encouraging me to write it!