

Natural Language Processing (NLP) is the use of machine learning techniques to work with text. A natural language is one that has developed naturally through use, as opposed to an artificial language or computer code.

Transfer learning

  • The default learning rate for Adam is usually way too high for transfer learning. One thing you can do is give the optimizer a decaying learning rate, for example a polynomial decay schedule that lowers the rate from its initial value to a small final value over the course of training.
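A minimal sketch of a polynomial decay schedule in plain Python. The function name and the default values (an initial rate of 5e-5, a final rate of 0, 1000 decay steps) are assumptions chosen for illustration; the formula is the same one Keras uses for its `PolynomialDecay` schedule. With `power=1.0` the decay is linear.

```python
def polynomial_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=1000, power=1.0):
    """Learning rate at `step` under a polynomial decay schedule.

    All default values are illustrative assumptions, not recommendations.
    """
    # Clamp so the rate stays at end_lr once decay_steps has been reached.
    step = min(step, decay_steps)
    fraction_left = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * fraction_left ** power + end_lr


# The rate starts at initial_lr and shrinks towards end_lr.
schedule = [polynomial_decay(step) for step in (0, 250, 500, 750, 1000)]
```

For comparison, the Keras default learning rate for Adam is 1e-3, which is why a much smaller starting value is usually chosen when fine-tuning.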

HuggingFace Course on fine-tuning a pre-trained model


Attention Is All You Need is the paper that introduced the Transformer, the first model built entirely on attention mechanisms. The Transformer uses neither recurrence nor convolutions.
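The core operation behind attention is scaled dot-product attention: softmax(QKᵀ / √d_k) V. Below is a small sketch in plain Python using nested lists instead of tensors. The function names are made up for illustration, and a real model adds learned projections and multiple heads on top of this.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Attention for one head; each argument is a list of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Two identical keys get equal weight, so the output is the mean of the values.
result = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[2.0], [4.0]],
)
```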

Types of Transformer Models:

  1. Encoders: In the context of Transformer models, encoders take an input sequence and convert it into a series of representations that capture the contextual information from the sequence. The encoder processes the entire sequence at once, in parallel, and the output is one vector per input token. BERT is a good example of this type of model. Tasks: Sentence classification, named entity recognition, extractive question answering

  2. Decoders: In the original Transformer, the decoder uses the representations produced by the encoder to generate an output sequence. Decoder-only models such as GPT-2 have no encoder and condition only on the tokens seen so far. Training can process all positions in parallel, but generation is autoregressive: each output token is produced one at a time, conditioned on the previously generated tokens. GPT-2 is a good example of this type of model. Tasks: Text generation

  3. Sequence-to-sequence: Sequence-to-sequence (seq2seq) is a concept in machine learning where a model is trained to convert sequences from one domain (input) to sequences in another domain (output). In the context of Transformer models, this often involves an encoder transforming an input sequence into a context-sensitive representation, and a decoder generating an output sequence from that representation.

    1. The encoder takes care of understanding the sequence.
    2. The decoder takes care of generating a sequence according to the understanding of the encoder.

    Tasks: Summarization, translation, generative question answering
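One way to picture the difference between the encoder and decoder styles described above is through which positions each token is allowed to attend to. The sketch below builds the two visibility patterns as nested lists of 0s and 1s; the function names are illustrative and not taken from any library.

```python
def full_mask(seq_len):
    """Encoder-style: every position can attend to every other position."""
    return [[1] * seq_len for _ in range(seq_len)]


def causal_mask(seq_len):
    """Decoder-style: position i can attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


encoder_view = full_mask(3)
decoder_view = causal_mask(3)
# decoder_view is lower-triangular:
# [[1, 0, 0],
#  [1, 1, 0],
#  [1, 1, 1]]
```

The lower-triangular pattern is what makes a decoder autoregressive: a token never sees the tokens that come after it.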

Using transformers

The 🤗 Transformers API handles multiple sequences and sequences of different lengths. Batching sends multiple sequences through the model at once, allowing efficient computation. Padding ensures that sequences within a batch have the same length by adding a special padding token to shorter sequences. Attention masks instruct the model to ignore the padding tokens during computation.

Models have a maximum supported sequence length, typically around 512 or 1024 tokens. If longer sequences need to be processed, either a model with a longer supported length can be used or the sequences can be truncated to fit within the maximum length. Truncation shortens a sequence by removing tokens from its beginning or end.

An attention mask is a binary tensor used in Transformer models to indicate which tokens should be attended to (value of 1) and which tokens should be ignored (value of 0) during computation. It ensures that padding tokens or other irrelevant tokens do not influence the attention mechanism, allowing the model to focus on the meaningful parts of the input sequence.
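The padding, truncation, and attention-mask steps can be sketched in plain Python. This is a hand-rolled illustration of what a tokenizer does for you, not the 🤗 API; `prepare_batch`, the padding id of 0, and the token ids are all assumptions made up for the example.

```python
def prepare_batch(sequences, max_length, pad_id=0):
    """Truncate, pad, and build attention masks for a batch of token-id lists."""
    # Truncation: drop tokens from the end of anything longer than max_length.
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)

    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        # Padding: extend shorter sequences so every row has the same length.
        input_ids.append(seq + [pad_id] * n_pad)
        # Mask: 1 for real tokens, 0 for padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token ids; the second sequence is longer than max_length.
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102, 55, 66]]
ids, mask = prepare_batch(batch, max_length=5)
# ids  -> [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
# mask -> [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```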

The sequence length refers to the number of tokens in a given input sequence. It determines the length of the input tensors and influences the computational resources required for processing the sequence.

The batch size refers to the number of sequences that are processed in parallel. It allows for efficient computation by utilizing parallelism and vectorized operations on modern hardware.

The hidden size refers to the dimensionality of the hidden representations in the Transformer model. It determines the size of the intermediate and output tensors throughout the model layers and is a key factor in the model's capacity to capture complex patterns and dependencies in the input data. The hidden size can be very large (768 is common for smaller models, and in larger models, this can reach 3072 or more).
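To make the three quantities concrete: a base Transformer model returns one hidden vector per token, so its output has shape (batch size, sequence length, hidden size). The helper below only computes that shape from a padded batch; its name and the token ids are made up for illustration.

```python
def hidden_state_shape(input_ids, hidden_size):
    """Shape of the model output for a padded batch of token ids."""
    batch_size = len(input_ids)
    sequence_length = len(input_ids[0])  # rows are assumed to be padded to equal length
    return (batch_size, sequence_length, hidden_size)


# Two sequences of three tokens each, with a hidden size of 768.
shape = hidden_state_shape([[101, 2023, 102], [101, 2003, 102]], hidden_size=768)
# shape -> (2, 3, 768)
```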

HuggingFace Transformer Class

The HuggingFace transformers library can be used to run common NLP tasks quickly.

Examples of models that can be applied


Official LangChain Documentation

LangChain is a framework for building applications on top of Large Language Models (LLMs). It provides out-of-the-box functionality for running inference with LLMs. I have not delved too much into it as of yet.

The "Chain" in LangChain refers to "chaining": the framework lets you chain large language models together or with other components.
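The idea of chaining can be sketched without any library at all: each step takes the previous step's output as its input. This is not the LangChain API; `make_chain`, `build_prompt`, `fake_llm`, and `strip_brackets` are hypothetical names, and `fake_llm` is a stand-in that just echoes its prompt.

```python
def make_chain(*steps):
    """Compose steps so each one receives the previous step's output."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


def build_prompt(question):
    return f"Answer briefly: {question}"


def fake_llm(prompt):
    # Stand-in for a real model call; it only echoes the prompt.
    return f"[model reply to: {prompt}]"


def strip_brackets(text):
    # Post-processing step: remove the surrounding square brackets.
    return text.strip("[]")


chain = make_chain(build_prompt, fake_llm, strip_brackets)
answer = chain("What is NLP?")
# answer -> "model reply to: Answer briefly: What is NLP?"
```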


Outlines is a framework for generating structured data from LLMs. It works by using constraints such as regular expressions both as a check on what counts as a match and as a template for what the LLM may generate. LLMs work best when we can guide them in the prompt, and constraining the output with tools like regex makes the generated data conform to the expected format. This is an important feature that will probably dominate a lot of the work when it comes to LLMs.
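A rough sketch of the "regex as a conditional check" idea, using only Python's `re` module. This is not the Outlines API, and Outlines applies the pattern during generation rather than afterwards. The hard-coded strings stand in for model outputs, and `first_valid` is a hypothetical helper.

```python
import re

# The structure we want the output to follow: an ISO-style date.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def first_valid(candidates, pattern):
    """Return the first candidate that matches the pattern in full, else None."""
    for text in candidates:
        cleaned = text.strip()
        if pattern.fullmatch(cleaned):
            return cleaned
    return None


# Hard-coded stand-ins for what a model might return.
candidates = ["The date is March 5th.", " 2024-03-05 ", "2024-3-5"]
date = first_valid(candidates, DATE_PATTERN)
# date -> "2024-03-05"
```

Filtering after the fact wastes generations that do not match, which is part of the appeal of constraining the model while it generates.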

I think it is very important to understand that less powerful models are not that useful in an Outlines context.

  • Blog post from Outlines describing the background of Outlines
  • Outlines GitHub and docs
  • Hacker News discussions
  • Pydantic Example. This is Instructor, which I think is better than Outlines.

Use Cases


Sentiment Analysis


Named Entity Recognition



  • HuggingFace actually uses the 🤗 emoji to refer to themselves. Kinda cute.
  • The HuggingFace tokenizers are implemented in Rust for speed.
  • HuggingFace supports both TensorFlow and PyTorch; I wonder which one will win. Maybe we will have some way of standardizing the models into a new and better module, with standard components. Maybe this is the HuggingFace killer feature? There is also the JAX framework, which I do not know that much about.
  • tf.keras.losses.SparseCategoricalCrossentropy() is the standard Keras loss function for classification with integer class labels.
  • A model is called zero-shot because you don’t need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want!
  • Think of LLMs as mostly inscrutable artifacts, developing correspondingly sophisticated evaluations. - Andrej Karpathy
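The sparse categorical cross-entropy loss mentioned in the notes above can be written out by hand to show what it computes: the negative log of the probability assigned to the correct class, averaged over the batch. The sketch below is plain Python, not the Keras implementation, and it assumes the model outputs probabilities rather than logits; the labels and probabilities are made up.

```python
import math


def sparse_categorical_crossentropy(labels, probabilities):
    """Mean negative log-probability of the true class.

    `labels` are integer class ids; each row of `probabilities` sums to 1.
    """
    losses = [-math.log(probs[label]) for label, probs in zip(labels, probabilities)]
    return sum(losses) / len(losses)


# Two examples, three classes; the true class gets probability 0.5 both times.
labels = [0, 2]
probabilities = [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]]
loss = sparse_categorical_crossentropy(labels, probabilities)
# loss equals -log(0.5), i.e. log(2)
```

"Sparse" refers to the labels being plain integers rather than one-hot vectors.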