Transfer Learning Essentials

Feature Extraction with BERT

To use BERT as a feature extractor for sentiment analysis, you start by leveraging its pre-trained model to generate sentence embeddings: dense vector representations of your input text. These embeddings can then be used as features for downstream machine learning models, such as classifiers, without updating BERT's weights. The typical workflow begins with preparing your text data and loading a pre-trained BERT model. You then tokenize your sentences, encode them with BERT, and extract the embeddings from one of BERT's layers (often the pooled output or the [CLS] token representation). These embeddings serve as rich, contextualized features that capture the semantic meaning of your sentences, which you can feed into a classifier such as logistic regression or a support vector machine for sentiment prediction.

# Using BERT as a frozen feature extractor (illustrative example)
from transformers import BertTokenizer, BertModel
import torch
from sklearn.linear_model import LogisticRegression

# Load the pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')

A pre-trained BERT model already contains rich contextual knowledge about language. The tokenizer converts raw text into tokens that match BERT's vocabulary, while the model turns those tokens into dense vector representations. We use the uncased version of BERT, which ignores letter casing (so "Movie" and "movie" are treated the same).
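
As a quick check (a minimal sketch reusing the tokenizer loaded above; the printed tokens are illustrative), you can see that the uncased tokenizer lowercases text before looking up tokens in its vocabulary:

# Illustrative check: the uncased tokenizer lowercases before tokenizing
print(tokenizer.tokenize("Movie night!"))   # e.g. ['movie', 'night', '!']
print(tokenizer.tokenize("movie night!"))   # same tokens, since casing is ignored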

# Example input sentences
sentences = ["I love this movie!", "This film was terrible."]

# Tokenize and encode sentences
inputs = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)

Each sentence is tokenized and padded to the same length so that they can be processed together in a single batch, while truncation=True cuts off anything longer than BERT's maximum input length of 512 tokens. The return_tensors='pt' argument makes the tokenizer return PyTorch tensors ready for the BERT model.
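
As a sanity check (a minimal sketch reusing the inputs object from above; the exact sequence length depends on the tokenizer), you can inspect the shapes of the encoded batch:

# Inspect the encoded batch; both sentences share one padded tensor
print(inputs['input_ids'].shape)       # e.g. torch.Size([2, 7]): 2 sentences, equal length
print(inputs['attention_mask'].shape)  # same shape; 1 marks real tokens, 0 marks padding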

# Extract features with BERT (no gradient calculation needed)
with torch.no_grad():
    outputs = bert(**inputs)
    # Use the [CLS] token embedding as the sentence representation
    features = outputs.last_hidden_state[:, 0, :].numpy()

torch.no_grad() disables gradient computation because we are not training BERT; we're only using it to produce embeddings. The [CLS] token (the first token of each input) is designed to represent the entire sentence, so its final hidden state serves as a compact feature vector for classification. The resulting features array has shape (batch_size, hidden_size): one vector per sentence.
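
Besides the [CLS] vector, two common alternatives are BERT's built-in pooler_output and mean pooling over the token embeddings. The sketch below (not part of the original example; it reuses the outputs and inputs objects from above) shows the mean-pooling variant:

# Alternative: average the token embeddings, ignoring padding positions
mask = inputs['attention_mask'].unsqueeze(-1)           # (batch_size, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # sum over real-token vectors
mean_features = (summed / mask.sum(dim=1)).numpy()      # (batch_size, hidden_size), e.g. (2, 768)
# outputs.pooler_output is another option: the [CLS] vector passed through a tanh layer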

# Use features as input for a classifier (e.g., logistic regression)
labels = [1, 0]  # 1 = positive, 0 = negative
clf = LogisticRegression()
clf.fit(features, labels)

These extracted embeddings can be used like any other numerical features. Here, a simple logistic regression classifier is trained to distinguish positive and negative sentiment using the BERT-derived sentence representations. This demonstrates how BERT can be used as a frozen feature extractor, providing rich language features without fine-tuning the full transformer model.
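
To classify new text, the same tokenize-and-embed steps are repeated before calling the classifier. This is a minimal sketch assuming the tokenizer, bert, and clf objects defined above; the predicted label is illustrative:

# Classify a new sentence with the same frozen-BERT pipeline
new_inputs = tokenizer(["What a fantastic story!"], return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    new_features = bert(**new_inputs).last_hidden_state[:, 0, :].numpy()
print(clf.predict(new_features))  # e.g. [1] for positive sentiment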

Note

When your dataset is small or training resources are limited, using BERT as a feature extractor is often more practical than fine-tuning. This approach allows you to benefit from BERT's language understanding without the computational cost and risk of overfitting associated with updating all its parameters.
