Feature Extraction with BERT

To use BERT as a feature extractor for sentiment analysis, you start by leveraging its pre-trained model to generate sentence embeddings—dense vector representations of your input text. These embeddings can then be used as features for downstream machine learning models, such as classifiers, without updating BERT’s weights. The typical workflow begins with preparing your text data and loading a pre-trained BERT model. You then tokenize your sentences, encode them using BERT, and extract the embeddings from one of BERT's layers (often the pooled output or the [CLS] token representation). These embeddings serve as rich, contextualized features that capture the semantic meaning of your sentences, which you can input into a classifier like logistic regression or support vector machines for sentiment prediction.

# Using BERT as a feature extractor (illustrative example)
from transformers import BertTokenizer, BertModel
import torch
from sklearn.linear_model import LogisticRegression

# Load the pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')

A pre-trained BERT model already contains rich contextual knowledge about language. The tokenizer converts raw text into tokens that match BERT’s vocabulary, while the model turns those tokens into dense vector representations. We use the uncased version of BERT, which ignores letter casing (so “Movie” and “movie” are treated the same).
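
As a quick illustration of what the uncased tokenizer does, the short sketch below (reusing the tokenizer loaded above) prints the tokens produced for differently cased inputs:

# Quick check: the uncased tokenizer lowercases text before splitting it into vocabulary tokens
print(tokenizer.tokenize("Movie"))               # ['movie']
print(tokenizer.tokenize("movie"))               # ['movie']
print(tokenizer.tokenize("I love this movie!"))  # ['i', 'love', 'this', 'movie', '!']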

# Example input sentences
sentences = ["I love this movie!", "This film was terrible."]

# Tokenize and encode sentences
inputs = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)

Each sentence is tokenized and padded to the same length so that they can be processed together in a single batch. The return_tensors='pt' argument makes the tokenizer return PyTorch tensors ready for the BERT model.
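
If you want to inspect what the tokenizer produced, a minimal sketch like the one below (using the inputs dictionary created above) prints the tensor shape and the attention mask; the exact sequence length depends on the longest sentence in the batch:

# input_ids: one row of token IDs per sentence, padded to the longest sentence in the batch
print(inputs['input_ids'].shape)    # (batch_size, max_sequence_length)
# attention_mask: 1 marks real tokens, 0 marks padding positions
print(inputs['attention_mask'])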

# Extract features with BERT (no gradient calculation needed)
with torch.no_grad():
    outputs = bert(**inputs)
    # Use the [CLS] token embedding as the sentence representation
    features = outputs.last_hidden_state[:, 0, :].numpy()

torch.no_grad() disables gradient computation because we are not training BERT — we’re only using it to produce embeddings. The [CLS] token (the first token of each input) is designed to represent the entire sentence, so its final hidden state serves as a compact feature vector for classification. The output tensor shape is (batch_size, hidden_size) — one vector per sentence.
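
The [CLS] vector is a common default, but not the only option. One widely used alternative, sketched below with the same inputs and outputs objects as above, is mean pooling over all token embeddings, where the attention mask ensures that padding positions do not contribute; BertModel also exposes outputs.pooler_output (a tanh-transformed projection of the [CLS] state) as another ready-made sentence vector.

# Alternative sentence representation: mean pooling over real (non-padding) tokens
mask = inputs['attention_mask'].unsqueeze(-1)            # shape (batch_size, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)   # sum embeddings of real tokens only
mean_features = (summed / mask.sum(dim=1)).numpy()       # shape (batch_size, hidden_size)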

# Use features as input for a classifier (e.g., logistic regression)
labels = [1, 0]  # 1 = positive, 0 = negative
clf = LogisticRegression()
clf.fit(features, labels)

These extracted embeddings can be used like any other numerical features. Here, a simple logistic regression classifier is trained to distinguish positive and negative sentiment using the BERT-derived sentence representations. This demonstrates how BERT can be used as a frozen feature extractor, providing rich language features without fine-tuning the full transformer model.
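
To classify unseen text, new sentences go through exactly the same steps. The sketch below reuses the tokenizer, bert, and clf objects defined above; the example sentence is made up for illustration:

# Classify a new sentence with the same feature-extraction pipeline
new_inputs = tokenizer(["What a wonderful film!"], return_tensors='pt',
                       padding=True, truncation=True)
with torch.no_grad():
    new_outputs = bert(**new_inputs)
new_features = new_outputs.last_hidden_state[:, 0, :].numpy()
print(clf.predict(new_features))  # e.g., [1] for a positive prediction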

Note

When your dataset is small or training resources are limited, using BERT as a feature extractor is often more practical than fine-tuning. This approach allows you to benefit from BERT’s language understanding without the computational cost and risk of overfitting associated with updating all its parameters.
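
If you later embed BERT inside a larger PyTorch model but still want to keep it frozen, a common pattern (sketched below; any surrounding downstream layers are assumed, not shown) is to disable gradients for its parameters:

# Freeze BERT's weights so only the downstream layers would be trained
for param in bert.parameters():
    param.requires_grad = False
bert.eval()  # keep the model in evaluation mode so dropout stays disabled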
