Data Processing Methods | Brief Introduction

Data Preprocessing

Data Processing Methods

Data preprocessing prepares a dataset for the task it will be used for. The preprocessing pipeline can differ considerably depending on the data you are working with.

The main steps are:

  • Data cleaning;
  • Dimensionality reduction;
  • Data conversion;
  • Integration of data from different sources into one dataset.
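
As a rough illustration, the four steps can be sketched on a toy dataset using only the Python standard library. The records, the column names, and the helper functions `integrate`, `clean`, `convert`, and `reduce_dimensions` are all hypothetical, and the order of the steps is just one possible choice. Dropping constant columns is used here as the simplest stand-in for dimensionality reduction; real projects usually rely on dedicated libraries.

```python
from statistics import pvariance

# Two hypothetical sources describing the same customers.
profiles = [
    {"id": 1, "city": " Paris ", "country": "FR"},
    {"id": 2, "city": "lyon", "country": "FR"},
    {"id": 2, "city": "lyon", "country": "FR"},  # duplicate record
]
orders = {1: 120.0, 2: 80.0}  # customer id -> total spend


def integrate(profiles, orders):
    """Integration: attach each profile's spend from the second source."""
    return [{**p, "spend": orders[p["id"]]} for p in profiles]


def clean(rows):
    """Cleaning: normalize the text field and drop exact duplicates."""
    seen, out = set(), []
    for row in rows:
        row = {**row, "city": row["city"].strip().lower()}
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


def convert(rows, column):
    """Conversion: replace a categorical column with integer codes."""
    codes = {v: i for i, v in enumerate(sorted({r[column] for r in rows}))}
    return [{**r, column: codes[r[column]]} for r in rows]


def reduce_dimensions(rows, columns):
    """Dimensionality reduction (simplest form): drop zero-variance columns."""
    constant = [c for c in columns if pvariance([r[c] for r in rows]) == 0]
    return [{k: v for k, v in r.items() if k not in constant} for r in rows]


rows = integrate(profiles, orders)
rows = clean(rows)
rows = convert(convert(rows, "city"), "country")
rows = reduce_dimensions(rows, ["city", "country", "spend"])
print(rows)
```

In this sketch the `country` column is removed at the end because every record has the same value, so it carries no information for a model.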

Each stage includes many methods. For example, data cleaning includes handling missing data: missing values can be replaced with synthetically generated values, the rows that contain them can be deleted, or the gaps can be filled with the average of the entire column, among other options.
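
Two of the strategies mentioned above, deleting rows and filling with the column average, can be sketched as follows. The data and the helper names `drop_missing` and `fill_with_mean` are made up for illustration, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"name": "Ann", "age": 25},
    {"name": "Bob", "age": None},
    {"name": "Cid", "age": 31},
    {"name": "Dee", "age": 40},
]


def drop_missing(rows, column):
    """Delete every row whose value in `column` is missing."""
    return [r for r in rows if r[column] is not None]


def fill_with_mean(rows, column):
    """Fill missing values in `column` with the mean of the observed ones."""
    observed = [r[column] for r in rows if r[column] is not None]
    column_mean = mean(observed)
    return [{**r, column: column_mean} if r[column] is None else r for r in rows]


print(drop_missing(rows, "age"))
print(fill_with_mean(rows, "age"))
```

Deleting rows keeps only values that were actually observed but shrinks the dataset, while mean filling keeps every row at the cost of introducing estimated values.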

Most models accept only processed data as input (images converted to matrices, text converted to vectors, etc.). Many modern models based on the Transformer architecture, such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer), are typically used through interfaces that accept raw text, so you do not have to convert the text into a numerical representation yourself.

Despite this, the text is still transformed into a numerical representation before the model processes it, because mathematical operations can only be performed on numbers.
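
To make this concrete, here is a deliberately simplified sketch of how text can be mapped to integer IDs. The functions `build_vocab` and `encode` and the `<unk>` placeholder are illustrative assumptions. BERT and GPT models use subword tokenizers (such as WordPiece or byte-pair encoding) rather than the whitespace splitting shown here, and they then map each ID to a vector.

```python
def build_vocab(texts):
    """Assign an integer ID to every distinct lowercase word."""
    vocab = {"<unk>": 0}  # ID 0 is reserved for unknown words
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Convert a text into a list of integer IDs using the vocabulary."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


vocab = build_vocab(["data cleaning matters", "clean data helps models"])
ids = encode("data helps everyone", vocab)
print(ids)
```

The word "everyone" is not in the vocabulary, so it falls back to the `<unk>` ID. Subword tokenizers reduce this problem by splitting unfamiliar words into smaller known pieces.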

Section 1. Chapter 2