 Statistical Operations
Statistical operations on arrays are essential for data analysis and machine learning. NumPy provides functions and array methods that compute them efficiently.
Measures of Central Tendency
Measures of central tendency represent a central or representative value within a probability distribution. Most of the time, however, you will calculate these measures for a certain sample.
Here are the two main measures:
- Mean: the sum of all values divided by the total number of values;
- Median: the middle value in a sorted sample.
NumPy provides mean() and median() functions for calculating the mean and median, respectively:
```python
import numpy as np
sample = np.array([10, 25, 15, 30, 20, 10, 2])
# Calculating the mean
sample_mean = np.mean(sample)
print(f'Sorted sample: {np.sort(sample)}')
# Calculating the median
sample_median = np.median(sample)
print(f'Mean: {sample_mean}, median: {sample_median}')
```
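We also displayed the sorted sample so you can clearly see the median. Our sample has an odd number of elements (7), so the median is simply the middle element of the sorted sample: the element at zero-based index (n - 1) / 2, which is the ((n + 1) / 2)-th element when counting from 1, where n is the size of the sample. Here n = 7, so the median is the element at index 3.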
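As a quick check of the two definitions, the sketch below recomputes the mean and the median by hand and prints them next to NumPy's results. The names `manual_mean` and `manual_median` are illustrative only, and the index `(n - 1) // 2` assumes Python's zero-based indexing and an odd sample size.

```python
import numpy as np

sample = np.array([10, 25, 15, 30, 20, 10, 2])
sorted_sample = np.sort(sample)
n = sample.size

# Mean: the sum of all values divided by the number of values
manual_mean = sample.sum() / n
# Median (odd n): the middle element of the sorted sample
manual_median = sorted_sample[(n - 1) // 2]

print(manual_mean, np.mean(sample))
print(manual_median, np.median(sample))
```

Both lines should print matching values, which shows that `np.mean()` and `np.median()` follow the definitions given above.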
Note
When the sample has an even number of elements, the median is the average of the elements at index n / 2 and n / 2 - 1 in the sorted sample.
```python
import numpy as np
sample = np.array([1, 2, 8, 10, 15, 20, 25, 30])
sample_median = np.median(sample)
print(f'Median: {sample_median}')
```
Our sample is already sorted and has 8 elements, so n / 2 - 1 = 3 and sample[3] is 10. n / 2 = 4 and sample[4] is 15. Therefore, our median is (10 + 15) / 2 = 12.5.
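The same arithmetic can be sketched in code. This is only an illustration of the even-sized case, with hypothetical variable names (`lower_middle`, `upper_middle`, `manual_median`); in practice `np.median()` handles both the odd and the even case for you.

```python
import numpy as np

sample = np.array([1, 2, 8, 10, 15, 20, 25, 30])
sorted_sample = np.sort(sample)
n = sample.size

# Two middle elements at zero-based indices n // 2 - 1 and n // 2
lower_middle = sorted_sample[n // 2 - 1]
upper_middle = sorted_sample[n // 2]
# Median (even n): the average of the two middle elements
manual_median = (lower_middle + upper_middle) / 2

print(manual_median, np.median(sample))
```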
Measures of Spread
Two measures of spread are variance and standard deviation. Variance measures how spread out the data is. It is equal to the average of the squared differences of each value from the mean.
Standard deviation is the square root of the variance. It provides a measure of how spread out the data is in the same units as the data.
NumPy has the var() function to calculate the variance and the std() function to calculate the standard deviation. By default, both divide by the number of elements n, which matches the definition above:
```python
import numpy as np
sample = np.array([10, 25, 15, 30, 20, 10, 2])
# Calculating the variance
sample_variance = np.var(sample)
# Calculating the standard deviation
sample_std = np.std(sample)
print(f'Variance: {sample_variance}, standard deviation: {sample_std}')
```
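To connect these functions to the definitions, the sketch below computes the variance and standard deviation step by step and compares them with `np.var()` and `np.std()`. The names `squared_diffs`, `manual_variance`, and `manual_std` are illustrative. The last line uses the `ddof` parameter, which is not covered in the text above: with `ddof=1`, NumPy divides by n - 1 instead of n, which is the usual convention when estimating variance from a sample.

```python
import numpy as np

sample = np.array([10, 25, 15, 30, 20, 10, 2])

# Variance: the average of the squared differences from the mean
squared_diffs = (sample - sample.mean()) ** 2
manual_variance = squared_diffs.mean()
# Standard deviation: the square root of the variance
manual_std = np.sqrt(manual_variance)

# Both comparisons should print True
print(np.isclose(manual_variance, np.var(sample)))
print(np.isclose(manual_std, np.std(sample)))

# ddof=1 divides by n - 1 instead of n
print(np.var(sample, ddof=1))
```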
Calculations in Higher Dimensional Arrays
All of these functions have a second parameter, axis. Its default value is None, which means that the measure is calculated over the flattened array, using all of its elements (even if the original array is 2D or higher dimensional).
You can also specify the exact axis along which to calculate the measure:
```python
import numpy as np
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
# Calculating the mean in a flattened array
print(np.mean(array_2d))
# Calculating the mean along axis 0
print(np.mean(array_2d, axis=0))
# Calculating the mean along axis 1
print(np.mean(array_2d, axis=1))
```
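One way to read the axis argument is that it names the dimension that gets collapsed. The sketch below, with the illustrative names `column_means` and `row_means`, prints the shape of each result: for a 2D array of shape (2, 3), axis=0 collapses the rows and leaves one value per column, while axis=1 collapses the columns and leaves one value per row.

```python
import numpy as np

array_2d = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)

column_means = np.mean(array_2d, axis=0)  # one value per column
row_means = np.mean(array_2d, axis=1)     # one value per row

print(column_means.shape, column_means)  # (3,) [2.5 3.5 4.5]
print(row_means.shape, row_means)        # (2,) [2. 5.]

# The same axis argument works for median(), var(), and std()
print(np.median(array_2d, axis=1))
```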
[Figure: structure of the exam_scores array used in the task]
You are analyzing the exam_scores array, a 2D array of simulated test scores for 2 students (2 rows) across 5 different exams (5 columns).
- Calculate the mean score for each student by specifying the second keyword argument.
- Calculate the median of all scores.
- Calculate the variance of all scores.
- Calculate the standard deviation of all scores.