Course Content

Visualization in Python with matplotlib

Grouping Observations

Another common use of scatter plots is cluster analysis: checking whether the observations fall into distinct groups and whether those groups can be separated statistically.

As we mentioned earlier, the c parameter of the .scatter() function can be an array of colors with the same length as the point data. This means we can give each observation its own color (or, through the s parameter, its own size). In this chapter, we will assign colors according to the region a country is located in.
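To make the one-color-per-point idea concrete, here is a minimal sketch. The coordinates and colors are made up for illustration and do not come from the course dataset. The Agg backend and the output file name are assumptions so that the snippet runs without a display; in a notebook you would typically call plt.show() instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

# Made-up coordinates, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]

# One color per point; this list must have the same length as x and y
colors = ['red', 'red', 'green', 'blue', 'blue']

# Passing the list through c colors each point individually
points = plt.scatter(x, y, c=colors)
plt.savefig('colors_per_point.png')  # hypothetical output file name
```

If the colors list were shorter or longer than the data, matplotlib would raise an error, which is why the two must match in length.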

Let's expand the previous example. Assume we want to split the data into categories based on the values of the 'life exp' column (life expectancy). The minimum value in this column is a little over 50, and the maximum is a little under 85. Using the pandas cut() function, we can create a new column in the data dataframe and then plot each point according to its group. To do that, we predefine a list of colors, iterate over the groups with a for loop, and call the .scatter() function once per group.
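The steps just described can be sketched as follows. This is only an illustration under stated assumptions: the dataframe values are placeholders rather than real statistics, the 'gdp per capita' column, the bin limits, the group names, and the colors are all hypothetical choices, and the Agg backend with savefig stands in for plt.show() so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values, not real statistics; 'gdp per capita' is a hypothetical column
data = pd.DataFrame({
    'gdp per capita': [1.2, 3.5, 8.0, 15.0, 22.0, 30.0, 41.0, 55.0],
    'life exp': [52.0, 58.5, 63.0, 68.0, 72.5, 76.0, 80.5, 84.0],
})

# Bin limits covering the 50-85 range, plus one name and one color per bin
bins = [50, 60, 70, 80, 90]
group_names = ['50-60', '60-70', '70-80', '80-90']
colors = ['red', 'orange', 'green', 'blue']

# New column holding the group each observation falls into
data['group'] = pd.cut(data['life exp'], bins, labels=group_names)

# One .scatter() call per group, each with its own color and legend label
for name, color in zip(group_names, colors):
    subset = data[data['group'] == name]
    plt.scatter(subset['gdp per capita'], subset['life exp'], c=color, label=name)

plt.xlabel('gdp per capita')
plt.ylabel('life exp')
plt.legend()
plt.savefig('life_exp_groups.png')  # hypothetical output file name
```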

Here the pandas cut() function takes three arguments:

  • the first is the data we want to split into groups;
  • the second is the bin limits;
  • labels gives the names of the groups (if it is not set, the interval of each bin, such as (50, 60], is used as the label; with labels=False, integer codes starting from 0 are returned).
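The effect of the labels argument is easiest to see on a tiny input. The sketch below uses three made-up values and hypothetical group names, and calls pd.cut() three ways: with names, with labels omitted, and with labels=False.

```python
import pandas as pd

# Three made-up life expectancy values, for illustration only
values = pd.Series([52.0, 67.5, 84.0])
bins = [50, 60, 70, 80, 90]

# Explicit names: one label per bin (one fewer than the number of bin limits)
named = pd.cut(values, bins, labels=['low', 'medium', 'high', 'very high'])

# labels omitted: each value is labeled with the interval it falls into
intervals = pd.cut(values, bins)

# labels=False: integer bin codes, counted from 0
codes = pd.cut(values, bins, labels=False)

print(named.tolist())   # ['low', 'medium', 'very high']
print(intervals)
print(codes.tolist())   # [0, 1, 3]
```

By default the bins are closed on the right, so a value of exactly 60 would fall into the first bin rather than the second.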

We also used the zip() function, which is convenient when we want to iterate over two lists of the same length in parallel. Since we iterated over two lists, the loop has two variables, one per list. In each .scatter() call we set the label parameter so that a legend can be displayed (don't forget to call plt.legend()).
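As a small standalone illustration of zip(), the sketch below pairs two hypothetical lists, one of group names and one of colors, and walks through them together. The names and colors are assumptions chosen for the example.

```python
# Hypothetical group names and colors, one color per group
group_names = ['50-60', '60-70', '70-80', '80-90']
colors = ['red', 'orange', 'green', 'blue']

pairs = []
# zip() yields one (name, color) pair per step, so the loop needs two variables
for name, color in zip(group_names, colors):
    pairs.append((name, color))
    print(name, '->', color)

# If the lists differ in length, zip() stops at the end of the shorter one
shorter = list(zip([1, 2, 3], ['a', 'b']))
print(shorter)  # [(1, 'a'), (2, 'b')]
```

Because zip() silently drops the unmatched items, it is worth checking that the list of colors has exactly one entry per group before plotting.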
