Visualization in Python with matplotlib
Another common use of scatter plots is cluster analysis: checking whether groups of observations are related and whether the observations can be statistically divided into groups.
As we mentioned earlier, the `c` parameter of the `.scatter()` function can be an array of colors with the same length as the point data. This means we can assign each observation its own color (or, through the `s` parameter, its own size). In this chapter, we will assign colors according to the region a country is located in.
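As a minimal sketch of per-point colors and sizes, the snippet below passes one color and one size for each observation. The values are made up for illustration and are not the chapter's dataset; calling `.scatter()` on an axes object (`ax`) is also an assumption about how the plot is set up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up sample values, not the chapter's dataset
x = [1, 2, 3, 4, 5, 6]
y = [52, 61, 67, 73, 79, 84]
colors = ['red', 'red', 'orange', 'orange', 'green', 'green']  # one color per point
sizes = [20, 40, 60, 80, 100, 120]                              # one size per point

fig, ax = plt.subplots()
# c and s must have the same length as x and y
points = ax.scatter(x, y, c=colors, s=sizes)
```

If `colors` or `sizes` had a different length than `x` and `y`, matplotlib would raise an error, which is why the arrays must match the point data.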
Let's expand the previous example. Assume we want to split the data into categories based on the values of the `'life exp'` column (life expectancy). The minimum value in this column is a little over 50, and the maximum is a little less than 85. Using the `cut()` function of `pandas`, we can create a new column in the `data` dataframe and then color each point according to its group. To do that, we use a `for` loop to iterate over the groups, with a predefined list of colors. In each step, we call the `.scatter()` function for the current group.
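The steps described above can be sketched roughly as follows. This is an illustration under assumptions, not the lesson's exact listing: the `data` dataframe here is a small made-up stand-in, and the `'gdp per capita'` column, the `'group'` column name, the bin limits, the group names, and the colors are all hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the chapter's `data` dataframe
data = pd.DataFrame({
    'gdp per capita': [1200, 3500, 8000, 15000, 30000, 45000],  # hypothetical column
    'life exp': [52.1, 58.4, 63.0, 68.5, 78.9, 84.2],
})

# Hypothetical bin limits, group names, and colors
bins = [50, 60, 70, 85]
groups = ['50-60', '60-70', '70-85']
colors = ['red', 'orange', 'green']

# New column holding the group of each observation
data['group'] = pd.cut(data['life exp'], bins=bins, labels=groups)

fig, ax = plt.subplots()
# One .scatter() call per group, each with its own color and legend label
for group, color in zip(groups, colors):
    subset = data[data['group'] == group]
    ax.scatter(subset['gdp per capita'], subset['life exp'], c=color, label=group)
ax.legend()
```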
There we used the `cut()` function of `pandas` with 3 arguments:
- the first is the data we want to split into groups;
- the second, `bins`, is the bin limits;
- `labels` is the names of the groups (if no value is set, each group is labeled with its interval, such as `(50, 60]`; with `labels=False`, integers starting from 0 are used).
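To make the effect of the `labels` argument concrete, here is a small sketch comparing three ways of calling `pd.cut()`. The values, bin limits, and group names are made up for illustration.

```python
import pandas as pd

life_exp = pd.Series([52.1, 63.0, 78.9])  # made-up values
bins = [50, 60, 70, 85]                   # hypothetical bin limits

# Explicit group names
named = pd.cut(life_exp, bins=bins, labels=['low', 'medium', 'high'])
# No labels: each value is labeled with its interval, such as (50, 60]
intervals = pd.cut(life_exp, bins=bins)
# labels=False: each value gets the integer index of its bin, starting from 0
codes = pd.cut(life_exp, bins=bins, labels=False)

print(named.tolist())   # ['low', 'medium', 'high']
print(intervals.tolist())
print(codes.tolist())   # [0, 1, 2]
```

By default the bins include their right edge, so a value of exactly 60 would fall into the first group.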
Also, we used the `zip()` function. This function is convenient when, for example, we want to iterate over two lists of the same length in parallel. Since we iterated over two lists, the loop used two variables, one for each list.
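A short standalone sketch of `zip()` with two lists of the same length; the group names and colors below are hypothetical.

```python
groups = ['50-60', '60-70', '70-85']  # hypothetical group names
colors = ['red', 'orange', 'green']   # hypothetical colors

# zip() pairs the elements of both lists position by position
pairs = list(zip(groups, colors))

# Two loop variables, one for each list
for group, color in zip(groups, colors):
    print(group, color)
```

If the lists had different lengths, `zip()` would stop at the end of the shorter one, so it is worth checking that every group has a color.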
Within each `.scatter()` call we set the `label` parameter so that we can display the legend (don't forget about the