I am writing this post after attending a session on data analysis with R, held today by Matt Steele at West Virginia University. During the session, I was introduced to the book R for Data Science by Hadley Wickham, Mine Cetinkaya-Rundel, and Garrett Grolemund. Hadley Wickham is the chief scientist at RStudio (now Posit) and the creator of ggplot2, the plotting package used throughout this post, along with many other tidyverse packages. You can follow his work in the public R community.
With regard to R for Data Science, I would like to draw your attention to the supplemental materials that come with it, available on Hadley Wickham’s website.
As this month’s theme is data visualization, I would like to highlight an excerpt from these supplemental materials that plots the relationship between flipper length and body mass for three species of penguins, taking into account the islands they live on. Below is what this visual looks like.
The code for this plot is below:
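What follows is a sketch reconstructed from the description in this post, not copied from the book. It assumes the data comes from the palmerpenguins package, with columns named flipper_length_mm, body_mass_g, species, and island. The title, subtitle, and label strings are illustrative.

```r
library(ggplot2)
library(ggthemes)  # provides scale_color_colorblind()

# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  # Points: color encodes species, shape encodes island.
  geom_point(mapping = aes(color = species, shape = island)) +
  # Straight line of best fit from a linear model.
  geom_smooth(method = "lm") +
  # Illustrative label text (assumed, not taken from the post).
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo penguins",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species",
    shape = "Island"
  ) +
  # Colorblind-safe palette from ggthemes.
  scale_color_colorblind()

p
```

Printing p draws the plot. ggplot2 will warn that a couple of rows were dropped, because a few penguins have missing measurements.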
In the above code, after the data is specified to be the penguins dataset, the mapping argument of the ggplot() function defines how variables in the dataset are mapped to the aesthetics (visual properties) of the plot. The mapping argument is always defined with the aes() function, and the x and y arguments of aes() specify which variables to map to the x and y axes. In our case, flipper length is mapped to the x aesthetic and body mass to the y aesthetic.
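To see what the data and mapping arguments do on their own, the sketch below builds the plot object without any geom. It assumes the palmerpenguins version of the dataset and its column names. The object name base is only for illustration.

```r
library(ggplot2)

# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

base <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)

base          # draws labelled axes only, because no geom has been added yet
base$mapping  # shows the stored x and y aesthetic mappings
```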
Once x and y are specified for the plot, we are ready to define a geom, which is the geometrical object that a plot uses to represent data. The function geom_point() adds a layer of points to the plot, which creates a scatterplot. For the points to represent the species of penguins and the islands they live on, we specify color = species and shape = island inside aes().
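A minimal sketch of the scatterplot layer alone, again assuming the palmerpenguins data. Here color and shape are placed in the aes() call of geom_point() rather than in ggplot(), so they apply only to the points. A smoothing layer added later would then fit one line for all penguins instead of one per species.

```r
library(ggplot2)

# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

pts <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  # Aesthetics set here are local to the points layer.
  geom_point(mapping = aes(color = species, shape = island))

pts
```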
We also want to add a line of best fit based on a linear model, using geom_smooth() with method = "lm". We can then improve the labels of the plot using the labs() function to add a title, a subtitle, x- and y-axis labels, and legend titles.
To make the plot more accessible, we can adjust the color palette to be colorblind safe with the scale_color_colorblind() function from the ggthemes package.
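To preview the colors this scale draws from, one option is the colorblind_pal() helper in ggthemes, which returns a palette function. The sketch below asks it for three colors, one per species. The show_col() call comes from the scales package, which is installed alongside ggplot2.

```r
library(ggthemes)

# colorblind_pal() returns a function; calling it with 3 gives three hex codes.
pal <- colorblind_pal()(3)
pal

# Optional visual preview of the three swatches.
scales::show_col(pal)
```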
One variation of the plot is to make the line of best fit non-linear by calling geom_smooth() with no arguments, which fits a smoothed curve instead of a straight line.
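A sketch of this variation, assuming the palmerpenguins data. With no method given, geom_smooth() chooses a smoother automatically. For a dataset of this size it uses loess and prints a message saying so.

```r
library(ggplot2)

# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

curved <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species, shape = island)) +
  # No arguments: ggplot2 picks the smoothing method itself.
  geom_smooth()

curved
```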
Another variation is to split the plot into facets, where each subplot displays one subset of the data. To do this, we use the facet_wrap() function and pass it the island variable as a formula, ~island.
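A sketch of the faceted version, assuming the palmerpenguins data. Each island gets its own panel, so the example drops the shape mapping and uses color for species only.

```r
library(ggplot2)

# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

faceted <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species)) +
  # One panel per island.
  facet_wrap(~island)

faceted
```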
We can also make density plots with geom_density(). We can use the alpha aesthetic to add transparency to the filled density curves. This aesthetic ranges from 0 (completely transparent) to 1 (completely opaque), and it is set to 0.5 in the following plot.
If you are interested, please visit the supplemental materials page for R for Data Science.
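A sketch of such a density plot, assuming the palmerpenguins data. The post does not say which variable is plotted, so the choice of body_mass_g, with color and fill mapped to species, is an assumption.

```r
library(ggplot2)

# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

dens <- ggplot(
  penguins,
  aes(x = body_mass_g, color = species, fill = species)
) +
  # alpha is set outside aes(), so it is a fixed value for every curve.
  geom_density(alpha = 0.5)

dens
```

Because alpha is a constant here and not mapped to a variable, it goes directly in geom_density() instead of inside aes().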
CodeChat
This month’s CodeChat will be held on August 25 at 5pm EST. This session will focus on data visualization, and as always you will have the opportunity to connect with other industry professionals and gain valuable market insights while learning some coding techniques. You may sign up here for a meeting reminder, or use the meeting link here if you would like to join directly.
Feedback
You can also use this message board to share any feedback you may have with me.