Decision Trees & Random Forests

Our September 2020 meetup was on the topic of Decision Trees & Random Forests. It featured three presentations on tree-based models in R:

  1. Hierarchical clustering in R (Kavana Rudresh)

  2. treeheatr - an R package to create interpretable decision tree visualizations (Trang Le)

  3. Decision trees vs. random forests (Karla Fettich)

Hierarchical clustering in R

Our first presenter was Kavana Rudresh, Enterprise Business Intelligence Manager for Strategic Analytics at Comcast Corporation. Materials from Kavana’s presentation are available here.

Kavana explained some key concepts within clustering before focusing on hierarchical algorithms. Within hierarchical clustering, she discussed the distinction between bottom-up (agglomerative) and top-down (divisive) approaches, then moved on to how to measure similarity between observations and how to choose the number of clusters.

Kavana also walked us through a script that applies some of these techniques to mall customer data, and shared tips for clustering messy real-world data.
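For readers who want to try this at home, here is a minimal sketch of the agglomerative workflow in base R. The `customers` data frame below is a made-up stand-in for the mall customer data, not the dataset from Kavana's script.

```r
# Hypothetical stand-in for the mall customer data (two numeric features)
set.seed(1)
customers <- data.frame(
  annual_income  = c(rnorm(50, 40, 5), rnorm(50, 80, 5)),
  spending_score = c(rnorm(50, 30, 8), rnorm(50, 70, 8))
)

# Put the features on a common scale before computing distances
customers_scaled <- scale(customers)

# 1. Measure similarity: pairwise Euclidean distances
d <- dist(customers_scaled, method = "euclidean")

# 2. Build the tree bottom-up (agglomerative), here with Ward linkage
hc <- hclust(d, method = "ward.D2")

# 3. Inspect the dendrogram and cut it into a chosen number of clusters
plot(hc, labels = FALSE, hang = -1)
clusters <- cutree(hc, k = 2)
table(clusters)
```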

In the discussion period we talked a bit about how dendrograms can be misinterpreted; it is important to look at where leaves join rather than how they are arranged relative to each other. One suggestion was to visualize a baby’s crib mobile, where the branches can rotate without changing the structural relationship between the leaves.

treeheatr - an R package to create interpretable decision tree visualizations

Trang Le is a postdoctoral fellow with Jason Moore at the Computational Genetics Lab, University of Pennsylvania. She is the author and maintainer of five R packages and an active contributor to the automated machine learning tool TPOT.

Trang’s presentation was about the package treeheatr, which she authors and maintains. treeheatr creates interpretable decision tree visualizations that incorporate a heatmap of the data at the tree’s leaf nodes. The presentation slides are available here and a recording of the presentation is available here. Trang started by reviewing some other options for visualizing decision tree models before introducing treeheatr and showing how to use it. A vignette is available here.
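As a quick, unofficial taste of the package (not taken from Trang's slides), the sketch below calls treeheatr's heat_tree() function on the built-in iris data; heat_tree() fits a conditional inference tree and draws a heatmap of the observations beneath the leaf nodes.

```r
# install.packages("treeheatr")  # if not already installed
library(treeheatr)

# Fit a conditional inference tree on iris and draw it together with a
# heatmap of the individual observations grouped under the leaf nodes
heat_tree(iris, target_lab = "Species")
```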

You can learn more about treeheatr on the GitHub website, in the GitHub repository, and on CRAN!

Decision trees vs. random forests

Karla Fettich works as Head of Algorithm Development at Orchestrall, where she leads behavioral data analytics efforts and predictive model development for healthcare IT innovation. She is also an organizer of R-Ladies Philly!

Karla provided an introduction to random forests. Her slides are available here and a recording of her presentation is available here. Her presentation included some background on decision trees versus random forests, an explanation of how random forest algorithms work at a high level, and a discussion of the advantages and disadvantages of the approach.

Karla also walked through an implementation with some fictitious data - a case of the Mondays - and highlighted some “gotchas” to watch out for.
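As a generic illustration of the single-tree-versus-forest comparison (using the built-in iris data rather than the fictitious dataset from the talk), a minimal sketch with the rpart and randomForest packages might look like this:

```r
library(rpart)         # single decision tree
library(randomForest)  # random forest

set.seed(42)
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# One decision tree: easy to read, but a single greedy sequence of splits
tree_fit <- rpart(Species ~ ., data = train)

# Random forest: many trees grown on bootstrap samples, each split chosen
# from a random subset of mtry predictors; predictions are majority votes
rf_fit <- randomForest(Species ~ ., data = train, ntree = 500, mtry = 2)

# Compare holdout accuracy of the two models
mean(predict(tree_fit, test, type = "class") == test$Species)
mean(predict(rf_fit, test) == test$Species)
```

Printing the fitted forest also reports an out-of-bag error estimate, which is a handy sanity check when you don't want to hold out a separate validation set.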

Thank you

Many thanks to our fantastic presenters, Kavana, Trang, and Karla, and to R-Ladies Global for making the virtual event possible via Zoom.
