A random rstats year overview

It’s been a year or so since I stopped actively following the rstats community or R development in general. I checked in here and there, but I didn’t need to use R on a daily basis, so I let it slip. However, recently I felt the need to pick it up again, and what better way to do that than to write a blogpost about it?

How do you keep up with a programming language and its community? Methodically.

The Source

Let’s start at the source. There have been 5 releases between July 2019 and July 2020:

Version  Date        Name
3.6.2    2019-12-12  Dark and Stormy Night
3.6.3    2020-02-29  Holding the Windsock
4.0.0    2020-04-24  Arbor Day
4.0.1    2020-06-06  See Things Now
4.0.2    2020-06-22  Taking Off Again

The easiest way to find out what has changed in each version is to check out R News. The minor version updates mainly contain bug fixes and updates to low-level code, so I’ll skip those. It is the major release that has some “SIGNIFICANT USER-VISIBLE CHANGES”; I like how they put that in capital letters.

The two biggest changes for me are a new raw string syntax and stringsAsFactors being FALSE by default. Yes, you heard it right, the default value for stringsAsFactors has finally changed.

If you want to know why it was set to TRUE in the first place, I highly recommend Roger Peng’s blogpost. As for the raw string syntax, that’ll save a lot of backslashes when working with regex. Here’s a quick example of what you can now do:

r"(C:\Program Files)"
## [1] "C:\\Program Files"

Check out Luke Tierney’s talk at useR!2020 for more information about these and other changes.

The one thing that really strikes me, as a casual R user at the moment, is that none of the changes in the changelog really affect me. Perhaps my reliance on packages has become so great that I’ll only notice changes to core R when or if package authors decide to pass them through.

The Postman always delivers…packages

So instead of looking at core R, maybe I should be looking at changes to packages. At the time of writing there are 16035 packages on CRAN. Here is me going through 16035 changelogs…

Of course I’m not going to do that! A better idea is to just look at updated and new packages. Joseph Rickert did something similar on the Revolutions blog using data from Dirk Eddelbuettel’s CRANberries website. The website publishes news about updated and new packages; there’s even an accompanying Twitter account. There are two ways I can get at the data I want: download the tweets or extract the data from the website’s RSS feed. I’ll go for the second option because the information is much richer and I haven’t done that before.

I’ll be using the feedeR package by Andrew Collier to tap the RSS feed. The script that underpins CRANberries also outputs results by month, so all I have to do is point my code at the correct time periods.

library(feedeR)
library(purrr)     # map(), map_df(), pluck()
library(lubridate) # ymd(), year(), month()
library(dplyr)     # the pipe and the data wrangling further down

# build the CRANberries RSS URL for a given month and pull in the feed
get_data <- function(date){
  year <- year(date)
  month <- stringr::str_pad(month(date), 2, pad="0")
  return(feed.extract(glue::glue("http://dirk.eddelbuettel.com/cranberries/{year}/{month}/index.rss")))
}

# one feed per month: July 2019 through June 2020
map(ymd(20190701) + months(0:11), get_data) %>%
  map_df(~pluck(.x, "items")) -> pkg_data

glimpse(pkg_data)
## Rows: 16,407
## Columns: 5
## $ title       <chr> "Package SpatialPack  updated to version 0.3-8  with prev…
## $ date        <dttm> 2019-07-31 17:02:00, 2019-07-31 17:02:00, 2019-07-31 17:…
## $ link        <chr> "http://dirk.eddelbuettel.com/cranberries/2019/07/31#Spat…
## $ description <chr> "\n<p>\n<strong>Title</strong>:  Tools for Assessment the…
## $ hash        <chr> "85f698ba19c8492f", "f4a6dc277a61972b", "a578f1fb1dd7f8de…

So what have I missed in the past year? Let’s visualise the number of updated/new packages and see.

# flag each feed item as an update or a brand-new package, and reduce the timestamp to a date
pkg_data <- pkg_data %>%
  mutate(type=ifelse(stringr::str_detect(title, 'updated'), 'updated', 'new'),
         date=as_date(date))

table(pkg_data$type)
## 
##     new updated 
##    4634   11773
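
Visualising this is essentially a count per month. Here’s a minimal ggplot2 sketch of how such a plot could be drawn (the monthly binning via floor_date and the dodged bars are my own choices, not necessarily the original chart):

library(ggplot2)

pkg_data %>%
  count(month = floor_date(date, 'month'), type) %>%
  ggplot(aes(month, n, fill = type)) +
  geom_col(position = 'dodge') +
  labs(x = NULL, y = 'packages per month', fill = NULL)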

There have been 4634 new packages on CRAN in the past year; that’s amazing! Even on a monthly basis the growth is crazy: the number of new packages per month almost doubled in the first half of 2020. The pattern is repeated for the monthly number of updated packages. I wonder if it is a seasonal effect or some other external factor. I’ve added the releases of new R versions to the plot, but it’s hard to say whether they are the cause.

Nevertheless, this doesn’t really solve my problem, as I don’t want to go through 4634 packages (never mind the 11K updates). Perhaps I can extract the most interesting ones by looking at the number of downloads. The cranlogs package from the nice people at RStudio gives us the number of downloads from the RStudio CRAN Mirror.

the_pkgs <- pkg_data %>%
  mutate(package = stringr::str_match(title, '(?<=[Pp]ackage )\\S*')[, 1]) %>% 
  # regex explained: any non-whitespace characters after the word 'package'
  select(package, type) %>%
  distinct()

library(cranlogs)

split(the_pkgs$package, ceiling(seq_along(the_pkgs$package)/500)) %>%
  # a split is needed until cranlogs#10 is resolved
  map_df(~cran_downloads(package=.x, from='2019-07-01', to='2020-06-30', total=TRUE)) %>%
  select(package, count) %>%
  distinct() %>%
  left_join(the_pkgs) %>%
  arrange(desc(count)) -> pkg_downloads
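
For the curious, the distribution of those download counts can be sketched with a quick histogram; the bin count, the log scale, and the dropping of zero-download packages are my own choices here:

pkg_downloads %>%
  filter(count > 0) %>%  # a log scale cannot show zero-download packages
  ggplot(aes(count)) +
  geom_histogram(bins = 50) +
  scale_x_log10() +
  labs(x = 'downloads (log scale)', y = 'number of packages')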

I have my doubts about the log transformation on the x-axis, but the conclusion remains the same: a number of packages are downloaded a very large number of times, while the majority are downloaded only a handful of times. Using the number of downloads as a proxy for ‘interestingness’, I can now narrow the field. Let’s see what the top 10 looks like.

Top 10 downloaded packages

New packages
package               count
aws.s3             33247993
lifecycle           8641283
fastmap             3407634
lme4                3114253
psych               1664886
fracdiff            1579339
subselect            499165
leaflet.providers    367060
parameters           223878
lsei                 223113

Updated packages
package               count
aws.ec2metadata    38373146
rsconnect          37618108
aws.s3             33247993
rlang              17630623
ggplot2            13939669
dplyr              13514400
vctrs              12852910
Rcpp               12411703
ellipsis           11457747
tibble             11394919

Source: RStudio CRAN Mirror


Obviously the majority of the top packages are going to be from the tidyverse. These are mostly mature packages, so I don’t expect many changes to the core functionality; I’ll review the recent updates whenever I use a package. Let’s focus on the new packages instead. Here we see aws.s3 and lme4, which aren’t really new packages: the former has lived on GitHub for a while and the latter is just a port to S4 classes. We also see some packages that were new but have already been archived, such as lsei. Good to see CRAN is being kept clean and tidy. The list of new packages does reveal some packages to explore (e.g. fastmap looks interesting), but overall it is still too long. If only there were someone who had done the job of filtering interesting packages for me…

I am La-Zy, God of dawdle

That someone is Joseph Rickert, now at RStudio, where he posts his top new packages every month. Here’s an example for September 2019; the picks are even categorized!

I could stop here and go through the 12 blogposts manually, but where would the fun be in that? Instead, let’s use rvest to scrape the posts and extract the list of new packages.

library(rvest)

fetch_data <- function(url){
  # grab the h3 headers (categories) and the paragraphs (package entries) from each post
  post <- read_html(url) %>%
    html_nodes(".article-entry") %>% 
    html_nodes("h3, p")

  # an h3 becomes the category; the first link in a paragraph is the package name
  headers <- map_chr(post, ~ ifelse(html_name(.x) == 'h3', 
                                    html_text(.x), NA))
  packages <- map_chr(post, ~ ifelse(html_name(.x) == 'p', 
                                     (html_nodes(.x, 'a') %>% html_text())[1], NA))
  
  tibble(Category=headers, Package=packages) %>%
    tidyr::fill(Category) %>%
    tidyr::drop_na() -> result
  
  return(result)
}

urls <- readLines('top40_rviews_pkgs.txt')
rviews_pkgs <- map_dfr(urls, fetch_data)

head(rviews_pkgs)
## # A tibble: 6 x 2
##   Category     Package   
##   <chr>        <chr>     
## 1 Data         eia       
## 2 Data         litteR    
## 3 Data         rSymbiota 
## 4 Data Science bdpar     
## 5 Data Science modeLLtest
## 6 Finance      lazytrade

It was easier to get just the packages, but I also wanted the category, hence the double use of map_chr. Either way, the scraping resulted in 441 packages (because 11 x 40 = 441, right?). Let’s see how the packages are distributed across the categories.
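
A quick bar chart of the category counts does the job; a minimal sketch (the ordering and flipped axes are my own choices):

rviews_pkgs %>%
  count(Category, sort = TRUE) %>%
  ggplot(aes(reorder(Category, n), n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = 'number of packages')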

It makes sense Statistics would be the top category so I guess that’s a good sanity check. Category is an interesting property by which I can further filter down. For example, I’m not really interested in packages from the Genomics or Medicine category.

For now I’ll restrict myself to packages from Utilities and Machine Learning, which still leaves 128 packages. Maybe if I join this to the number of downloads I can work my way down the list.
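
Roughly speaking, the join looks like this (a sketch; the exact filtering and ordering are my assumptions, with the result shown in the table below):

rviews_pkgs %>%
  filter(Category %in% c('Utilities', 'Machine Learning')) %>%
  inner_join(pkg_downloads, by = c('Package' = 'package')) %>%
  select(Category, Package, count) %>%
  arrange(desc(count)) %>%
  head(20)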

Category          Package                count
Utilities         fastmap              3407634
Utilities         arrow                 162767
Utilities         renv                  118931
Utilities         warp                   93468
Machine Learning  mlr3                   49123
Utilities         googlesheets4          45286
Utilities         hardhat                38556
Utilities         gt                     28884
Utilities         pins                   23925
Machine Learning  mlr3pipelines          22274
Utilities         progressr              20378
Machine Learning  azuremlsdk             13704
Machine Learning  quanteda.textmodels    11196
Machine Learning  mlr3proba               9840
Utilities         gridtext                9640
Utilities         babelwhale              8268
Utilities         modelsummary            7986
Machine Learning  modelStudio             7851
Machine Learning  forecastML              7249
Utilities         butcher                 7020


I have already mentioned fastmap but just going through the list amazes me. I mean pins solves my data problems, gt solves my table formatting problems (I’ve been using it throughout this blogpost), renv solves my environment problems, etc., etc.

It looks like the past year has been a good one and I have missed a lot. So if you don’t mind, I’ll now go down the rabbit hole and try out all these goodies.