Trust in Excel or R

Trustworthiness

David Spiegelhalter has written in several places about trust in algorithms e.g. Harvard Data Science Review, Should We Trust Algorithms?, David Spiegelhalter, Jan 31, 2020, DOI: 10.1162/99608f92.cb91a35a.

The general idea is that we shouldn’t think just about trust in algorithms but also trustworthiness. He provides a checklist of questions he would like to ask:

  1. Is it any good when tried in new parts of the real world?
  2. Would something simpler, and more transparent and robust, be just as good?
  3. Could I explain how it works (in general) to anyone who is interested?
  4. Could I explain to an individual how it reached its conclusion in their particular case?
  5. Does it know when it is on shaky ground, and can it acknowledge uncertainty?
  6. Do people use it appropriately, with the right level of skepticism?
  7. Does it actually help in practice?

Excel vs R

A common criticism, heard from both sides of the argument, is that R (or Excel) is not trusted because it is not clear how the model has been implemented. Excel users claim the WYSIWYG interface is transparent, while R users claim that it is exactly this interface that makes interrogating and testing the model difficult, and so not transparent.

Can we use David Spiegelhalter’s ideas to compare Excel and R for doing HTA?

The previous list is to do with the underlying algorithm, the data it is used on and how the results are used. There is no mention of the implementation, which is what we are interested in here.

So, borrowing from above, a possible Excel vs R check list could be:

  1. Is it able to simply implement a given model?
  2. Can someone easily understand the implementation (against the mathematical description)?
  3. Is the flow through the model clear?
  4. Are there tests and checks built in to the model?
  5. Are the inputs constrained or errors produced for bad values?

Of course, these elements are interrelated. If a model is easy to implement in software then it is more likely to be easy to understand and to follow its pipeline. So there is an assumption that the model builder has implemented the model in the best(ish) way appropriate for that software.
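Returning to items 4 and 5 of the checklist, in R such tests and input constraints can be written explicitly alongside the model. The snippet below is only an illustrative sketch (the function run_model() and its arguments are made up), but it gives a flavour of what built-in checks might look like:

# illustrative only: a made-up model function with explicit input checks
run_model <- function(p_event, cost, n_cycles) {

  # constrain the inputs and fail early with an error for bad values (item 5)
  stopifnot(p_event >= 0, p_event <= 1,
            cost >= 0,
            n_cycles > 0, n_cycles == round(n_cycles))

  # ... model calculations would go here ...
}

# a built-in check of expected behaviour (item 4), e.g. with the testthat package
testthat::expect_error(run_model(p_event = 1.2, cost = 100, n_cycles = 10))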

There are other benefits to using R, such as speed, extensibility and reuse, but these aren’t directly linked to trustworthiness.

Cost-effectiveness acceptability SURFACE

Listening to a recent talk, I realised that there doesn’t seem to be a common name for something commonly used in health economic evaluation.

There are the staple plots which everyone will recognise: the cost-effectiveness plane and the cost-effectiveness acceptability curve (CEAC). The CEAC gives the probability of being cost-effective for different willingness-to-pay thresholds. This could similarly be done for any other model parameter, e.g. the sensitivity of a diagnostic test.

We can further show the probability of being cost-effective for two parameters on a grid of points. This could be a mesh, contour, surface, hex plot or whatever is your favourite.

The plot above is a related, previous plot of mine incorporating a third parameter, showing the unit cost of a test at which the probability of being cost-effective is greater than 50% for different test sensitivities and specificities.

The point here is that this sort of 2D plot doesn’t appear to have an agreed name in the health economics literature in the same way as the CEAC. Perhaps a cost-effectiveness acceptability surface, or CEAS?
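As a rough sketch of how such a surface might be produced, suppose we already have a (hypothetical) function prob_CE(sens, spec) that returns the probability of being cost-effective for a given test sensitivity and specificity (e.g. the proportion of PSA samples with a positive incremental net benefit). Then the CEAS is just this probability evaluated over a grid of parameter values and plotted, for example with ggplot2:

library(ggplot2)

# prob_CE() is a placeholder for a function returning the probability of
# being cost-effective for a given test sensitivity and specificity
param_grid <- expand.grid(sens = seq(0.5, 1, by = 0.05),
                          spec = seq(0.5, 1, by = 0.05))

param_grid$p_ce <- mapply(prob_CE, param_grid$sens, param_grid$spec)

# one possible cost-effectiveness acceptability surface
ggplot(param_grid, aes(x = sens, y = spec, fill = p_ce)) +
  geom_tile() +
  labs(x = "Test sensitivity",
       y = "Test specificity",
       fill = "Pr(cost-effective)")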

Health Economics in R Data Hack

On January 21st – 22nd 2020 at Queen’s University Belfast, we hosted the second health economics in R event – a workshop/hackathon/data dive mash-up. (Read about the first one here).

Generally, day one was aimed more at people new to health economics and R. Day two was aimed more at those already familiar with health economic evaluation who were interested in creating new R tools to tackle problems in health economics, large and small.

There was a lot of interest in the event beforehand and it was oversubscribed. Attendees came from academia, government, consultancies and industry, including UCL, University of Bristol, Glasgow and NICE amongst others. In particular, we had a lot of attendees from Ireland and outside of London which was one of our intentions. The Belfast hackathon website has more details about the structure of the two days and the aims.

Day 1

The day was structured as a series of introductory lectures on health economics and healthcare evaluation. These were led by Dr Felicity Lamrock, and covered cost-effectiveness modelling, uncertainty and Markov modelling. The day ended with a demonstration of all of these things in R using the heemod package. The slides can be found on GitHub here.

It was also great to socialise with everyone at the evening meal at the end of the first day, following a hard day’s work, at the lovely Riddle Hall.

Day 2

The hackathon started with brief project pitches by several attendees. Following this, the participants split into groups to work on these or other projects proposed within groups.

The hackers were supported by our expert research software engineers Rob Ashton, Igor Siveroni and Giovanni Charles from Imperial College London.

Projects included:

  • pdf2data – To take a pdf table of input data (from an HTA report) and wrangle it into a form for use in a cost-effectiveness model in R.
  • Formatting-data-for-network-meta-analysis – Converting systematic literature review data for network meta-analysis.
  • Creating shiny application for existing models.

We also had focused, advanced R skills sessions to up-skill current R users on Git and GitHub, Shiny, and package building in RStudio.

Some of the final day participants.

These meetings have been made possible by generous support from the Medical Research Council, Centre for Global Infectious Disease Analysis (Grant Reference MR/R015600/1), NIHR Health Protection Research Unit in Modelling Methodology and Imperial College London.

jagsAddIn: an RStudio AddIn

I have been aware of the RStudio AddIns for a while now but never really saw the usefulness in them (see https://rstudio.github.io/rstudioaddins/). To me, the benefit of using RStudio is that coding in R is more fun, and adding more menus and clicking, taking you away from the keyboard, seemed like a step backwards. I know that some of my Vim colleagues would agree!

However, I came across a GitHub repo which is an AddIn of AddIns; that is, it has a list of lots of them (https://github.com/daattali/addinslist). What I found potentially useful were those that auto-generate some template code, like the dropdown menu option that generates Roxygen documentation skeletons for functions.

To this end, I’ve written a small AddIn myself which generates some code for running JAGS from R. Each session in which I do this more or less uses the same boilerplate code, so I realised that this could actually be useful. It’s something like this:

cat("model {
for (i in 1:N) {
}")
data <- list("fdfd")
params <- list("fdfd")
nc <- 1
ns <- 1000
nb <- 100
nt <- 1
bug_out <-
bugs(data = dat,
inits = inits,
parameters.to.save = params,
n.chains = nc,
n.iter = ns,
n.burnin = nb,
n.thin = nt,
DIC = TRUE
working.directory = getwd)

I have since had my AddIn included in the addinslist table so you can see for yourself if it is in fact useful. Visit here https://github.com/n8thangreen/jagsAddIn.
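To try it out, something like the following should work (assuming the remotes package is installed). The function shown afterwards is only a sketch of the general idea behind the AddIn, not necessarily the actual exported function:

# install the AddIn from GitHub (assumes the remotes package is available)
remotes::install_github("n8thangreen/jagsAddIn")

# the core of such an AddIn is just a function that pastes boilerplate text
# into the active document; insert_jags_code() here is purely illustrative
insert_jags_code <- function() {
  boilerplate <- "bug_out <- bugs(data = dat, inits = inits, ...)\n"
  rstudioapi::insertText(text = boilerplate)  # insert at the cursor position
}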

Reconstructing data from Kaplan-Meier curves

Using the methodology given in Guyot et al. BMC Medical Research Methodology 2012, 12:9 (http://www.biomedcentral.com/1471-2288/12/9), we held a practical session to show how to do this in R.

The algorithm in Guyot (2012) is included in the digitise() function in the survHE package.

The first step is to extract the data points from an image using a plot digitiser app. The plot used to digitise is shown below. The red points are manually selected along the length of the curve.

The software used by Guyot (2012) is called DigitizeIt. A fee needs to be paid to use it, so for the purposes of this tutorial we used a freely available equivalent tool called WebPlotDigitizer. The tools are very similar in functionality, but their output is formatted differently, and each digitisation may be slightly different simply due to the manual selection of points. So we needed to do a little preprocessing before we could use the existing function.

Two matrices are required: the survival data taken directly from the Kaplan-Meier plot, and the at-risk matrix, which includes the row numbers at which the data are divided into intervals. Read in the data from the digitiser.

library(survHE)    # provides digitise()
library(survival)  # provides survfit(), used later

Guyot_data <- read.csv("Guyot (2012) data_WebPlotDigitizer2.csv", header = TRUE)
head(Guyot_data)
## Time Survival
## 1 0.0000 1.0000
## 2 0.1722 1.0018
## 3 0.6887 0.9929
## 4 1.2912 0.9821
## 5 2.0660 0.9768
## 6 2.6686 0.9732

Determine which rows the upper and lower values of each interval are.

find_interval_limits <- function(start_time,
                                 surv_time){
  
  if (max(surv_time) > max(start_time))
    stop("Intervals must span all survival times. Later interval missing.")
  
  interval <- 1
  end_time <- start_time[2]
  upper <- NULL
  
  for (i in seq_along(surv_time)){

    if (surv_time[i] >= end_time) {

      upper[interval] <- i - 1 
      interval <- interval + 1
      end_time <- start_time[interval + 1]
    } 
  }
  
  cbind(lower = c(1, upper + 1),
        upper = c(upper, length(surv_time)))
}
interval_limits <-
  find_interval_limits(start_time = c(0, 10, 20, 30, 40, 50, 60),
                       surv_time = Guyot_data$Time)

We can then add these extraction-specific values to the numbers-at-risk data taken from a table in the paper.

Guyot_atrisk <- read.csv("Guyot (2012) at-risk data.csv", header = TRUE)
Guyot_atrisk[, c("lower", "upper")] <- interval_limits
head(Guyot_atrisk)
## Interval time lower upper nrisk
## 1 1 0 1 29 213
## 2 2 10 30 53 122
## 3 3 20 54 68 80
## 4 4 30 69 86 51
## 5 5 40 87 105 30
## 6 6 50 106 111 10
# add the row number column required by digitise()
Guyot_data <- cbind("k" = rownames(Guyot_data), Guyot_data)

The arguments of digitise() are file names of text files, so we need to save these data first.

write.table(Guyot_atrisk, file = "Guyot_atrisk.txt", row.names = FALSE)
write.table(Guyot_data, file = "Guyot_data.txt", row.names = FALSE)
digitise(surv_inp = "Guyot_data.txt",
         nrisk_inp = "Guyot_atrisk.txt")
IPDdata <- read.table("IPDdata.txt", header = TRUE)
KMdata <- read.table("KMdata.txt", header = TRUE)
head(IPDdata)
## time event arm
## 1 0.6887 1 1
## 2 1.2912 1 1
## 3 1.2912 1 1
## 4 1.2912 1 1
## 5 2.0660 1 1
## 6 2.6686 1 1
head(KMdata)
## time n.risk n.event n.censored
## 1 0.0000 213 0 0
## 2 0.1722 213 0 2
## 3 0.6887 211 1 2
## 4 1.2912 208 3 3
## 5 2.0660 202 1 3
## 6 2.6686 198 1 2

Find the Kaplan-Meier estimates for the recovered data and compare summary statistics with Guyot (2012).

IPD <- as.data.frame(IPDdata)
KM.est <- survfit(Surv(time, event) ~ 1,
                  data = IPDdata,
                  type = "kaplan-meier")
summary(KM.est, times = c(12, 24, 36))
## Call: survfit(formula = Surv(time, event) ~ 1, data = IPDdata, type = "kaplan-meier")
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 12 109 65 0.647 0.0355 0.581 0.721
## 24 70 23 0.506 0.0382 0.436 0.586
## 36 39 4 0.468 0.0398 0.396 0.553
survminer::surv_median(KM.est)
## strata median lower upper
## 1 All 25.0502 15.9254 49.1535

We see that the results are very similar to the original survival curve.
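As a further quick check (a sketch, assuming the objects above are still in the workspace), we can plot the Kaplan-Meier curve fitted to the reconstructed data and compare it by eye with the published figure:

# plot the Kaplan-Meier curve from the reconstructed IPD for a visual
# comparison with the original published curve
plot(KM.est,
     xlab = "Time",
     ylab = "Survival probability",
     main = "Kaplan-Meier curve from reconstructed data")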

Health Economics in R Hackathon

Imperial College 2019

On November 6th – 7th 2019 at Imperial College London, we hosted the first health economics in R hackathon.

The event was aimed at health economists, statisticians and R users who are interested in creating new R tools to tackle problems in health economics, large or small.

There was a lot of interest in the event (despite the fact the dates coincided with one of the main health economics conferences!). Attendees came from academia, government, consultancies and industry, including UCL, University of Bristol, Glasgow, NICE amongst others. The hackathon website has details about the structure of the two days and some of the ambitions.

The hackathon started with brief project pitches by Nathan Green, Howard Thom, Sangeeta Bhatia, Nichola Naylor and Sam Abbott. Following this, the participants split into groups to work on these or other projects proposed within groups.

There was plenty of enthusiasm and hard work, and even something tangible to show for it at the end.

It was also great to socialise with everyone at the evening meal at the end of the first day, following a hard day’s hacking.

The final presented projects were:

  • SHEPRD – Signposting Health Economic Packages in R for Decision Modelling
  • hermes6 – Writing fast Markov models. A comparison of different implementations in R
  • inflatably – Inflating costs to present value (PV) using different source rates
  • xlerate – Run Excel from R without Excel
  • Packaging an R cost-effectiveness tutorial for ease-of-use and reproducibility

We also had focused advanced R skills sessions to upskill current R users on Git and GitHub, Shiny, testing (unit tests/testthat, Travis CI), and package building in RStudio.

There will be a second event happening at Queen’s University Belfast on 21st and 22nd of January 2020. The format will be slightly different with a half-day of teaching followed by a data dive (delving into a health economics related dataset). This is aimed at less experienced R users than the hackathon. The event is proving even more popular than the London hackathon so we are really looking forward to it.

These meetings have been made possible by generous support from the Medical Research Council, Centre for Global Infectious Disease Analysis (Grant Reference MR/R015600/1), NIHR Health Protection Research Unit in Modelling Methodology and Imperial College London.

Creating WordPress blog posts with RMarkdown

R Markdown

This blog post is written as an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.

I’ve used the RWordPress package and the instructions from this blog post:

https://cognitivedatascientist.com/tag/knit2wp/

#activate the necessary libraries
library(RWordPress)
library(knitr)
library(XMLRPC)
library(RCurl)

# tell RWordPress how to set the user name, password, and URL for your WordPress site.
options(WordPressLogin = c(n8thangreen20 = 'PASSWORD'),
        WordPressURL = 'https://healthdatacounts.wordpress.com/')

# tell knitr to create the html code and upload it to your WordPress site
knit2wp(input = "postname.Rmd",
        title = "posttitle",
        publish = FALSE,
        action = "newPost")

De-duplicating clinic visits

Problem

We want to keep only an individual’s clinic visits which are for new infections.
Therefore, we want to remove repeat visits to the clinic,
where a repeat is defined as a visit within 6 weeks of the previous one.
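As a toy illustration of this rule (the dates below are made up), the second visit is dropped because it falls within 6 weeks of the first, while the third is kept:

# toy example of the 6-week (42-day) repeat visit rule
visits <- as.Date(c("2014-05-07", "2014-05-27", "2014-08-01"))

# days since the previous visit; a repeat is a gap of 42 days or less
gap <- c(NA, diff(visits))
keep <- is.na(gap) | gap > 42

data.frame(date = visits, gap_days = gap, keep = keep)

##         date gap_days  keep
## 1 2014-05-07       NA  TRUE
## 2 2014-05-27       20 FALSE
## 3 2014-08-01       66  TRUE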

Solutions

There are a number of ways to tackle this problem, which vary in terms of computation speed, simplicity and intuition.

To begin, load the packages we’re going to need:

library(reshape2)
library(dplyr)
library(readxl)
library(lubridate)
library(knitr)

Load in the data. This is in the form of individual patient data (IPD) clinic visit dates and ids.

load(file = here::here("cleaned_visit_date.RData"))
dat0 <- dat
kable(head(dat))

         uniq_id              date        person_id
67219    F053090_2012-03-07   2012-03-07  F053090
135647   F105975_2014-05-21   2014-05-21  F105975
211067   16M03749_2016-04-18  2016-04-18  16M03749
49805    M091207_2012-03-27   2012-03-27  M091207
39751    F066543_2011-12-29   2011-12-29  F066543
37228    F108730_2011-12-02   2011-12-02  F108730

Simple loops solution

We could simply step through the original data line by line.

library(dplyr)

keep_record <- c()
dat <- dat0
dat <-
  dat[1:5000, ] %>% 
  arrange(person_id, date)

previous_patient <- ""
previous_date <- 0

for (i in seq_len(nrow(dat))){

  person_id <- dat[i, "person_id"]
  date <- dat[i, "date"]

  # first visit
  if (person_id != previous_patient){
    # print("new id")
    keep <- TRUE
    previous_patient <- person_id
    previous_date <- date

  } else if ((date - previous_date) > 42){
    # print("over 42")
    keep <- TRUE
    previous_date <- date
  } else{

    keep <- FALSE
  }

  keep_record <- c(keep_record, keep)
}

table(keep_record)

## keep_record
## FALSE  TRUE 
##    31  4969

Tidyverse solution

Within a while loop we group by individual and, for the earliest date, calculate a 6-week time window (end date window_end) and a flag for whether or not subsequent clinic visits fall within it (within_window).

We keep track of 2 output patient lists: repeat_visit_ids and first_visit_ids.
The latter is just the ids from the first dates.
Once we have added to these lists, using the information from window_end, we remove the first date and the repeat visits associated with it.
Finally, we repeat the whole thing until there are no rows left in the dataframe.

time_window <- weeks(6) #  duration(42, "days")          #6 weeks in days
repeat_visit_ids <- NULL   #initialise
first_visit_ids  <- NULL
dat <- dat0

while(nrow(dat) > 0) {

  dat <-
    dat %>%
    group_by(person_id) %>%
    dplyr::arrange(date, .by_group = TRUE) %>% 
    mutate(window_end = min(date) + time_window,
           within_window = date <= window_end,
           first_date = date == min(date)) %>% 
    ungroup()

  repeat_visit_ids <-
    dat %>%
    filter(within_window & !first_date) %>%
    transmute(uniq_id) %>%
    rbind(repeat_visit_ids, .)

  first_visit_ids <-
    dat %>%
    filter(first_date) %>%
    transmute(uniq_id) %>%
    rbind(first_visit_ids, .)

  dat <-
    dat %>%
    filter(!within_window)
}

The output is a list of unique visit ids.

kable(head(first_visit_ids))

uniq_id
14F00004_2014-05-07
14F00006_2014-05-07
14F00016_2014-05-07
14F00019_2014-05-07
14F00021_2014-05-07
14F00027_2014-06-12
nrow(first_visit_ids)

## [1] 90389

Pre-processed data structure solution

We could create a data structure for easier computation.

First, let’s index the repeat visits for each patient by creating a ‘visit’ column with a counter.

dat <- dat0

dat <-
  dat %>%
  group_by(person_id) %>%
  arrange(date) %>%
  mutate(visit = row_number()) %>%
  arrange(person_id, visit)

kable(head(dat, n = 10))

uniq_id              date        person_id  visit
14F00004_2014-05-07  2014-05-07  14F00004   1
14F00004_2014-05-27  2014-05-27  14F00004   2
14F00004_2014-06-06  2014-06-06  14F00004   3
14F00006_2014-05-07  2014-05-07  14F00006   1
14F00016_2014-05-07  2014-05-07  14F00016   1
14F00019_2014-05-07  2014-05-07  14F00019   1
14F00021_2014-05-07  2014-05-07  14F00021   1
14F00021_2014-06-11  2014-06-11  14F00021   2
14F00027_2014-06-12  2014-06-12  14F00027   1
14F00034_2014-06-06  2014-06-06  14F00034   1

For any computation using this data set we can simply remove the single-visit individuals, because we already know these are kept.
This reduces the size of the dataframe considerably.

We can cast the data to a wide format to show a kind of patient trajectory, such that the columns are the visit numbers and the entries are dates (on a numeric scale).

yy <- dcast(person_id ~ visit,
            data = dat, value.var = "date")

yy <-
  yy %>%
  filter(!is.na(`2`))

kable(head(yy[ ,1:11], n = 10))

person_id  1      2      3      4      5      6      7      8      9      10
14F00004   16197  16217  16227  NA     NA     NA     NA     NA     NA     NA
14F00021   16197  16232  NA     NA     NA     NA     NA     NA     NA     NA
14F00048   16198  16204  16484  16618  17001  17224  17488  17744  17831  NA
14F00054   16198  16209  17080  NA     NA     NA     NA     NA     NA     NA
14F00071   16226  16273  16380  16394  16408  16527  16555  NA     NA     NA
14F00073   16198  16359  17416  NA     NA     NA     NA     NA     NA     NA
14F00075   16261  17031  NA     NA     NA     NA     NA     NA     NA     NA
14F00106   16359  16378  NA     NA     NA     NA     NA     NA     NA     NA
14F00108   16202  16393  NA     NA     NA     NA     NA     NA     NA     NA
14F00109   16202  16230  16237  NA     NA     NA     NA     NA     NA     NA

Rather than this flat representation, we can split the data into a list indexed by visit number.
That is, element i of the list contains, for each individual, the trajectory relative to their ith visit.
The entries in the dataframes are the differences between the ith visit and all of the subsequent visits.
We’ll use a subsample just to keep the computation time down.
Note that I remove the patient id so that I can do computation on all of the visit times, before adding it back in.

yy <- yy[1:1000, ]
visits_wide <- select(yy ,-person_id)
zz <- list()

for (i in seq_along(visits_wide)) {

  zz[[i]] <- visits_wide - visits_wide[, i]   # difference between origin visit and subsequent dates
  zz[[i]][zz[[i]] <= 0] <- NA                 # replace negative times

  # which rows still have at least one later visit (i.e. are not all NA)?
  keep_rows <- rowSums(is.na(zz[[i]])) != ncol(zz[[i]])
  zz[[i]]$person_id <- yy$person_id           # include patient id
  zz[[i]] <- zz[[i]][keep_rows, ]             # remove all-NA rows
}

lapply(zz[1:3], function(x) x[1:6, 1:11])

## [[1]]
##    1   2    3   4   5    6    7    8    9 10 11
## 1 NA  20   30  NA  NA   NA   NA   NA   NA NA NA
## 2 NA  35   NA  NA  NA   NA   NA   NA   NA NA NA
## 3 NA   6  286 420 803 1026 1290 1546 1633 NA NA
## 4 NA  11  882  NA  NA   NA   NA   NA   NA NA NA
## 5 NA  47  154 168 182  301  329   NA   NA NA NA
## 6 NA 161 1218  NA  NA   NA   NA   NA   NA NA NA
## 
## [[2]]
##     1  2    3   4   5    6    7    8    9 10 11
## 1  NA NA   10  NA  NA   NA   NA   NA   NA NA NA
## 3  NA NA  280 414 797 1020 1284 1540 1627 NA NA
## 4  NA NA  871  NA  NA   NA   NA   NA   NA NA NA
## 5  NA NA  107 121 135  254  282   NA   NA NA NA
## 6  NA NA 1057  NA  NA   NA   NA   NA   NA NA NA
## 10 NA NA    7  NA  NA   NA   NA   NA   NA NA NA
## 
## [[3]]
##     1  2  3   4   5   6    7    8    9 10 11
## 3  NA NA NA 134 517 740 1004 1260 1347 NA NA
## 5  NA NA NA  14  28 147  175   NA   NA NA NA
## 12 NA NA NA 151 785  NA   NA   NA   NA NA NA
## 15 NA NA NA 185  NA  NA   NA   NA   NA NA NA
## 18 NA NA NA  69 182  NA   NA   NA   NA NA NA
## 20 NA NA NA 161  NA  NA   NA   NA   NA NA NA

The reason for splitting the data in this way is that the deduplication operations on it are now straightforward, because it is, in a sense, already sorted.
The code below was my first attempt at this approach.
It is slow and pretty ugly.

out <- zz
from_visit_seq <- head(seq_along(zz), -1)

for (i in from_visit_seq){

  # only consider later columns (to the right)
  keep_cols <- names(out[[i]])[!names(out[[i]]) %in% (1:i)]
  future_visits <- out[[i]][ ,keep_cols]

  for (j in out[[i]]$person_id){

    if (nrow(out[[i]]) == 0) break

    future_id <- future_visits[future_visits$person_id == j, ]

    # which visit number (column name) is within time window (42 days) for each patient?
    times <- select(future_id, -person_id)
    visit_rm <- colnames(times)[times <= time_window & !is.na(times)]

    if (length(visit_rm) > 0) {

      # remove these repeat visits in list element
      for (k in as.numeric(visit_rm))
        out[[k]] <- out[[k]][out[[k]]$person_id != j, ]
    }
  }
}

lapply(out[1:3], function(x) x[1:6, 1:11])

## [[1]]
##    1   2    3   4   5    6    7    8    9 10 11
## 1 NA  20   30  NA  NA   NA   NA   NA   NA NA NA
## 2 NA  35   NA  NA  NA   NA   NA   NA   NA NA NA
## 3 NA   6  286 420 803 1026 1290 1546 1633 NA NA
## 4 NA  11  882  NA  NA   NA   NA   NA   NA NA NA
## 5 NA  47  154 168 182  301  329   NA   NA NA NA
## 6 NA 161 1218  NA  NA   NA   NA   NA   NA NA NA
## 
## [[2]]
##      1  2    3    4    5  6  7  8  9 10 11
## 273 NA NA 2069   NA   NA NA NA NA NA NA NA
## 398 NA NA  831 1317 1321 NA NA NA NA NA NA
## 494 NA NA  995 1038   NA NA NA NA NA NA NA
## 504 NA NA  258   NA   NA NA NA NA NA NA NA
## 678 NA NA  578  672 1519 NA NA NA NA NA NA
## NA  NA NA   NA   NA   NA NA NA NA NA NA NA
## 
## [[3]]
##       1  2  3  4  5  6  7  8  9 10 11
## NA   NA NA NA NA NA NA NA NA NA NA NA
## NA.1 NA NA NA NA NA NA NA NA NA NA NA
## NA.2 NA NA NA NA NA NA NA NA NA NA NA
## NA.3 NA NA NA NA NA NA NA NA NA NA NA
## NA.4 NA NA NA NA NA NA NA NA NA NA NA
## NA.5 NA NA NA NA NA NA NA NA NA NA NA

I realised that the break could be replaced with a next if I moved the if statement up a level, outside of the j for loop.
Also, I don’t have to do the duplicate visit identification and removal separately.
I only need to consider the visits after the current one, hence starting at column i + 1.
This produces a much cleaner chunk of code.

from_visit_seq <- head(seq_along(zz), -1)

for (i in from_visit_seq){

  if (nrow(out[[i]]) == 0) next

  for (j in out[[i]]$person_id){

    # stop at first time outside window
    t <- i + 1
    out_person <- out[[i]][out[[i]]$person_id == j, ]

    # remove these repeat visits in list element
    while(out_person[, t] <= time_window & !is.na(out_person[, t])){
      out[[t]] <- out[[t]][out[[t]]$person_id != j, ]
      t <- t + 1
    }
  }
}

lapply(out[1:3], function(x) x[1:6, 1:11])

## [[1]]
##    1   2    3   4   5    6    7    8    9 10 11
## 1 NA  20   30  NA  NA   NA   NA   NA   NA NA NA
## 2 NA  35   NA  NA  NA   NA   NA   NA   NA NA NA
## 3 NA   6  286 420 803 1026 1290 1546 1633 NA NA
## 4 NA  11  882  NA  NA   NA   NA   NA   NA NA NA
## 5 NA  47  154 168 182  301  329   NA   NA NA NA
## 6 NA 161 1218  NA  NA   NA   NA   NA   NA NA NA
## 
## [[2]]
##      1  2    3    4    5  6  7  8  9 10 11
## 273 NA NA 2069   NA   NA NA NA NA NA NA NA
## 398 NA NA  831 1317 1321 NA NA NA NA NA NA
## 494 NA NA  995 1038   NA NA NA NA NA NA NA
## 504 NA NA  258   NA   NA NA NA NA NA NA NA
## 678 NA NA  578  672 1519 NA NA NA NA NA NA
## NA  NA NA   NA   NA   NA NA NA NA NA NA NA
## 
## [[3]]
##       1  2  3  4  5  6  7  8  9 10 11
## NA   NA NA NA NA NA NA NA NA NA NA NA
## NA.1 NA NA NA NA NA NA NA NA NA NA NA
## NA.2 NA NA NA NA NA NA NA NA NA NA NA
## NA.3 NA NA NA NA NA NA NA NA NA NA NA
## NA.4 NA NA NA NA NA NA NA NA NA NA NA
## NA.5 NA NA NA NA NA NA NA NA NA NA NA

Finally, include a visit count column and remove empty dataframes.

names(out) <- as.character(seq_along(out) + 1)

for (i in names(out)){
  if (nrow(out[[i]]) == 0) out[[i]] <- NULL
  else out[[i]]$visit <- as.numeric(i)
}

Stack the list elements and add the initial visits back in, to give all visits with duplicates removed.

res <-
  purrr::map_dfr(out, `[`, c("person_id", "visit")) %>% 
  merge(dat) %>% 
  rbind.data.frame(dat[dat$visit == 1, ]) %>% 
  arrange(person_id, visit)

kable(head(res))

person_id  visit  uniq_id              date
14F00004   1      14F00004_2014-05-07  2014-05-07
14F00004   2      14F00004_2014-05-27  2014-05-27
14F00006   1      14F00006_2014-05-07  2014-05-07
14F00016   1      14F00016_2014-05-07  2014-05-07
14F00019   1      14F00019_2014-05-07  2014-05-07
14F00021   1      14F00021_2014-05-07  2014-05-07
nrow(res)

## [1] 64799