# News

November 15, 2016 - Alun Jonesdata products modelling

# Occupancy Modelling in Marine Data - An example use case with OBIS

### Summary

We have produced a proof-of-concept note on the application of occupancy modelling to extract robust temporal trends from unstructured data sources such as OBIS. This post describes the problems of extracting trends when survey effort is patchy in space and time. It introduces the concept of occupancy modelling to address this, and shows how the methods applied to more systematic surveys of specific taxonomic groups in terrestrial systems can be adapted to the less structured OBIS data, where data from a whole range of different kinds of surveys, targeting different habitats and taxonomic groups, are amalgamated. Occupancy models work by modelling the site level data collection process that led to the input data to estimate the probability that a species unrecorded at a site was in fact there but unobserved. Doing so goes some way to combat biases introduced by variable recording practices, surveyor efforts, and survey goals, both within and between datasets. The method is applied to 771 species of Celtic Sea mollusc, over the period 1950-present day, and is implemented by adapting existing R packages, in particular sparta, to use a modified occupancy model allowing for dynamic site occupancy. This allows us to produce a trend line similar to the well-known Living Planet Index, indicating that molluscs appear to have declined in occupancy in the Celtic Sea. The method requires further testing, including ground-truthing trends against independent, systematic survey data for selected taxa, before results can be applied in a policy context. This is the topic of Alun Jones’s ongoing PhD research. However, the preliminary analysis shows that occupancy modelling does present a viable method of tracking changes in ocean biodiversity and identifying potentially at risk species, exploiting the largely underused temporal dimension in unstructured OBIS data. This could significantly expand the range of marine taxa that can be included in synthetic indices of the state of marine biodiversity.

### Marine Biodiversity Indices

Global marine biodiversity is changing in response to numerous threats from anthropogenic sources, including climate change, over exploitation and pollution. In the face of such threats, it is important that we develop robust marine biodiversity indices, both to monitor past and future change and to identify conservation priorities and at risk species and areas. While examples of marine biodiversity indices exist, for example the Living Blue Planet Report, they tend to rely by necessity on well studied systems with data-rich time series, typically economically important or charismatic vertebrate species. However, the extent to which trends in these relatively few systems represent marine biodiversity as a whole is largely unknown. Moreover, long-running data rich time series are few, and confined to specific geographic areas and taxa, therefore extrapolating more widely from these trends is potentially unwise. Achieving a taxanomically comprehensive and robust marine biodiversity index therefore requires that we employ methods to utilise all data available to us in order to maximize taxonomic and geographic representation.

OBIS represents the largest available repository of marine biodiversity data, and an ideal candidate for the production of such biodiversity indices, however to date these indices are still lacking. Part of the reason for this is likely due to the difficulty of drawing robust conclusions from OBIS trend data, as it is compiled from multiple separate data sources, each with their own methodologies, goals, biases, and gaps. Ensuring these datasets are comparable is an essential first step in assessing trends in marine biodiversity, however doing so is not a trivial task. One method for achieving this that is gaining popularity in terrestrial research is occupancy modelling (e.g. MacKenzie et al 2005, Isaac et al 2014). Occupancy models work by modelling the site level data collection process that led to the input data to estimate the probability that a species unrecorded at a site was in fact there but unobserved. Doing so goes some way to combat biases introduced by variable recording practices, surveyor efforts, and survey goals, both within and between datasets.

Here we present our ongoing work adapting occupancy modelling methods to marine data, and the steps required to process OBIS data into a form suitable for occupancy modelling - here using molluscs in the Celtic Sea as an example.

### Methodology

The first stage in the process is to download all datasets from OBIS in their entirety that contain Celtic Sea mollusc records. The justification behind downloading the complete datasets that contain molluscs is to achieve a dataset receptive to occupancy modelling. One assumption made when running the model is that if a survey records molluscs at one site, then it is likely that they would have recorded molluscs at any other site they were observed. Therefore sites in these surveys exist at any given time in one of three states: molluscs present, molluscs present but unrecorded, and molluscs absent, but are only recorded as being in one of two states: molluscs present or molluscs absent. The purpose of the modelling framework is to estimate the probability that sites recorded as ‘molluscs absent’ are actually in the state ‘molluscs present but unrecorded’. It is more difficult to justify the same assumption of the recorders for datasets that have no mollusc records.

Since data availability drops significantly in these datasets before 1950, only records from 1950 onward are retained for analysis. The records are then assigned site identifiers based on their position on a 1x1 degree latitude-longitude grid:

Site are assigned because occupancy models work on the site level, estimating probabilities of site-level presence or absence, and site-level species lists are required for the model to run successfully. We choose here to use a 1 degree site grid because this presents the best compromise between geographic resolution and data availability. Smaller sites typically lack the data required for the model to output credible results, and larger sites are less informative. Because of these competing factors, we feel that 1 degree is a reasonable compromise.

Having retrieved, cleaned and processed the data into sites, the list of taxa included in the data undergoes taxonomic cleaning using the R package taxize.

#install.packages("devtools")
#devtools::install_github("ropensci/taxize")
library(taxize)

tname_out <- gnr_resolve(names = molluscs$tname)  All taxa identified by taxize to have a match score of greater than 0.9 are taken as correct. Taxa names with a score of less than 0.9 are then run through the WoRMS taxa matching system as a secondary test. Records for unknown or ambiguous taxa are removed. #### Closure periods and occupancy modelling Occupancy modelling can be thought of as working on two time scales: the ‘visit’ and ‘year’ time scales. Multiple visits to a site are conducted within a year, and the assumption of a year being a ‘closed’ period means that within a year there is no colonization or extinction at a site. In the typical biological survey data that occupancy modelling was designed for, these visits and years referred to actual visits to a site in particular years, however due to the nature of OBIS data, we need to be more liberal in the definitions of ‘visit’ and ‘year’ used. To get the most robust results possible out of the OBIS data we use here, we have assigned records to artificial ‘years’ of approximately equal data availability. For example, the period of 1950-1985 has approximately the same amount of data as the year 2006, and as such these two periods are defined as their own artificial ‘years’ for the purpose of this analysis. In this case, we also removed from analysis two sites with a high proportion (>33%) of ‘years’ with no data. Next, OBIS data are formatted for occupancy modelling using the R package sparta (sparta at github, based on Isaac et al 2014): #install.packages("devtools") #devtools::install_github("biologicalrecordscentre/sparta") library(sparta) molluscs <- read.csv("molluscs.csv") head(molluscs)  ## clean_name site analysis_date ## 1 Coryphaenoides rupestris 12 1-09-01 ## 2 Dictenophiura carnea 13 1-09-01 ## 3 Echinocyamus pusillus 13 1-09-01 ## 4 Lepidion lepidion 12 1-09-01 ## 5 Labidoplax digitata 13 1-09-01 ## 6 Psilaster andromeda 12 1-09-01  occData <- formatOccData(taxa = molluscs$clean_name,
site = molluscs$site, time_period = as.Date(molluscs$analysis_date))

str(occData)

##  List of 2
##   $spp_vis :'data.frame': 1433 obs. of 4974 variables: ## ..$ visit                                        : chr [1:1433] "110001-03-01" "110001-06-01" ...
##    ..$Abietinaria : logi [1:1433] FALSE FALSE FALSE ... ## ..$ Abietinaria abietina                         : logi [1:1433] FALSE FALSE FALSE ...
##    ..$Abietinaria filicula : logi [1:1433] FALSE FALSE FALSE ... ## ..$ Abludomelita                                 : logi [1:1433] FALSE FALSE FALSE ...
##    ..$Abludomelita gladiosa : logi [1:1433] FALSE FALSE FALSE ... ## ..$ Abludomelita obtusata                        : logi [1:1433] FALSE FALSE FALSE ...
##    ..$Abra : logi [1:1433] FALSE FALSE FALSE ... ## ..$ Abra alba                                    : logi [1:1433] FALSE FALSE FALSE ...
##    ..$Abra longicallus : logi [1:1433] FALSE FALSE FALSE ... ## .. [list output truncated] ##$ occDetdata:'data.frame':	1433 obs. of  4 variables:
##    ..$visit: chr [1:1433] "110001-03-01" "110001-06-01" ... ## ..$ site : num [1:1433] 11 11 11 11 11 ...
##    ..$L : int [1:1433] 29 34 56 336 9 ... ## ..$ year : num [1:1433] 1 1 1 2 2 ...


Doing so results in an R object containing 2 dataframes to be passed to sparta: spp_vis that defines species presence or absence at each site/date combination (or “visit”), and occDetdata that defines species list length L at each visit. This data is then used to run the sparta occupancy model, via the occDetFunc function, by passing the function a list of the taxa that we want to model, for example:

taxa <- as.character(c("Limacina retroversa",
"Nucella lapillus",
"Loligo forbesii",
"Calliostoma zizyphinum"))

occ_out<-NULL

system.time(
for (i in 1:length(taxa)){

occ_out[[i]] <- occDetFunc(taxa_name = taxa[i],
occDetdata = occData$occDetdata, spp_vis = occData$spp_vis,
n_iterations = 50000,
burnin = 5000,
n_chains = 2,
model.file = occDetModel_mod)

}
)

##      user   system  elapsed
##  1646.554   12.391 1685.961


The occDetFunc calls the occupancy detection model included in the sparta package, however in our case we specify a modified model that allows for dynamic site occupancy. In this example the model is run for a subset of Celtic Sea mollusc species - be warned that running this for large selections of taxa, large numbers of sites, or large datasets in general is computationally and time intensive.

### Output

Here we can see a truncated example of the data output by sparta:

head(occ_out[[1]]$BUGSoutput$summary)

##                        mean          sd          2.5%           25%
##  LL.p          5.770886e-02  0.06432965   -0.07088525    0.01490325
##  deviance      1.283903e+03 19.63791983 1246.34788284 1270.60604526
##  mu.lp         1.424553e-01  0.39250597   -0.62676159   -0.11221911
##  pdet.alpha[1] 8.467016e-01  0.05114728    0.73123753    0.81565531
##  pdet.alpha[2] 7.666418e-01  0.06317812    0.63063767    0.72633632
##  pdet.alpha[3] 6.813377e-01  0.07145600    0.53192257    0.63441827
##                         50%          75%        97.5%     Rhat n.eff
##  LL.p          5.854692e-02    0.1006938    0.1841080 1.001767  1900
##  deviance      1.283465e+03 1296.7990638 1323.8365164 1.001089 12000
##  mu.lp         1.395818e-01    0.3978021    0.9216081 1.001168  7400
##  pdet.alpha[1] 8.524120e-01    0.8839264    0.9298976 1.001056 16000
##  pdet.alpha[2] 7.713311e-01    0.8123690    0.8760780 1.000978 30000
##  pdet.alpha[3] 6.846930e-01    0.7322488    0.8102692 1.001291  4600


And trends for the four species in the example code above:

By running this model for all 771 species of mollusc that we have data for, we can have a (albeit slightly confused) glimpse at general molluscs trends:

Or we can take an approach more akin to the Living Planet Index - percentage change compared to a baseline. Here we assess the percentage change in mean species occupancy compared to a pre-1985 baseline, finding a ~22% decrease in site occupancy by Celtic Sea molluscs since 1985:

This suggests that species on average occupy ~22% fewer Celtic Sea sites that they did 30 years ago. The reason for this is open to speculation (e.g.: range shifts, over-exploitation), and further monitoring and analysis is warranted to both validate and determine the cause (or causes) of these trends. Likewise, comparing these trends to those obtained from independent systematic surveys of a small number of taxa would further validate the results presented here. However this analysis does suggest that, while different species may be changing in occupancy in very different ways, molluscs overall occur in less sites now than they did 30 years ago. The great strength of occupancy modelling is its flexibility, and in future we intend to capitalize on this by including covariates in the model to achieve more accurate predictions. However, even in this rough preliminary analysis we can see that occupancy modelling does present a viable method of tracking changes in ocean biodiversity and identifying potentially at risk species, using as-of-yet under-utilised and unstructured marine biodiversity data.

For questions on the methods applied please contact Alun Jones AHJones1@sheffield.ac.uk. The full code of the modified model will be publicly available after publication.

This post was supported by the DIPS-4-Ocean Assessments project (funded through the Flanders UNESCO Science Trust Fund), which aims to develop biodiversity indices based on OBIS to support global assessments on the state of the marine environment.