Introduction to vernabota

This document describes the methods and use of the package vernabota, which gapfills missing botanical names using vernacular names in Guyafor census data. The objective is to obtain a chosen number of simulated communities in which individuals identified only by a vernacular name are given a botanical name, based on the probabilities of association between vernacular and botanical names. It is largely based on the work and code of Aubry-Kientz et al. (2013) and Mirabel (2018).

The models are described here.

We set a seed for reproducibility.

set.seed(56)

Some of the examples below require the package data.table.

library(data.table)

Preparing the data

Data that we want to gapfill

This algorithm works on a dataset formatted as it is when obtained using the function EcoFoG::Guyafor2df or from the online data platform of Paracou.

There can be several censuses for the same plot (i.e. several lines per individual tree).

Here, we take the example of data from plot 6, census of 2016, and use subplot 1 as the dataset that we want to gapfill. We call this dataset Data2fill.

In this dataset, the column VernName should not contain any special character such as é, è or œ (data from the Guyafor database should not have these special characters).
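The constraint on VernName can be checked before running the gapfilling. The following is a minimal sketch on a toy character vector (the names and the helper variable has_special are illustrative, not part of vernabota); on real data, VernName is a factor and would first be converted with as.character().

```r
# Toy vernacular names (hypothetical values); "\u00e8" is an e with a grave accent.
vern <- c("wapa", "gaulette", "c\u00e8dre", "angelique")

# TRUE for every name that contains at least one non-ASCII code point.
has_special <- vapply(
  vern,
  function(s) any(utf8ToInt(s) > 127L),
  logical(1),
  USE.NAMES = FALSE
)

# Names that would need cleaning before running the gapfilling.
vern[has_special]
```

Any name returned by the last line should be corrected in the input data before calling PrepData.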

We use the function PrepData to prepare the data.

library(vernabota)
data(Paracou6_2016)
Data2fill <- Paracou6_2016[Paracou6_2016$SubPlot==1,]
Data2fill <- PrepData(Data2fill)
str(Data2fill)
#> Classes 'data.table' and 'data.frame':   976 obs. of  26 variables:
#>  $ Forest             : chr  "Paracou" "Paracou" "Paracou" "Paracou" ...
#>  $ Plot               : int  6 6 6 6 6 6 6 6 6 6 ...
#>  $ PlotArea           : num  6.25 6.25 6.25 6.25 6.25 6.25 6.25 6.25 6.25 6.25 ...
#>  $ SubPlot            : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ idTree             : Factor w/ 976 levels "100621","100622",..: 1 2 3 4 5 6 7 8 9 10 ...
#>  $ Xfield             : num  23 28.5 14 9.5 10 5 5 3.5 5 8.5 ...
#>  $ Yfield             : num  236 248 210 126 128 ...
#>  $ Xutm               : num  286421 286423 286418 286435 286435 ...
#>  $ Yutm               : num  583171 583185 583144 583062 583063 ...
#>  $ Lat                : num  5.27 5.27 5.27 5.27 5.27 ...
#>  $ Lon                : num  -52.9 -52.9 -52.9 -52.9 -52.9 ...
#>  $ Family             : chr  "Euphorbiaceae" "Arecaceae" "Sapotaceae" "Humiriaceae" ...
#>  $ Genus              : chr  "Sandwithia" "Oenocarpus" "Micropholis" "Sacoglottis" ...
#>  $ Species            : chr  "guyanensis" "bataua" "guyanensis" "guianensis" ...
#>  $ BotaSource         : chr  "Bota" "Bota" "Bota" "Bota" ...
#>  $ BotaCertainty      : Factor w/ 6 levels "-1","0","1","2",..: 6 6 6 6 6 6 6 5 6 6 ...
#>  $ VernName           : Factor w/ 110 levels "-","acacia franc",..: 102 79 14 91 18 97 14 86 75 70 ...
#>  $ CensusYear         : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
#>  $ CensusDate         : chr  "2016-09-14" "2016-09-14" "2016-09-14" "2016-09-14" ...
#>  $ CensusDateCertainty: int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ CodeAlive          : int  1 1 1 1 1 1 1 1 1 0 ...
#>  $ MeasCode           : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ Circ               : num  42 59.5 74 132.5 46 ...
#>  $ CircCorr           : num  42 59.5 74 132.5 46 ...
#>  $ CorrCode           : chr  "0" "0" "0" "0" ...
#>  $ GenSp              : Factor w/ 188 levels "Abarema-jupunba",..: 148 116 107 147 169 186 128 74 117 88 ...
#>  - attr(*, ".internal.selfref")=<externalptr> 
#>  - attr(*, "index")= int(0) 
#>   ..- attr(*, "__BotaSource")= int [1:976] 1 2 3 4 5 6 7 8 9 10 ...

Prior: expert knowledge on possible associations

The prior is a data frame with vernacular names in columns and botanical names in rows (given in the three columns Family, Genus and Species). For a given vernacular name and a given botanical name, the value is 1 if the association is possible according to expert knowledge, and 0 if not.
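To make this layout concrete, here is a small sketch of a prior built by hand. The rows, the two vernacular columns and their 0/1 values are illustrative only and are not taken from the prior files shipped with the package.

```r
# Toy prior: one row per botanical name, one column per vernacular name.
# check.names = FALSE keeps column names as given (real names may contain spaces).
toy_prior <- data.frame(
  Family  = c("Fabaceae", "Fabaceae", "Lecythidaceae"),
  Genus   = c("Eperua", "Eperua", "Lecythis"),
  Species = c("falcata", "grandiflora", "persistens"),
  wapa    = c(1, 1, 0),
  maho    = c(0, 0, 1),
  check.names = FALSE
)

# Botanical names that this toy prior allows for the vernacular name "wapa".
allowed <- toy_prior[toy_prior[["wapa"]] == 1, c("Genus", "Species")]
paste(allowed$Genus, allowed$Species, sep = "-")
```

Reading a column from top to bottom therefore gives the set of botanical names that experts consider possible for that vernacular name.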

We propose three prior files resulting from the work of Madkaud (2012), updated using the code 5_Dev/Prior_Verna_Bota_Name_Cleaning/Prior_Verna_Bota_Name_Cleaning.Rmd in January 2022.

data(PriorAllFG_20220126)
PriorAllFG <- PriorAllFG_20220126
str(PriorAllFG[,1:10])
#> 'data.frame':    1657 obs. of  10 variables:
#>  $ Family          : chr  "Fabaceae" "Chrysobalanaceae" "Melastomataceae" "Opiliaceae" ...
#>  $ Genus           : chr  "Abarema" "Acioa" "Aciotis" "Agonandra" ...
#>  $ Species         : chr  "jupunba" "guianensis" "purpurascens" "silvatica" ...
#>  $ PresentInGuyaFor: logi  TRUE FALSE FALSE TRUE TRUE FALSE ...
#>  $ acacia franc    : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ acajou de guyane: num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ adugue          : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ aganananga      : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ aganiamai       : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ agui            : num  0 0 0 0 0 0 0 0 0 0 ...

data(PriorParacouNew_20220126)
PriorParacouNew <- PriorParacouNew_20220126
# str(PriorParacouNew[,1:10])

# data(PriorParacouOld_20220126)
# PriorParacouOld <- PriorParacouOld_20220126
# str(PriorParacouOld[,1:10])

We use the function PrepPrior to prepare the prior. Here we use the default settings because we want to remove from the prior the botanical names with undetermined species and the botanical names not in Guyafor. The reason is that these names would always lead to incorrect associations when using the CompareSim function (see below). However, one may decide to keep them, which would lead to:

possible associations with a botanical name of the form Genus-Indet., with BotaCorCode equal to "AssoByGenus" or "AssoByFam" (names kept when RemoveIndetSp=FALSE).
possible associations with a botanical name that has never been observed in Guyafor (names kept when RemoveNotGuyafor=FALSE).

In these latter cases, only the prior information would be used.
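The effect of the default filtering can be illustrated on a toy prior. This is only a sketch of the idea, not the implementation of PrepPrior: it assumes that undetermined species are coded as "Indet." in the Species column and that the PresentInGuyaFor column flags names observed in Guyafor, and the 0/1 values are illustrative.

```r
# Toy prior before preparation (values are illustrative only).
toy_prior <- data.frame(
  Family           = c("Fabaceae", "Fabaceae", "Chrysobalanaceae"),
  Genus            = c("Abarema", "Abarema", "Acioa"),
  Species          = c("jupunba", "Indet.", "guianensis"),
  PresentInGuyaFor = c(TRUE, TRUE, FALSE),
  `acacia franc`   = c(0, 0, 0),
  check.names = FALSE
)

# Assumed filtering rule: keep rows with a determined species
# that have been observed in Guyafor.
keep <- toy_prior$Species != "Indet." & toy_prior$PresentInGuyaFor

# Drop the helper column, as in the prepared prior shown below.
filtered <- toy_prior[keep, setdiff(names(toy_prior), "PresentInGuyaFor")]
filtered
```

Only the Abarema jupunba row remains, which mirrors the reduction in the number of rows visible in the str output of the prepared prior.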

PriorAllFG <- PrepPrior(PriorAllFG)
str(PriorAllFG[,1:10])
#> Classes 'data.table' and 'data.frame':   681 obs. of  10 variables:
#>  $ Family          : chr  "Fabaceae" "Opiliaceae" "Lauraceae" "Lauraceae" ...
#>  $ Genus           : chr  "Abarema" "Agonandra" "Aiouea" "Aiouea" ...
#>  $ Species         : chr  "jupunba" "silvatica" "guianensis" "laevis" ...
#>  $ acacia franc    : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ acajou de guyane: num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ adugue          : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ aganananga      : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ aganiamai       : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ agui            : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ agusiton        : num  0 0 0 0 0 0 0 0 0 0 ...
#>  - attr(*, ".internal.selfref")=<externalptr>

PriorParacouNew <- PrepPrior(PriorParacouNew)
# str(PriorParacouNew[,1:10])
# 
# PriorParacouOld <- PrepPrior(PriorParacouOld)
# str(PriorParacouOld[,1:10])

Observation data to update the prior

To build the matrix of association between vernacular and scientific names, we can use either the same dataset as the one we want to gapfill, or another dataset. The user needs to think this choice through carefully. Using the same dataset can lead to underestimating diversity, as it assumes that no species can disperse in from outside. Using too wide a dataset could lead to associating species that are not present in the area.

There can be several censuses for the same plot (i.e. several lines per individual tree).
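As an illustration of what such a matrix of association looks like, the sketch below cross-tabulates botanical and vernacular names for a handful of fully identified toy trees. The data are invented and the computation is a simplified stand-in, not the code used internally by the package.

```r
# Toy observations: trees with both a vernacular and a botanical name.
toy_obs <- data.frame(
  VernName = c("wapa", "wapa", "wapa", "maho", "maho"),
  GenSp    = c("Eperua-falcata", "Eperua-falcata", "Eperua-grandiflora",
               "Lecythis-persistens", "Lecythis-persistens")
)

# Counts of each botanical name (rows) per vernacular name (columns).
counts <- table(toy_obs$GenSp, toy_obs$VernName)

# Observed frequencies: each column sums to 1.
freq <- prop.table(counts, margin = 2)
freq
```

With a narrow dataset, a vernacular name can only be associated with the few species seen locally; with a wide one, the columns spread over more species, some of which may not occur in the plot of interest.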

In this dataset, the column VernName should not contain any special character such as é, è or œ (data from the Guyafor database should not have these special characters).

Here, we use data from plot 6 (all four subplots), census of 2016. We call this dataset DataAsso.

We use the function PrepData to prepare the data.

DataAsso <- Paracou6_2016
DataAsso <- PrepData(DataAsso)
str(DataAsso)
#> Classes 'data.table' and 'data.frame':   3620 obs. of  26 variables:
#>  $ Forest             : chr  "Paracou" "Paracou" "Paracou" "Paracou" ...
#>  $ Plot               : int  6 6 6 6 6 6 6 6 6 6 ...
#>  $ PlotArea           : num  6.25 6.25 6.25 6.25 6.25 6.25 6.25 6.25 6.25 6.25 ...
#>  $ SubPlot            : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ idTree             : Factor w/ 3620 levels "100621","100622",..: 1 2 3 4 5 6 7 8 9 10 ...
#>  $ Xfield             : num  23 28.5 14 9.5 10 5 5 3.5 5 8.5 ...
#>  $ Yfield             : num  236 248 210 126 128 ...
#>  $ Xutm               : num  286421 286423 286418 286435 286435 ...
#>  $ Yutm               : num  583171 583185 583144 583062 583063 ...
#>  $ Lat                : num  5.27 5.27 5.27 5.27 5.27 ...
#>  $ Lon                : num  -52.9 -52.9 -52.9 -52.9 -52.9 ...
#>  $ Family             : chr  "Euphorbiaceae" "Arecaceae" "Sapotaceae" "Humiriaceae" ...
#>  $ Genus              : chr  "Sandwithia" "Oenocarpus" "Micropholis" "Sacoglottis" ...
#>  $ Species            : chr  "guyanensis" "bataua" "guyanensis" "guianensis" ...
#>  $ BotaSource         : chr  "Bota" "Bota" "Bota" "Bota" ...
#>  $ BotaCertainty      : Factor w/ 6 levels "-1","0","1","2",..: 6 6 6 6 6 6 6 5 6 6 ...
#>  $ VernName           : Factor w/ 156 levels "-","acacia franc",..: 146 112 23 131 29 138 23 123 105 100 ...
#>  $ CensusYear         : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
#>  $ CensusDate         : chr  "2016-09-14" "2016-09-14" "2016-09-14" "2016-09-14" ...
#>  $ CensusDateCertainty: int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ CodeAlive          : int  1 1 1 1 1 1 1 1 1 0 ...
#>  $ MeasCode           : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ Circ               : num  42 59.5 74 132.5 46 ...
#>  $ CircCorr           : num  42 59.5 74 132.5 46 ...
#>  $ CorrCode           : chr  "0" "0" "0" "0" ...
#>  $ GenSp              : Factor w/ 321 levels "Abarema-jupunba",..: 253 193 175 251 289 316 215 118 195 142 ...
#>  - attr(*, ".internal.selfref")=<externalptr> 
#>  - attr(*, "index")= int(0) 
#>   ..- attr(*, "__BotaSource")= int [1:3620] 1 2 3 4 5 6 7 8 9 10 ...

Running some simulations using the function SimFullCom

NB: for these examples, a low number of simulations is used. For real tests, a higher number of simulations should be performed.

The SimFullCom function returns the original dataset with two additional columns:

GensSpCor: The Genus and species after gap filling.
BotaCorCode: the type of correction (see section Possible types of gapfilling in this vignette, and the help of the SimFullCom function).

In cases where the original data contained several censuses of the same plot (i.e. several lines per individual tree), the output keeps all censuses. For a given simulation, all observations of the same tree are given the same botanical name.
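This property can be verified on the output of a simulation. The sketch below uses a toy data frame in place of one element of the list returned by SimFullCom; only the column names idTree and GensSpCor come from the package output, the values are invented.

```r
# Toy stand-in for one simulated community with two censuses per tree.
toy_sim <- data.frame(
  idTree     = c("100621", "100621", "100622", "100622"),
  CensusYear = c(2015, 2016, 2015, 2016),
  GensSpCor  = c("Eperua-falcata", "Eperua-falcata",
                 "Lecythis-persistens", "Lecythis-persistens")
)

# Number of distinct gap-filled names per tree across censuses.
n_names <- tapply(toy_sim$GensSpCor, toy_sim$idTree,
                  function(x) length(unique(x)))

# TRUE when every tree keeps a single botanical name within the simulation.
all(n_names == 1)
```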

Example 1: using the same dataset for Data2fill and DataAsso, without prior

DataNSim <- SimFullCom(Data2fill, NSim=2, eps=0.01)
str(DataNSim, max.level = 1)
#> List of 2
#>  $ :Classes 'data.table' and 'data.frame':   976 obs. of  28 variables:
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr "idTree"
#>  $ :Classes 'data.table' and 'data.frame':   976 obs. of  28 variables:
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr "idTree"
colnames(DataNSim[[1]])
#>  [1] "idTree"              "Forest"              "Plot"               
#>  [4] "PlotArea"            "SubPlot"             "Xfield"             
#>  [7] "Yfield"              "Xutm"                "Yutm"               
#> [10] "Lat"                 "Lon"                 "Family"             
#> [13] "Genus"               "Species"             "BotaSource"         
#> [16] "BotaCertainty"       "VernName"            "CensusYear"         
#> [19] "CensusDate"          "CensusDateCertainty" "CodeAlive"          
#> [22] "MeasCode"            "Circ"                "CircCorr"           
#> [25] "CorrCode"            "GenSp"               "GensSpCor"          
#> [28] "BotaCorCode"
table(DataNSim[[1]]$BotaCorCode)
#> 
#>    fullyDet   Det2Genus       NoCor   AssoByFam  AssoByVern     Det2Fam 
#>         953           2           1           2          11           1 
#> AssoByGenus 
#>           6
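Since the output is a list with one data frame per simulation, summary statistics can be computed over the list. As a hedged sketch, the code below computes species richness per simulated community on toy data frames standing in for the elements of DataNSim (the values are invented).

```r
# Toy stand-ins for two simulated communities.
toy_sims <- list(
  data.frame(GensSpCor = c("Eperua-falcata", "Eperua-falcata",
                           "Lecythis-persistens")),
  data.frame(GensSpCor = c("Eperua-falcata", "Eperua-grandiflora",
                           "Lecythis-persistens"))
)

# Number of distinct botanical names in each simulated community.
richness <- vapply(toy_sims,
                   function(d) length(unique(d$GensSpCor)),
                   integer(1))
richness
```

The spread of such a statistic across simulations gives an idea of the uncertainty introduced by the gapfilling, which is why a larger NSim is advisable for real analyses.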

Example 2: using different datasets for Data2fill and DataAsso, with a prior (different weighting of the prior and the observations)

Here we have a weight of 0.2 for the prior and of 0.8 for the observations.

DataNSim <- SimFullCom(Data2fill=Data2fill, DataAsso=DataAsso, 
                       prior=PriorAllFG, wp=0.2, NSim=2, eps=0.01)
#str(DataNSim, max.level = 1)
#colnames(DataNSim[[1]])
table(DataNSim[[1]]$BotaCorCode)
#> 
#>    fullyDet   AssoByFam  AssoByVern     Det2Fam   Det2Genus AssoByGenus 
#>         953           2          12           1           1           7
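To give an intuition of what weighting the prior and the observations means, the sketch below mixes a normalised prior and normalised observed frequencies for a single vernacular name with wp = 0.2. This linear mixture is an assumption made for illustration; it is not necessarily the formula used by SimFullCom, and the role of eps is not modelled here. All values are invented.

```r
# Botanical names allowed by a toy prior for one vernacular name (1 = possible).
prior_w <- c("Eperua-falcata" = 1, "Eperua-grandiflora" = 1,
             "Eperua-rubiginosa" = 1)

# Toy counts of fully identified trees bearing that vernacular name.
obs_n <- c("Eperua-falcata" = 8, "Eperua-grandiflora" = 2,
           "Eperua-rubiginosa" = 0)

wp <- 0.2  # weight of the prior; the observations get 1 - wp

# Normalise both sources, then mix them (assumed scheme, for illustration).
prior_p <- prior_w / sum(prior_w)
obs_p   <- obs_n / sum(obs_n)
mix_p   <- wp * prior_p + (1 - wp) * obs_p
round(mix_p, 3)
```

In this toy case, a species allowed by the prior but never observed still receives a non-zero probability, while the observed frequencies dominate the result.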

Example 3: getting the most likely associations (using Determ=TRUE)

As we want the most likely associations rather than random draws, we set NSim to 1.

DataNSim <- SimFullCom(Data2fill=Data2fill, DataAsso=DataAsso, 
                       prior=PriorAllFG, wp=0.2, NSim=1, eps=0.01, Determ=TRUE)
#str(DataNSim, max.level = 1)
#colnames(DataNSim[[1]])
table(DataNSim[[1]]$BotaCorCode)
#> 
#>          fullyDet   AssoByFamDeterm  AssoByVernDeterm           Det2Fam 
#>               953                 2                12                 1 
#>         Det2Genus AssoByGenusDeterm 
#>                 1                 7
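The difference between a random and a deterministic association can be sketched for a single tree. The probability vector below is invented, and the code only illustrates the general idea of drawing a name at random versus picking the most probable one; it is not the internal code of SimFullCom.

```r
set.seed(56)

# Toy association probabilities for one vernacular name.
probs <- c("Eperua-falcata" = 0.7, "Eperua-grandiflora" = 0.2,
           "Eperua-rubiginosa" = 0.1)

# Random association: the name can differ from one simulation to the next.
random_draws <- sample(names(probs), size = 5, replace = TRUE, prob = probs)

# Deterministic association: always the name with the highest probability.
most_likely <- names(probs)[which.max(probs)]
most_likely
```

Because the deterministic choice gives the same result every time, repeating the simulation brings no additional information, hence NSim=1.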

Comparing different settings for the simulations using the function CompareSim

See the article.

Bibliography

Aubry-Kientz, Mélaine, Bruno Hérault, Charles Ayotte-Trépanier, Christopher Baraloto, and Vivien Rossi. 2013. “Toward Trait-Based Mortality Models for Tropical Forests.” Edited by Francesco de Bello. PLoS ONE 8 (5): e63678. https://doi.org/10.1371/journal.pone.0063678.

Madkaud, Jules-Maurice. 2012. “Mettre à plat les correspondances entre noms vernaculaires et identités botaniques des espèces présentes sur le site de Paracou (Guyane Française).” PhD thesis, Université des Antilles et de la Guyane.

Mirabel, Ariane. 2018. “Réponse et Résilience de la Biodiversité d’une Forêt Tropicale après Perturbation.” PhD thesis, Université de Guyane.