The purpose is to build a Shiny app that allows users to explore the habitat shapes of trees in Europe. These habitat shapes are plots of existing trees from a merged database of close to 9 million trees. We extracted the climatic variables present at each tree location. The shapes allow users to compare their own location, such as a potential planting site, against past and projected future climate.
Here’s an example of what we mean by tree suitability: the scatter points are tree occurrences, with points from one location in the Harz mountains shown at different points in time, including a 2070 projection. That place has already fallen out of the habitat of Picea abies.
Infrastructure
The docker-compose file outlines the containers and is a good reference to start.
ETL Container
This container runs the main ETL script: fetching tree location data, downloading bioclimatic variables, extracting them at each tree location, and writing everything to PostGIS.
Postgis Database Container
All processed data from the ETL pipeline ends up here in a few tables. We chose not to write rasters to PostGIS.
Shiny Container
The actual app runs as a Docker swarm service and connects to the PostGIS database.
Treeful ETL Pipeline: Obtaining and transforming tree occurrence data
This section documents how tree occurrence data is obtained. At the end, we will have our tree locations with the corresponding bioclimatic variables in a PostGIS database.
Get Climate Projection data & other external datasets
First, we obtain data not included in this repository, namely EU Forest and Copernicus CDS raster files. These downloads will take a while. (Some datasets are neither downloaded nor included in this repo; they have been shared with us privately.)
We have used Copernicus CMIP5 regional projections, with the RCP4.5 and RCP8.5 experiments from the NorESM1-M (NCC, Norway) model. The datasets can be built and obtained here. Our code contains a short helper script for the automated download (set the env vars for your CDS login beforehand).
Code
#################### Trees4F from figshare ###################
if (!file.exists("2_Data/0_raw_data/EUforestspecies_AMauri.csv")) {
  download.file("https://springernature.figshare.com/ndownloader/files/6662535",
                destfile = "2_Data/0_raw_data/EUforestspecies_AMauri.csv")
}

#################### COPERNICUS CDS Download ###################
# REGISTER YOURSELF AND ENTER DETAILS FROM HERE:
# https://cds.climate.copernicus.eu/user/register
# Beware the user is the UID at the bottom of your user page.
uid <- Sys.getenv("COPERNICUS_UID")
wf_set_key(user = uid,
           # key = Sys.getenv("COPERNICUS_KEY"),
           service = "cds")

request <- list(
  region = "europe",
  variable = c("annual_mean_temperature", "annual_precipitation", "isothermality",
               "maximum_temperature_of_warmest_month", "mean_diurnal_range",
               "mean_temperature_of_coldest_quarter", "mean_temperature_of_driest_quarter",
               "mean_temperature_of_warmest_quarter", "mean_temperature_of_wettest_quarter",
               "minimum_temperature_of_coldest_month", "precipitation_of_coldest_quarter",
               "precipitation_of_driest_month", "precipitation_of_driest_quarter",
               "precipitation_of_warmest_quarter", "precipitation_of_wettest_month",
               "precipitation_of_wettest_quarter", "precipitation_seasonality",
               "temperature_annual_range", "temperature_seasonality",
               "volumetric_soil_water", "aridity"),
  derived_variable = "annual_mean",
  model = "noresm1_m",
  ensemble_member = "r1i1p1",
  experiment = c("rcp4_5", "rcp8_5"),
  statistic = "mean",
  version = "1.0",
  format = "zip",
  dataset_short_name = "sis-biodiversity-cmip5-regional",
  target = "download.zip"
)
ncfile <- wf_request(user = uid, request = request, transfer = TRUE,
                     path = "2_Data/0_raw_data/future", verbose = FALSE)

request <- list(
  region = "europe",
  origin = "era5",
  variable = c("annual_mean_temperature", "annual_precipitation", "isothermality",
               "maximum_temperature_of_warmest_month", "mean_diurnal_range",
               "mean_temperature_of_coldest_quarter", "mean_temperature_of_driest_quarter",
               "mean_temperature_of_warmest_quarter", "mean_temperature_of_wettest_quarter",
               "minimum_temperature_of_coldest_month", "precipitation_of_coldest_quarter",
               "precipitation_of_driest_month", "precipitation_of_driest_quarter",
               "precipitation_of_warmest_quarter", "precipitation_of_wettest_month",
               "precipitation_of_wettest_quarter", "precipitation_seasonality",
               "temperature_annual_range", "temperature_seasonality",
               "volumetric_soil_water"),
  derived_variable = "annual_mean",
  statistic = "mean",
  version = "1.0",
  format = "zip",
  dataset_short_name = "sis-biodiversity-era5-regional",
  target = "download.zip"
)
ncfile <- wf_request(user = uid, request = request, transfer = TRUE,
                     path = "2_Data/0_raw_data/past", verbose = FALSE)

if (file.exists("2_Data/0_raw_data/past/download.zip")) {
  utils::unzip("2_Data/0_raw_data/past/download.zip", exdir = "2_Data/0_raw_data/past/")
  file.remove("2_Data/0_raw_data/past/download.zip")
}
if (file.exists("2_Data/0_raw_data/future/download.zip")) {
  utils::unzip("2_Data/0_raw_data/future/download.zip", exdir = "2_Data/0_raw_data/future/")
  file.remove("2_Data/0_raw_data/future/download.zip")
}
Getting Tree Location Data
To produce reliable climate envelopes we want to obtain as many trees as possible from the various biogeographical regions of Europe. This section details how we process them and produce a database with these occurrences:
Note
We are not doing any sampling on tree occurrences. This means that small areas with many tree records will produce a higher density in the climate envelopes. In the current state of our app and plots, this density bias is visible to the user.
Academic data sources
We used the TRY plant trait database and EU Forest. Both contain tree locations and botanical names, and both are academic, high-quality datasets with extensive data cleaning in place. We used EU Forest to generate our master list of botanical names; only species on this list are used downstream. Simple fuzzy matching ensures that Sorbus Torminalis and Sorbus-torminalis are both matched to Sorbus torminalis (we used a conservative string distance of 1).
Code
############ Load Packages ############
if (!require(librarian)) install.packages("librarian")
library(librarian)
shelf(data.table, stringr, stringdist, fuzzyjoin, dplyr)

############ Define Function for fuzzy matching ############
flatten_names <- function(tree_name) {
  tree_name <- str_replace_all(tree_name, " ", "") %>%
    str_replace_all(., "\\.", "") %>%
    str_replace_all(., "\\-", "") %>%
    str_replace_all(., "\\'", "") %>%
    tolower(.)
}

get_fuzzy_plant_names <- function(plant_database, master_list, max_string_dist = 3) {
  # Prepare for matching
  # Extract unique names
  database_names <- plant_database[, .(latin_name = unique(latin_name))]
  # Flatten names
  master_list[, flat_name := flatten_names(latin_name)]
  database_names[, flat_name := flatten_names(latin_name)]
  # Match
  stringdist <- as.data.table(stringdist_join(database_names, master_list,
                                              by = c("flat_name" = "flat_name"),
                                              mode = "inner",
                                              method = "dl",
                                              max_dist = max_string_dist,
                                              distance_col = "dist"))
  # Clean matching
  stringdist <- stringdist[, .(data_set_name = latin_name.x,
                               master_list_name = latin_name.y,
                               distance = dist)]
  stringdist <- setorder(stringdist, -distance)
  return(stringdist)
  # As an example of possible other matching functions:
  # amatch <- amatch(database_names[,flat_name], master_list[,flat_name], maxDist = 0.1)
}

############ MAIN: Create master list of botanical names ############
# Load datasets
# Use trees4EU as master list, more coverage
master_list <- data.table::fread("2_Data/0_raw_data/EUforestspecies_AMauri.csv") %>%
  janitor::clean_names() %>%
  dplyr::select(latin_name = species_name) %>%
  distinct()
master_list <- na.omit(master_list)

############ OPEN TREES DATA BASE ############
# Load and prepare database
# Load database
open_trees_db <- fread("2_Data/1_output/all_merged.csv")
setnames(open_trees_db, "species", "latin_name")
open_trees_db <- na.omit(open_trees_db)
# Match db names
max_string_dist = 1
open_trees_matching_table <- get_fuzzy_plant_names(open_trees_db, master_list, max_string_dist)
open_trees_matching_table_selection <- open_trees_matching_table[distance <= 1, ]
# Filter database
open_trees_db_selection <- open_trees_db[open_trees_matching_table_selection[, .(data_set_name, master_list_name)],
                                         on = c(latin_name = "data_set_name"), nomatch = NULL]
# open_trees_db_selection2 <- open_trees_db[latin_name %in% open_trees_matching_table_selection[,data_set_name]]

############ TRY DATA BASE ############
# Load database
if (file.exists("2_Data/0_raw_data/tree_georef_3.txt")) {
  try_trees <- data.table::fread("2_Data/0_raw_data/tree_georef_3.txt") %>%
    janitor::clean_names()
  setnames(try_trees, "acc_species_name", "latin_name")
  try_trees <- na.omit(try_trees)
  # Match db names
  max_string_dist = 1
  try_trees_matching_table <- get_fuzzy_plant_names(try_trees, master_list, max_string_dist)
  try_trees_matching_table_selection <- try_trees_matching_table[distance <= 1, ]
  # Filter database
  try_trees_selection <- try_trees[try_trees_matching_table_selection[, .(data_set_name, master_list_name)],
                                   on = c(latin_name = "data_set_name"), nomatch = NULL]
  # open_trees_db_selection2 <- open_trees_db[latin_name %in% open_trees_matching_table_selection[,data_set_name]]
} else {
  try_trees_selection <- tibble(master_list_name = character(),
                                tree_georef_1_std_value = numeric(),
                                obs_data_std_value = numeric())
}

############ Trees4F DATA BASE ############
# Load and prepare database
# Load database
trees4f_db <- data.table::fread("2_Data/0_raw_data/EUforestspecies_AMauri.csv") %>%
  janitor::clean_names()
setnames(trees4f_db, "species_name", "latin_name")
trees4f_db <- na.omit(trees4f_db)
# Match db names
max_string_dist = 3
trees4f_db_matching_table <- get_fuzzy_plant_names(trees4f_db, master_list, max_string_dist)
trees4f_db_matching_table_selection <- trees4f_db_matching_table[distance <= 1, ]
# Filter database
trees4f_db_selection <- trees4f_db[trees4f_db_matching_table_selection[, .(data_set_name, master_list_name)],
                                   on = c(latin_name = "data_set_name"), nomatch = NULL]
# open_trees_db_selection2 <- open_trees_db[latin_name %in% open_trees_matching_table_selection[,data_set_name]]

# enhance master list with GBIF taxo IDs and write master list to file
tree_master_list <- master_list %>%
  mutate(gbif_taxo_id = name_backbone_checklist(name = .$latin_name)$usageKey) %>%
  # remove unmatched or genus level taxo matches
  filter(str_length(gbif_taxo_id) > 5 & !is.na(gbif_taxo_id))
fwrite(tree_master_list, "2_Data/1_output/eu_native_trees_master.csv")
tree_master_list <- fread("2_Data/1_output/eu_native_trees_master.csv")

rm(open_trees_db, open_trees_matching_table, open_trees_matching_table_selection,
   try_trees, try_trees_matching_table, try_trees_matching_table_selection,
   trees4f_db, trees4f_db_matching_table, trees4f_db_matching_table_selection)
#EOF
We finish this file with rgbif::name_backbone_checklist(), which obtains the GBIF taxon ID for each botanical name. These IDs become part of our master data list.
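As a minimal illustration of what that call returns (the species names below are just examples, not the master list itself):

library(rgbif)

# name_backbone_checklist() matches a vector of names against the GBIF backbone
# and returns one row per input name, including the backbone usageKey
matches <- name_backbone_checklist(c("Picea abies", "Sorbus torminalis"))
matches$usageKey  # the GBIF taxon IDs we keep as gbif_taxo_id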
European Tree Cadastres
We wanted to include the proliferating corpus of tree cadastres published by European cities. We collected two dozen or so data sources into this file. Since they are all unharmonized, extensive data cleaning happens here (we did not deal with CRS transformations but only kept datasets in EPSG:4326):
Code
##############
# This script will load various tree location data into one.
# Goal: have all in one db with harmonized botanical species name, X and Y coordinate in EPSG 4326
tree_dbs <- read_xlsx("2_Data/0_raw_data/opendata_trees.xlsx") %>%
  janitor::clean_names() %>%
  filter(suitable == "y" & epsg == "4326") %>%
  mutate(location = janitor::make_clean_names(location)) %>%
  mutate(botanical_col = tolower(botanical_col))
  # mutate(lon_col = janitor::make_clean_names(lon_col)) %>%
  # mutate(lat_col = janitor::make_clean_names(lat_col))

tree_dbs <- data.table(tree_dbs)
for (i in tree_dbs[, file_name]) {
  utils::unzip(zipfile = paste("2_Data/0_raw_data/tree_cadastres/zip/", i, ".zip", sep = ""))
}

for (i in 1:nrow(tree_dbs)) {
  ifelse(str_detect(tree_dbs$file_name[i], "csv"),
         assign(tolower(tree_dbs$location[i]),
                janitor::clean_names(read_delim(paste0("2_Data/0_raw_data/tree_cadastres/", tree_dbs$file_name[i])))),
         assign(tolower(tree_dbs$location[i]),
                janitor::clean_names(bind_cols(
                  jsonlite::read_json(paste0("2_Data/0_raw_data/tree_cadastres/", tree_dbs$file_name[i]), simplifyVector = TRUE)$features$properties,
                  jsonlite::read_json(paste0("2_Data/0_raw_data/tree_cadastres/", tree_dbs$file_name[i]), simplifyVector = TRUE)$features$geometry))))
  print(i)
}

# # Compress all the csv files with gzip
# tree_dbs <- data.table(tree_dbs)
# for (i in tree_dbs[,file_name]) {
#   db <- fread(paste("2_Data/0_raw_data/tree_cadastres/",i,sep=""))
#   fwrite(db,paste("2_Data/0_raw_data/tree_cadastres/",i,".gz",sep=""),compress="gzip")
# }
# tree_dbs[,file_name:=paste(file_name,".gz",sep="")]
# write.xlsx2(tree_dbs, "2_Data/0_raw_data/opendata_trees.xlsx", sheetName = "Sheet1",
#             col.names = TRUE, row.names = TRUE, append = FALSE)
#
# tree_dbs <- data.table(tree_dbs)
# for (i in tree_dbs[,location]) {
#   print(i)
#   assign(tolower(i), paste(fread(paste("2_Data/0_raw_data/tree_cadastres/",tree_dbs[location==i,file_name],sep=""))))
# }

# too small data to automate. manually merge mutate
#split_species <- filter(tree_dbs, str_detect(botanical_col, "\\+"))
#mutate(tourcoing, merged_species = paste0(genre, " ", espece)) %>% write_csv(file = "data/opentrees/tourcoing.csv")

merged_geo <- filter(tree_dbs, lon_col == lat_col)
sep_geo <- filter(tree_dbs, lon_col != lat_col)

######## unmerge merged geographies
for (i in 1:nrow(merged_geo)) {
  temp_db <- get(tolower(merged_geo$location[i]), envir = globalenv()) %>%
    dplyr::select(matches(paste0(merged_geo$botanical_col[i], "|", merged_geo$lon_col[i]), ignore.case = TRUE))
  temp_db <- temp_db %>%
    separate(merged_geo$lon_col[i], into = c("x", "y"), sep = ",") %>%
    mutate(x = str_remove(x, "c\\("), y = str_remove(y, "\\)")) %>%
    rename(species = merged_geo$botanical_col[i])
  assign(tolower(merged_geo$location[i]), temp_db)
  print(i)
  print(merged_geo$location[i])
}

# treat regular lat lon sources
for (i in 1:nrow(sep_geo)) {
  temp_db <- get(tolower(sep_geo$location[i]), envir = globalenv()) %>%
    dplyr::select(matches(paste0(sep_geo$botanical_col[i], "|", sep_geo$lon_col[i], "|", sep_geo$lat_col[i]), ignore.case = TRUE))
  temp_db <- temp_db %>%
    rename(species = sep_geo$botanical_col[i],
           x = sep_geo$lat_col[i],
           y = sep_geo$lon_col[i])
  assign(tolower(sep_geo$location[i]), temp_db)
  print(i)
  print(sep_geo$location[i])
}

# bring it all together into one df
all_trees <- tibble()
for (i in 1:nrow(tree_dbs)) {
  temp_db <- get(tolower(tree_dbs$location[i]), envir = globalenv()) %>%
    dplyr::select(x, y, species)
  temp_db <- temp_db %>%
    mutate(x = as.numeric(x)) %>%
    mutate(y = as.numeric(y)) %>%
    mutate(db_city_origin = tree_dbs$location[i])
  assign(tolower(tree_dbs$location[i]), temp_db)
  all_trees <- bind_rows(all_trees, temp_db)
  print(i)
  print(tree_dbs$location[i])
}

all_trees_centroids <- all_trees %>%
  st_as_sf(coords = c("y", "x"), na.fail = F) %>%
  group_by(db_city_origin) %>%
  summarise(st_union(geometry)) %>%
  st_centroid()

#### save it all on disk, 4 million trees
write_csv(all_trees, "2_Data/1_output/all_merged.csv")

#"specie|especi|nom_lat|latboomsoort|espece|
#nom_cientific|wetenschappelijke_naam|classe|dendro_taxon_tid|
# nome_scien|traeart|SPECIE|sortiment|essence_scient|gattung_lat|species|gattung|geo_point"

# all tree lists saved as csv in data folder. why csv?
# cause majority of opendata portal tree lists were in csv so its least cumbersome conversion
# also csv means less characters and therefore less data when things become very large.
This writes a file with only botanical name, X, Y and source database for the next step.
GBIF
The largest share of our tree data comes from GBIF. We simply query a bounding box of Europe for all species on our master list, which is already enriched with GBIF taxon IDs. You can regenerate the dataset with the most recently added tree occurrences using the snippet below (GBIF takes a while to prepare the dataset, and you need to have the environment variables GBIF_EMAIL, GBIF_PWD and GBIF_USER configured):
gbif_download <- occ_download(
  pred_in("taxonKey", tree_master_list$gbif_taxo_id),
  #pred("taxonKey", 5284884),
  # this is the bounding box of europe
  pred_within("POLYGON((-15 75,-15 30,40 30,40 75,-15 75))"),
  pred_lt("coordinateUncertaintyInMeters", 1000), # downstream processing needs 1km accuracy
  pred("hasCoordinate", TRUE),
  pred("hasGeospatialIssue", FALSE), # remove GBIF default geospatial issues
  pred("occurrenceStatus", "PRESENT"),
  pred_gte("year", 1960), # only keep trees seen after 1960
  format = "SIMPLE_CSV"
)
#occ_download_wait(gbif_download)
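Once GBIF has prepared the download, it can be fetched and imported with the standard rgbif helpers; a minimal sketch (the local target path is an assumption):

# wait for GBIF to finish preparing the download, then fetch and import it
occ_download_wait(gbif_download)
gbif_zip <- occ_download_get(gbif_download, path = "2_Data/0_raw_data/gbif")  # path is an assumption
gbif_trees <- occ_download_import(gbif_zip)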
Getting Bioclimatic Variables
We tested CHELSA, WorldClim and Copernicus CDS as sources of bioclimatic variables. We prefer Copernicus for its extensive documentation and the convenient time frame of its past bioclimatic data (1979-2018), which is where we expect the largest overlap with the life spans of the trees in our occurrence dataset.
Our functions getpastclimate() and getfutureclimate() could in theory be used to switch between bioclimatic data sources, but we have not implemented the alternatives. These functions are used to read our Copernicus raster files.
Reading Climate Rasters into R
In order to extract bioclimatic variables from each species location, we’ll have to load raster files into R. The function described here will solve a few complexities:
We convert raster values into units fit for analysis.
Stacked rasters need to be treated accordingly.
Efficiency matters greatly here, especially when it comes to RAM-efficient raster reading.
What could, but probably should not, be done here: CRS reprojections. Sometimes a source delivers a different CRS; reproject outside of your ETL pipeline and then read the result (see the sketch below).
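A minimal sketch of such a one-off reprojection with terra (the file names are placeholders):

library(terra)

# one-off reprojection, done outside the ETL pipeline: read the raster in its
# original CRS, reproject to EPSG:4326 and write the result back to disk
r <- rast("some_raster_original_crs.tif")   # placeholder input file
r_4326 <- project(r, "EPSG:4326")
writeRaster(r_4326, "some_raster_epsg4326.tif", overwrite = TRUE)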
Code
##################### Function to Get climate Rasters ##################
# In theory they can easily be adjusted to other climate raster providers.
# For some, snippets exist in the functions already but probably won't work
# in the entire workflow of treeful. All processing here is only tested with Copernicus.
# You pass a provider and a bioclimatic var like bio01 or bio13 to the function
# and it returns one raster. You can stack them later on.

getpastclimate <- function(source = "copernicus", bioclim = "bio01") {
  if (source == "climateeu") {
    bio_path <- case_when(bioclim == "bio01" ~ "MAT",
                          bioclim == "bio12" ~ "MAP")
    bio_raster <- raster(paste0("data/climateEU/Normal_1961-1990_bioclim/", bio_path, ".asc"))
    raster::crs(bio_raster) <- "+proj=aea +lat_0=30 +lon_0=10 +lat_1=43 +lat_2=62 +x_0=0 +y_0=0 +ellps=intl +units=m +no_defs +type=crs"
    bio_raster <- raster::projectRaster(bio_raster, crs = 4326)
  }
  if (source == "copernicus") {
    # Get bioclimate data from copernicus. Download bioclimate file with login at https://cds.climate.copernicus.eu/
    bio_path <- toupper(bioclim)
    bio_raster <- terra::rast(paste0("2_Data/0_raw_data/past/", bio_path, "_era5-to-1km_1979-2018-mean_v1.0.nc"))
    # convert bioclim as per copernicus documentation. for some reason case_when does not work here.
    if (bioclim %in% c("bio01", "bio05", "bio06", "bio08", "bio09", "bio10", "bio11")) {
      bio_raster <- bio_raster - 273.15
    } else if (bioclim == "bio12") {
      bio_raster <- bio_raster * 3600 * 24 * 365 * 1000
    } else if (bioclim %in% c("bio13", "bio14")) {
      bio_raster <- bio_raster * 3600 * 24 * 30.5 * 1000
    } else if (bioclim %in% c("bio16", "bio17", "bio18", "bio19")) {
      bio_raster <- bio_raster * 3600 * 24 * 91.3 * 1000
    }
    # a bit unclear if bio13-bio19 can and should also be converted like bio12.
    # probably not, as they are not on an annual reference period.
  }
  # worldclim would be great because it fetches all 19 bioclimatic vars at once and has a great time range from 1970-2000.
  # but the units of each var are strange.
  if (source == "worldclim") {
    bio_path <- case_when(bioclim == "bio01" ~ "bio_1",
                          bioclim == "bio02" ~ "bio_2",
                          bioclim == "bio03" ~ "bio_3",
                          bioclim == "bio04" ~ "bio_4",
                          bioclim == "bio05" ~ "bio_5",
                          bioclim == "bio06" ~ "bio_6",
                          bioclim == "bio07" ~ "bio_7",
                          bioclim == "bio08" ~ "bio_8",
                          bioclim == "bio09" ~ "bio_9",
                          bioclim == "bio10" ~ "bio_10",
                          bioclim == "bio11" ~ "bio_11",
                          bioclim == "bio12" ~ "bio_12",
                          bioclim == "bio13" ~ "bio_13",
                          bioclim == "bio14" ~ "bio_14",
                          bioclim == "bio15" ~ "bio_15",
                          bioclim == "bio16" ~ "bio_16",
                          bioclim == "bio17" ~ "bio_17",
                          bioclim == "bio18" ~ "bio_18",
                          bioclim == "bio19" ~ "bio_19")
    bio_raster <- raster(paste0("2_Data/1_output/worldclim_cropped/wc2.1_30s_", bio_path, ".tif"))
  }
  if (source == "chelsa") {
    bio_path <- case_when(bioclim == "bio01" ~ "bio1",
                          bioclim == "bio12" ~ "bio12")
    bio_raster <- raster(paste0("2_Data/1_output/CHELSA_cropped/CHELSA_", bio_path, "_1981-2010_V.2.1.tif"))
    if (bioclim == "bio01") {
      bio_raster <- raster::calc(bio_raster, function(x) { x / 10 - 273.15 })
    } else if (bioclim == "bio12") {
      bio_raster <- raster::calc(bio_raster, function(x) { x / 10 })
    }
  }
  return(bio_raster)
  rm(bio_raster)
}

getsoilproperties <- function(variable = "STU_EU_DEPTH_ROOTS") {
  soil_layer <- terra::rast(paste0("2_Data/0_raw_data/soil/", variable, "_4326.tif"))
  return(soil_layer)
  rm(soil_layer)
}
Merging Tree Locations and Bioclimatic Variables
We now have millions of tree locations and rasters of past bioclimatic variables. We extract each bioclimatic variable for each tree location. For that, we first merge all tree data sources into a large simple feature object.
Code
#################### Load DBs, merge, turn into SF, extract bioclim ############################
if (!file.exists("2_Data/1_output/tree_db.csv")) {

  ######################## turn name-matched data.tables into SF ###################
  trees4f_sf <- trees4f_db_selection %>%
    sf::st_as_sf(coords = c("x", "y"), crs = 3035) %>%
    sf::st_transform(crs = 4326) %>%
    mutate(db = "trees4f")

  open_trees_sf <- open_trees_db_selection %>%
    filter(!is.na(x) & !is.na(y)) %>%
    st_as_sf(coords = c("y", "x"), crs = 4326) %>%
    mutate(db = "cadastres")

  try_trees_sf <- try_trees_selection %>%
    filter(!is.na(tree_georef_1_std_value) & !is.na(obs_data_std_value)) %>%
    st_as_sf(coords = c("obs_data_std_value", "tree_georef_1_std_value"), crs = 4326) %>%
    mutate(db = "try")

  gbif_trees_sf <- gbif_trees %>%
    dplyr::select(taxonkey, x = decimallatitude, y = decimallongitude) %>%
    left_join(dplyr::select(tree_master_list, master_list_name = latin_name, gbif_taxo_id),
              by = c("taxonkey" = "gbif_taxo_id")) %>%
    filter(!is.na(x) & !is.na(y) & !is.na(master_list_name)) %>%
    st_as_sf(coords = c("y", "x"), crs = 4326) %>%
    mutate(db = "gbif")

  sendstatus("got all dbs and turned them into sf")
  rm(gbif_trees, trees4f_db_selection, open_trees_db_selection, try_trees_selection)

  ######################## bind all three sources into one ####################################
  tree_dbs <- rbind(as.data.table(trees4f_sf)[, .(master_list_name, db, geometry)],
                    as.data.table(open_trees_sf)[, .(master_list_name, db, geometry)],
                    as.data.table(try_trees_sf)[, .(master_list_name, db, geometry)],
                    as.data.table(gbif_trees_sf)[, .(master_list_name, db, geometry)])
  rm(gbif_trees_sf, open_trees_sf, try_trees_sf, trees4f_sf, try_species)
  gc()

  ######################### The heart of it all: getting bioclimatic vars for each tree ##########
  sendstatus(paste0("Cutting down tree db to species with n>1500, currently at ", nrow(tree_dbs)))
  tree_count <- tree_dbs %>%
    group_by(master_list_name) %>%
    summarise(n = n()) %>%
    filter(n > 1500)
  tree_dbs <- tree_dbs %>%
    filter(master_list_name %in% tree_count$master_list_name)
  gc()
  sendstatus(paste0("Cut small species occurrences. Starting extraction for ", nrow(tree_dbs), " tree occurrences"))
  tree_dbs <- tree_dbs %>%
    st_as_sf(crs = 4326)

  bioclim_stack <- c(getpastclimate(source = "copernicus", bioclim = "bio01"),
                     getpastclimate(source = "copernicus", bioclim = "bio02"),
                     getpastclimate(source = "copernicus", bioclim = "bio03"),
                     getpastclimate(source = "copernicus", bioclim = "bio04"),
                     getpastclimate(source = "copernicus", bioclim = "bio05"),
                     getpastclimate(source = "copernicus", bioclim = "bio06"),
                     getpastclimate(source = "copernicus", bioclim = "bio07"),
                     getpastclimate(source = "copernicus", bioclim = "bio08"),
                     getpastclimate(source = "copernicus", bioclim = "bio09"),
                     getpastclimate(source = "copernicus", bioclim = "bio10"),
                     getpastclimate(source = "copernicus", bioclim = "bio11"),
                     getpastclimate(source = "copernicus", bioclim = "bio12"),
                     getpastclimate(source = "copernicus", bioclim = "bio13"),
                     getpastclimate(source = "copernicus", bioclim = "bio14"),
                     getpastclimate(source = "copernicus", bioclim = "bio15"),
                     getpastclimate(source = "copernicus", bioclim = "bio16"),
                     getpastclimate(source = "copernicus", bioclim = "bio17"),
                     getpastclimate(source = "copernicus", bioclim = "bio18"),
                     getpastclimate(source = "copernicus", bioclim = "bio19"))
  sendstatus("read and stacked bioclim rasters. starting bioclim extraction")
  tree_dbs <- tree_dbs %>%
    mutate(terra::extract(bioclim_stack, ., ID = F))
  rm(bioclim_stack)
  gc()

  ##### soil data extraction begin
  soil_stack <- c(getsoilproperties("STU_EU_DEPTH_ROOTS"),
                  getsoilproperties("STU_EU_T_CLAY"),
                  getsoilproperties("STU_EU_S_CLAY"),
                  getsoilproperties("STU_EU_T_SAND"),
                  getsoilproperties("STU_EU_S_SAND"),
                  getsoilproperties("STU_EU_T_SILT"),
                  getsoilproperties("STU_EU_S_SILT"),
                  getsoilproperties("STU_EU_T_OC"),
                  getsoilproperties("STU_EU_S_OC"),
                  getsoilproperties("STU_EU_T_BD"),
                  getsoilproperties("STU_EU_S_BD"),
                  getsoilproperties("STU_EU_T_GRAVEL"),
                  getsoilproperties("STU_EU_S_GRAVEL"),
                  getsoilproperties("SMU_EU_T_TAWC"),
                  getsoilproperties("SMU_EU_S_TAWC"),
                  getsoilproperties("STU_EU_T_TAWC"),
                  getsoilproperties("STU_EU_S_TAWC"))
  sendstatus("read and stacked soil rasters. starting soil extraction")
  tree_dbs <- tree_dbs %>%
    mutate(terra::extract(soil_stack, ., ID = F)) %>%
    mutate(across(.cols = starts_with(c("BIO", "STU", "SMU")), ~round(.x, digits = 2), .names = "{.col}"))
  rm(soil_stack)
  gc()

  # tree_dbs <- tree_dbs %>%
  #   st_drop_geometry()

  ################################ write it all to csv #################################
  sendstatus("saving DB to disk")
  data.table::fwrite(x = tree_dbs, file = "2_Data/1_output/tree_db.csv")
} else {
  sendstatus("tree db exists, reading from disk\n")
  tree_dbs <- fread("2_Data/1_output/tree_db.csv")
}

con <- backend_con()
if (!RPostgres::dbExistsTable(conn = con, name = "tree_dbs")) {
  # writing trees to postgres DB
  sendstatus("writing tree db to postgres")
  con <- backend_con()
  sf::st_write(tree_dbs, dsn = con, table = "tree_dbs", append = FALSE)
}

tree_db_sample_size <- group_by(tree_dbs, master_list_name) %>%
  summarise(n = n())
tree_master_list <- fread("2_Data/1_output/eu_native_trees_master.csv") %>%
  left_join(tree_db_sample_size, by = c("latin_name" = "master_list_name")) %>%
  filter(n > 1500)
sendstatus(paste0("Had ", nrow(tree_db_sample_size), " species, now down to ", nrow(tree_master_list)))
fwrite(tree_master_list, "2_Data/1_output/eu_native_trees_master.csv")
rm(tree_db_sample_size, tree_master_list)
gc()
DBI::dbDisconnect(conn = con)
##### EOF ####
What we found to be fastest and least memory-intensive is stacking all rasters first and then running terra::extract() once.
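Stripped down to its core, the pattern looks roughly like this (the file names and toy coordinates are placeholders):

library(terra)

# stack all layers first (terra keeps them on disk until values are needed),
# then extract every variable for every point in a single pass
bio01 <- rast("2_Data/0_raw_data/past/BIO01_era5-to-1km_1979-2018-mean_v1.0.nc")
bio12 <- rast("2_Data/0_raw_data/past/BIO12_era5-to-1km_1979-2018-mean_v1.0.nc")
bioclim_stack <- c(bio01, bio12)

pts <- vect(data.frame(lon = c(10.5, 12.3), lat = c(51.8, 47.2)),  # toy coordinates
            geom = c("lon", "lat"), crs = "EPSG:4326")
vals <- extract(bioclim_stack, pts, ID = FALSE)  # one row of variables per point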
This section explains how to make your species distribution data more user-friendly. For a specialist audience, you may be able to expose just the botanical names. Here we add descriptions and images for each species with a simple Wikipedia scraper.
Code
library(librarian)
shelf(rvest)

# load list with botanical names and taxonomy IDs
tree_master_list <- fread("2_Data/1_output/eu_native_trees_master.csv")

# a simple function to search wikipedia for botanical names
wikipedia <- function(search_terms, lang = c("en", "de", "es", "fr")) {
  # if system language contains "en" use English Wikipedia version
  if (grepl("en", lang)) {
    return(paste0("https://en.wikipedia.org/w/index.php?search=", URLencode(search_terms)))
  }
  # if system language contains "de" use German Wikipedia version
  else if (grepl("de", lang)) {
    return(paste0("https://de.wikipedia.org/w/index.php?search=", URLencode(search_terms)))
  }
  # if "lang" is not specified and default system language is not
  # English, German, Spanish or French, use the English version
  else {
    return(paste0("https://en.wikipedia.org/w/index.php?search=", URLencode(search_terms)))
  }
}

# scrape wikipedia for first paragraph and image
for (i in 1:nrow(tree_master_list)) {
  tree_url <- wikipedia(tree_master_list$latin_name[i], "de")
  tree_page <- tree_url %>% rvest::read_html()
  tree_tribble <- tibble::tribble(
    ~descr_de, ~url, ~image_url,
    rvest::html_text(rvest::html_element(tree_page, xpath = "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]"), trim = TRUE),
    tree_url,
    rvest::html_attr(html_element(tree_page, xpath = "/html/body/div[3]/div[3]/div[5]/div[1]/table[1]/tbody/tr[2]/td/span/a/img"), "src")
  )
  tree_master_list$descr_de[i] <- tree_tribble$descr_de
  tree_master_list$url[i] <- tree_tribble$url
  tree_master_list$image_url[i] <- tree_tribble$image_url
  cat("fetched ", tree_tribble$descr_de, " \n for ", tree_master_list$latin_name[i])
  # be kind to wikipedia
  Sys.sleep(1.2)
}

tree_master_list <- tree_master_list %>%
  mutate(image_url = stringr::str_remove(image_url, "^\\/\\/"))

sendstatus("writing master list with descriptions to postgres")
fwrite(tree_master_list, "2_Data/1_output/eu_native_trees_master.csv")
con <- backend_con()
RPostgres::dbWriteTable(con, "tree_master_list", tree_master_list, overwrite = TRUE)
DBI::dbDisconnect(conn = con)
Calculate Percentile Ranges from Mean
We used the sampled and well-documented EU Forest dataset to calculate percentile ranges. Bioclimatic variables of tree occurrences do not follow a normal distribution, hence we calculate the distance from the mean for each species and variable. The EU Forest dataset is sampled and contains enough occurrences, while occurrences from crowd-sourced efforts (such as GBIF) are more prone to distorted data (showing where trees have been mapped rather than where trees occur). We used a cutoff of n > 200 occurrences from EU Forest, as per this source. Below are histograms for a few sample bioclimatic variables: BIO1 (Annual Mean Temperature), BIO5 (Max Temperature of Warmest Month), BIO6 (Min Temperature of Coldest Month), BIO8 (Mean Temperature of Wettest Quarter), BIO9 (Mean Temperature of Driest Quarter), BIO12 (Annual Precipitation), BIO13 (Precipitation of Wettest Month), BIO14 (Precipitation of Driest Month).
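The exact implementation lives in the repository; the following is only a minimal sketch of the idea, assuming a data frame tree_db with one row per occurrence, a source column db, and one column per bioclimatic variable (all of these names are assumptions):

library(dplyr)

# for each species and bioclimatic variable, keep species with more than 200
# EU Forest occurrences and compute the mean plus percentile ranges around it
bio_percentiles <- tree_db %>%               # assumed input data frame
  filter(db == "eu_forest") %>%              # assumed source column
  group_by(master_list_name) %>%
  filter(n() > 200) %>%
  summarise(across(starts_with("BIO"),
                   list(mean = ~mean(.x, na.rm = TRUE),
                        p05  = ~quantile(.x, 0.05, na.rm = TRUE),
                        p25  = ~quantile(.x, 0.25, na.rm = TRUE),
                        p75  = ~quantile(.x, 0.75, na.rm = TRUE),
                        p95  = ~quantile(.x, 0.95, na.rm = TRUE))))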
Tables created in this ETL pipeline are written to a PostGIS DB. To make the raster files available to Shiny you have two options (in practice, really just one):
Option 1: Writing Raster files to PostGIS DB
The bioclimatic and soil rasters used in this project are several GB in size and usually do not fit into memory. This snippet reads them as a raster stack, writes the stack to the PostGIS DB and then removes it from memory.
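The snippet is not reproduced here; a minimal sketch of the approach using rpostgis (the connection helper, file pattern and table name are assumptions):

library(terra)
library(rpostgis)

# read the bioclimatic layers as one stack and push the whole stack to PostGIS
bioclim_stack <- rast(list.files("2_Data/0_raw_data/past", pattern = "\\.nc$", full.names = TRUE))

con <- backend_con()  # assumed helper returning an RPostgres connection
pgWriteRast(con, name = c("public", "bioclim_rasters"),  # assumed table name
            raster = bioclim_stack, overwrite = TRUE)
DBI::dbDisconnect(con)
rm(bioclim_stack)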
Warning
This option relies on the package rpostgis. Currently there is no other simple way to write raster data to PostGIS from R; neither sf nor terra nor stars supports it. See this discussion. We therefore switched to the approach below: the Shiny app reads values directly from raster files on disk, which is slightly less performant for our use case.
Option 2: Mounting Raster Files into the Container
The raster files downloaded from Copernicus CDS can simply be mounted into the Docker container. See our docker-compose for details on this.
Shiny App Development
The Shiny app itself is developed as a package using the golem framework. Most of the server and UI code is self-explanatory.
Database Deployment
We use the docker image postgis/postgis and only add the raster extension to it with a short init script.
Code
FROM postgis/postgis

# need to enable raster extensions
COPY ./initialize-raster.sh /docker-entrypoint-initdb.d/initialize-raster.sh

# CREATE EXTENSION postgis_raster;
# SET postgis.gdal_enabled_drivers = 'ENABLE_ALL';
# SELECT name, default_version, installed_version
# FROM pg_available_extensions WHERE name LIKE 'postgis%' or name LIKE 'address%';
Shiny App Deployment
Running the package barebone in a Docker container
This approach means the R process will simply launch the shiny app. For a more robust deployment, use Shiny Server.
Generate the Dockerfile from the app. golem handles this with golem::add_dockerfile(), but you need to edit the resulting Dockerfile by hand. When adding new packages, ensure they are installed at the end of the Dockerfile in order to leverage build caching (see the sketch below).
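A minimal sketch of generating that starting point from the package root (the base image override is an assumption, not necessarily what this project uses):

# generate a Dockerfile from the package DESCRIPTION; golem turns the listed
# dependencies into install steps. Edit the result by hand afterwards.
golem::add_dockerfile(from = "rocker/shiny-verse:latest")  # 'from' override is an assumption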
On your host: clone the repository with git clone https://github.com/3ful/treeful.git and move to the directory with cd ./treeful
Run stack with swarm and compose
Launch your docker swarm: docker swarm init
If this is the first run, the PostGIS DB will be empty and you need to supply an initial user password. While the ETL is running, the Shiny app will not find any data.
In order to serve the same image to all nodes you need a docker registry. Create one with: docker service create --name registry --publish published=5000,target=5000 registry:2
Push images to registry: docker compose push
Now, deploy your swarm: docker stack deploy --compose-file docker-compose.yml treeful
Now you can easily scale up your shiny app to 3 replicas with docker service scale treeful_frontend=3
You can watch visitors to your frontend app with docker service logs treeful_frontend -f
Running in Shiny Server
This is the better option for serving the app to multiple users. Our Dockerfile in 3_shiny_frontend uses this approach. Important aspects:
You can use rocker images, such as shiny-verse.
Make the folder /srv/shiny-server writable by the shiny user; otherwise the cache folder cannot be written and performance might suffer.
When building the frontend container: the database container needs to be running and you need to make sure secrets are mounted and available during the package install. Remember, the shiny app is a package.
golem::add_shinyserver_file() will generate an app.R file, which Shiny Server needs.
Running in Shinyproxy
We abandoned this option. The main reason was that ShinyProxy spins up a container for each user; since this app requires a fair amount of RAM, it did not serve our needs. If you need authentication, however, this is your option.