Recommending plant taxa for supporting on-site species identification

Background: Predicting a list of plant taxa most likely to be observed at a given geographical location and time is useful for many scenarios in biodiversity informatics. Since efficient plant species identification is impeded mainly by the large number of possible candidate species, providing a shortlist of likely candidates can help significantly expedite the task. Whereas species distribution models heavily rely on geo-referenced occurrence data, such information still remains largely unused for plant taxa identification tools.

Results: In this paper, we conduct a study on the feasibility of computing a ranked shortlist of plant taxa likely to be encountered by an observer in the field. We use the territory of Germany as case study with a total of 7.62M records of freely available plant presence-absence data and occurrence records for 2.7k plant taxa. We systematically study achievable recommendation quality based on two types of source data: binary presence-absence data and individual occurrence records. Furthermore, we study strategies for aggregating records into a taxa recommendation based on location and date of an observation.

Conclusion: We evaluate recommendations using 28k geo-referenced and taxa-labeled plant images hosted on the Flickr website as an independent test dataset. Relying on location information from presence-absence data alone results in an average recall of 82%. However, we find that occurrence records are complementary to presence-absence data and using both in combination yields considerably higher recall of 96% along with improved ranking metrics. Ultimately, by reducing the list of candidate taxa by an average of 62%, a spatio-temporal prior can substantially expedite the overall identification problem.
Keywords: Plant identification, Location-based, Classification, Spatio-temporal context, Recommender system, Occurrence prediction, Plant distribution

*Correspondence: hans-christian.wittich@tu-ilmenau.de; patrick.maeder@tu-ilmenau.de
Institute for Computer and Systems Engineering, Technische Universität Ilmenau, Helmholtzplatz 5, 98693 Ilmenau, Germany
Full list of author information is available at the end of the article

Wittich et al. BMC Bioinformatics (2018) 19:190

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Accurate plant species identification represents the basis for all aspects of plant related research and is an important component of workflows in plant ecological research [1]. Numerous activities, such as studying the biodiversity richness of a region, monitoring populations of endangered species, determining the impact of climate change on species distribution, and weed control actions depend on accurate identification skills. They are a necessity for physiologists, pharmacologists, conservation biologists, technical personnel of environmental agencies, or just fun for laypersons [2–4]. Expediting the task and making it feasible for non-experts is highly desirable, especially considering the continuous loss of plant biodiversity [5] as well as the continuous loss of plant taxonomists [6]. The principal challenge in plant identification arises from the vast number of potential species. Even when narrowing the focus to the flora of a single country, thousands of species need to be discriminated. The flora of Germany exhibits about 3800 indigenous species [7], the British & Irish flora comprises around 3000 [8], and the flora of Northern America exhibits about 20,000 species of vascular plants [9].

However, most species are not evenly distributed throughout a larger region as they require more or less specific combinations of biotic and abiotic factors and resources to be present for their development. Therefore, plant species can be encountered within their specific ranges. The German Biodiversity Exploratories project [10] studied sites spanning an area of 422 km² to 1300 km² and found that on grassland sites 318 to 365 vascular plant species occurred [11], while on forest sites merely 277 to 376 species were present [12]. These figures represent less than 10% of the entire German flora. Knowing where species occur has long been of interest, dating back to Linné and Humboldt with mapping projects evolving in terms of coverage and level of detail over time. A geographic range map represents the area throughout which a species occurs, referred to as 'extent of occurrence' by the International Union for Conservation of Nature (IUCN). Using range maps as they appear in field guides to support manual species identification has been state-of-the-art for quite some time. However, species identification is changing and the usability of field guides has often been debated. Taking a user's current position in the field to estimate which species could possibly be encountered nearby can simplify identification tasks and is highly suitable given today's prevalence of mobile devices with self-localization technology.

In this paper, we study whether previously recorded occurrence information can be used to develop a recommendation system to significantly reduce the number of species for the identification task. Resulting recommendations could either be used on their own or be incorporated into species identification services to improve accuracy [13]. We conduct a systematic study on different data sources and aggregation strategies to evaluate how accurately taxa can be retrieved depending on location and time of a new observation. We select the territory of Germany as study region since its flora is particularly well described with curated, openly available databases. In particular, we use the following two sources of data. First, grid-based range maps published by the Federal Agency for Nature Conservation via the FLORKART project. Second, plant observations obtained from the Global Biodiversity Information Facility (GBIF), a service aiming to mobilize biodiversity data from museums, surveys, and other data sources by collating locally digitized and stored data in an online data search portal [14].

Previous research exists in two different research directions: species distribution modeling as well as automated species and object identification.

Species distribution modeling (SDM)

SDMs are associative models relating occurrence or abundance data of individual species at known locations to information on the environmental characteristics of those locations (modified from [15], [16]). Once trained, SDMs can predict suitable habitats for species based on the utilized environmental characteristics. While initial studies were mainly seeking insight into causal drivers of species distributions, recent studies focus on predicting distributions across landscapes to gain ecological and evolutionary insights that require extrapolation in space and time [15].

SDMs utilize occurrence data as answer set while training the model and identifying a characteristic set of predictor variables. This enables their application in areas that have not been intensively sampled or under hypothetically changing conditions, e.g., climate change. However, using a limited set of predictor variables often results in limited accuracy and spatial resolution. While these restrictions are acceptable for ecological and environmental research on larger scales, the problem we study requires spatially fine-grained estimations. Prediction results were found to strongly depend on sampling bias [17], sampling size [18, 19], and location uncertainty [20], decreasing the confidence in SDM results [21, 22]. Further challenges for SDMs include the improvement of methods for modeling presence-only data, model selection and evaluation as well as proper assessment of model uncertainty [23].

The Map of Life service uses SDM to provide certain species range maps for confined geographical areas. Different data sources such as expert species range maps, species occurrence records, and ecoregions are aggregated to describe species distributions worldwide [24]. However, the service is hardly of any use for the purpose of species identification since, for example, the whole area of Germany seems to be discretized into ≈ 25 tiles and the only retrieved plant species for this region are ten conifer species.

The Plant-O-Matic app utilizes SDM to predict a list of all plant species expected to occur at a user's location [25]. For its predictions, the approach uses a 100 × 100 km discretization grid and 3.6M observations of 89k non-cultivated plant species native in America. For rare species (30k) with only one or two observations the geographic range is defined as a 75,000 km² square area surrounding the occurrence locations. For 12k species with three to four observations, the range is defined as convex hull enveloping all occurrence points. For the remaining 45k species with more than five occurrences, range maps were predicted using the MaxEnt SDM [26]. MaxEnt uses 19 layers of world climate data and 19 spatial filters capturing the geometry of the studied areas as predictor variables. The approach predicts rather long and non-ranked species lists given the coarse-grained computational discretization and the sparse observation data.

Automated species and object identification

We found no study that utilizes the location of an observation to support the identification of unknown plant specimens despite intensive research and manifold studies in this area [27]. Previous studies largely focus on image recognition techniques for automated plant species identification [28], how those can be enhanced by careful selection of image types [29] and contextual information such as plant size [30]. However, there exists previous work on more general identification problems that utilizes location data.

Berg et al. used observation time and location of images for supporting automated bird species identification by computing spatio-temporal prior probabilities for the bird species' occurrences in North America [31]. Bird-sighting records are discretized into spatio-temporal cubes of 1° latitude-longitude and six days. The authors compute the prior at a given location and time as ratio of the estimated density of species observations and the estimated density of any observation at the same location and time. The authors used 75M bird-sighting records of 500 bird species originating from a citizen-science network. By combining image recognition and the spatio-temporal prior, top-5 accuracy of correctly identified bird specimens improved by 15% relatively (≈ 10% absolutely), indicating that the use of spatio-temporal priors can significantly support automated species identification.

Tang et al. studied the usage of location context for the problem of image classification for 100 location-sensitive classes such as 'Beach', 'Disneyland', and 'Mountain' [32]. They constructed high-dimensional (>80k) feature vectors representing contextual information about an image's location. These features are computed per image location and derived from five sources: (1) a 25 × 25 km grid-based discretization of the location (20k dim); (2) normalized pixel colors from 17 × 17 px patches of ten map types referring to average vegetation, congressional district, ecoregions, elevation, hazardous waste, land cover, precipitation, solar resource, total energy, and wind resource (9k dim); (3) regional statistics on age, sex, race, family and relationships, income, health insurance, education, veteran status, disabilities, work status, and living conditions (21k dim); (4) hashtag frequency on Instagram at 10 radii (2k dim); (5) visual context as probability of 594 common concepts appearing on a social media website at 10 radii (30k dim). Following a dimensional reduction, these context features are concatenated with the visual features and incorporated into a Convolutional Neural Network before its softmax layer. The authors report a 19% relative gain in mean average precision (7% absolute) and a 6% relative improvement of top-5 accuracy (4.5% absolute). Both studies clearly suggest that analyzing location and temporal context of an identification can substantially improve identification accuracy.

Our approach is unique in that it relies on actual observation data directly rather than inferring species distribution by means of a model taking these data as input for training. Being subject to model reliability and data quality issues [33], SDMs are used to predict a potential range whereas we base our estimation entirely on factual observations. Previous studies on automated species identification have shown the benefit of using location information for improving identification results. They did however not investigate the accuracy of ranked taxa recommendations retrieved directly from occurrence data. As such observation records are becoming increasingly available via online services, providing comprehensive sets of presence-absence as well as presence-only occurrence records, we argue that a systematic study is required that evaluates how spatio-temporal context information can be exploited to inform on-site plant species identification.

Methods

Study region and taxa

We use the territory of Germany as evaluation area for our study. Besides giving us the opportunity to test our estimations on site, Germany is representative of countries with well-documented species populations in range maps and specimen collections. Moreover, active groups of passionate professionals constantly contribute observation data [34].

In search of a complete species list, we decided to take the widely accepted list of ferns and vascular plants of Germany [35] collected by Wisskirchen and Haeupler [7] as a basis. The list was revised addressing the following two issues. First, some taxa are known to be exceptionally difficult to distinguish from each other, their identification relying on very special characters and often being impossible to accomplish in the field without a reference collection, even for experts. We subsumed 858 species belonging to five of these critical taxa [36] under their respective parent taxa Ranunculus auricomus, Rubus, Sorbus, Taraxacum, and Hieracium. Secondly, we excluded 251 hybrid species expected to cause inconsistent and unreliable identifications. Thus, our list is composed of 2,771 plant taxa containing 2,766 taxa at species level as well as four at genus level and one at aggregate level, all being treated as leaves of the taxonomic scheme in our study.

Grid-based presence-absence data

Grid-based presence-absence data stems from large-scale efforts to systematically map geographic regions. Being the most comprehensive data source for Germany and providing data for its entire area, we employ the FLORKART project. FLORKART is the result of cumulative mapping involving thousands of voluntary surveyors and literature reviews in several organizational subunits [37]. The data is freely accessible via the information system FloraWeb [38] run by the Federal Agency for Nature Conservation on behalf of the German Network for Phytodiversity (NetPhyD). In FLORKART, presence of a species is recorded on the basis of grid tiles, originally representing pages of 'Messtischblatt' (MTB) ordnance survey maps with a scale of 1:25,000. Each tile covers a section of 10' longitude × 6' latitude, corresponding to a surface area of approximately 118 km² in the north to 140 km² in the south of Germany. However, only 3.5% of FLORKART grid tiles are of this coarse-grained resolution, with many of them superseded. The majority of presence-absence information today is provided on the scale of quarter tiles, subdividing each MTB into four parts. In spite of the increased resolution each tile still only carries the binary information whether a species appears in it or not. Neither exact spatial coordinates of individual records nor frequency of a species' occurrence are known.

FLORKART has proven to be of significant value for biogeographical analyses and the quality of its data has been validated in numerous studies, e.g., [39, 40]. FLORKART contains records at all taxonomic levels, including subspecies and aggregates of species. For this study, records were revised in order to map them to our taxa list. In detail, records of child taxa, i.e., subspecies, forms and varieties of species, were included and subsumed under their respective parent taxon. As a result, our FLORKART dataset contains presence-absence data for the 2771 vascular plant taxa in our species list. On May 3rd and 4th 2017, we acquired a total of 6.59M records for these taxa across the 13k (quarter-)MTB tiles entirely covering Germany. We discarded records that were marked as 'questionable' or 'false' (15k records). The remaining data were collected during three time periods: before 1950, between 1950 and 1980, and 1980 until today. In those cases where FLORKART provides records for a coarse-grained tile as well as for sub-quadrants within the same tile, we always consider the newer and higher-resolution information. This leads to a total of 6,020,296 records in our dataset, with only 0.54% of those accounting for coarse-grained tiles and 0.9% accounting for data from before 1950. A median of 514 taxa occurs per grid cell, with the 10th percentile being 257 and the 90th percentile being 758 taxa. Figure 1 displays the spatial density of the records mapped to the area of Germany as well as coverage metrics of the FLORKART dataset.

Fig. 1 Characteristics of the FLORKART dataset: (a) spatial density of occurrence records per grid cell across all taxa; (b) average distance to nearest neighbor occurrence per taxon, average over all taxa marked by red line; (c) frequency distribution of occurrence records per grid cell

Point-based occurrence records

We use the Global Biodiversity Information Facility (GBIF) as the most prominent and comprehensive data source for querying point-based occurrence records for Germany. An occurrence denotes one observation record of a certain plant and contains information on the taxonomic description, geographic location, observation type, and often also the observation time and date. The GBIF web service aggregates occurrence records of numerous types, from historic herbarium specimens to citizen science projects, e.g., hobbyists sharing geo-tagged species photos. The data differs considerably from the grid-based records described above in that it represents presence-only records being largely non-curated and collected unsystematically at arbitrary locations.

We queried GBIF via the website's occurrences search interface, restricting records to the area of Germany and the biological kingdom of Plantae. All queries [41] were executed on August 23, 2017. The point-based occurrence records of interest for our study stem from 1324 datasets coming from 484 institutions with the largest contributor 'Naturgucker' providing 27% of the records. We sanitized the data and filtered out invalid geographical locations, i.e., missing or implausible coordinates as well as entries with abnormally poor spatial accuracy. We mapped the taxa in our list to the GBIF taxonomic backbone using the 'species.search' method of the GBIF API [42]. For every taxon, the query contained the accepted scientific name as well as synonyms, both including the author(s) describing the taxon. Approximate string matching was applied if the author naming was following a different convention, e.g., abbreviations.

As a result, this process led to a total of 1,598,550 occurrence records for 2,640 out of the 2771 taxa of interest in our study. The records contain a median number of 83 observations per taxon, with a 10th percentile of 4 and a 90th percentile of 1,817 observations per taxon. 86% of these records include plausible timestamps, e.g., they do not use default dates like January 1st 1970, and are distributed as visualized in Fig. 2(b) and (e). While single records date back to the year 1768 (i.e., herbarium specimens), 99% of the records with plausible timestamp are from 1950 and later.

In order to better understand how the retrieved GBIF records are distributed across Germany, we calculated per taxon the average distance between each observation and its closest neighbor (see Fig. 2(c)). Lower values indicate a spatial clustering of records, while higher values show dispersion of records. For comparison, we computed the same metric for the grid-based FLORKART data (see Fig. 1(b)). The average closest neighbor distance across all taxa in the GBIF dataset is 21.9 km, while the corresponding value is only 17.9 km for the FLORKART dataset. The figures and metrics illustrate the irregular distribution of records and gaps between records across the whole study region.

Fig. 2 Characteristics of the GBIF dataset: (a) spatial density of occurrence records per grid cell across all taxa; (b) record distribution per month of observation; (c) average distance to nearest neighbor occurrence per taxon; (d) frequency distribution of occurrence records per grid cell; (e) record distribution per year of observation

We discretized record locations into a regular computational grid with each cell spanning 30" longitude × 18" latitude. This discretization was chosen to provide a resolution 100 times higher than FLORKART's quarter tiles and results in cells of ≈ 0.33 km² each. We study the impact of the computational grid's resolution in its own subsection below. Only 20% of the grid's cells are occupied by GBIF records with a median of 4 occurrences, the 10th percentile being 1 and the 90th percentile being 56 records. The record frequency per occupied cell is heavily unbalanced with 50% of all occurrence records being concentrated in merely 0.8% of the occupied cells (cp. Fig. 2(d)). Figure 2(a) visualizes occurrences' spatial density on a map of Germany with a circle depicting each record and its given accuracy and each colored pixel representing a computational grid cell. The map shows that even though records are sparse and irregularly distributed, they are spread across all parts of Germany. When classifying record locations in terms of land cover [43], 23% are on non-irrigated arable land, 16% on pastures, 15% in broad-leaved forests, 14% in coniferous forests, and 10% on discontinuous urban fabric.

Independent test dataset

For obtaining an independent test set of occurrence data, we used the image hosting and social media website Flickr [44], a platform where users can upload and share personal photographs. We selected this service specifically because the uploaded images show what people actually 'see' and are interested in. We argue that this will to a large extent correlate with plant species people are interested in identifying and recording during their daily life. We used the Flickr API's 'photos.search' method to identify geotagged images labeled with the scientific name or an accepted synonym of the 2771 taxa considered in our study. From the images' metadata we extracted the timestamp and the location of acquisition. This process resulted in 28,226 records for 1271 of the 2771 studied taxa. The summarized statistics are displayed in Fig. 3. In terms of geographical coverage across Germany, the test data is very sparse. Merely 0.69% of the computational grid cells as defined above are occupied, having a median of 1 and a maximum of 1,127 records each. The number of records per occupied grid cell is biased, concentrated mainly around major urban areas and points of interest, but resembles that of GBIF (cp. Fig. 3(d) with Fig. 2(d)). Regarding land cover, most record locations (24%) are on discontinuous urban fabric, 19% on non-irrigated arable land, 12% on pastures, 9% in broad-leaved forests, and 9% in coniferous forests. Another indication of this dataset's highly scattered geographical locations is given by the average nearest neighbor distances (see Fig. 3(c)) showing that data records exist on average only every 128.2 km. For a graphical overview of occurrences' spatial density and the amount of geographical coverage see Fig. 3(a).

Fig. 3 Characteristics of the Flickr test data: (a) spatial density of occurrence records per grid cell across all taxa; (b) record distribution per month of observation; (c) average distance to nearest neighbor occurrence per taxon; (d) frequency distribution of occurrence records per grid cell; (e) record distribution per year of observation

Problem formalization and aggregation strategies

Given an observer's location $p \in P$ as geographic coordinates and date of observation $d$, we determine the candidate subset $T_{p,d} \subseteq T$ of all known taxa $T$ that is most likely to be encountered by the observer. We hypothesize that spatial and temporal distance to registered occurrence records affects an observer's chance to encounter the same taxa at their current location in the field. Therefore, we assign each taxon $t \in T_{p,d}$ a score $S_{t,p,d}$ reflecting its chance of being encountered at $p$ and $d$.

$$T_{p,d} = \{\, t_i \in T \mid S_{t_i,p,d} > 0 \,\}$$

The result will be a list of taxa, ranked based on scores. Hence, we denote a taxon's rank by $r$ and define the resulting ranked list of candidates $T_{p,d}$ as:

$$T_{p,d} = \{\, (t, r) : t \in T_{p,d},\ r \in \mathbb{N} : r \in [1, |T_{p,d}|] \,\},$$
$$\forall t_i, t_j, r_i, r_j : (t_i, r_i) \in T_{p,d} \wedge t_i = t_j \rightarrow (t_j, r_j) \notin T_{p,d},$$
$$\forall (t_i, r_i), (t_j, r_j) \in T_{p,d} : S_{t_i,p,d} \geq S_{t_j,p,d} \rightarrow r_i < r_j.$$

For our test region of Germany we study the quality of ranked candidate lists $T_{p,d}$ by evaluating them based on the test data introduced above. Test records $n = 1 \ldots N$ are represented as a tuple containing the location $p_n$, the observation date $d_n$ and the labeled taxon $t_n$. We let $T_n = T_{p_n,d_n}$ for all $(p_n, d_n, t_n)$ in our set of test records with $n$ representing the index of the test query.

Evaluation metrics

We aim to assess computed candidate subsets $T_n$ in terms of completeness, compactness, and efficiency of the ranking and therefore introduce the following five metrics.

(1) Average recall $R$ measures the ratio of correctly retrieved test records in relation to all test records and is computed as

$$R = \frac{1}{N} \sum_{n=1}^{N} R_n, \quad \text{with } R_n = \begin{cases} 1, & \text{if } t_n \in T_n \\ 0, & \text{if } t_n \notin T_n \end{cases} \tag{1}$$

Average recall is not only computed for the whole retrieved list but also for subsets thereof, assessing completeness up to specific list positions. $R_k$ refers to the average recall up to rank $k$ and is computed by cutting off the list of results after the $k$-th position and calculating the average recall on the remaining sublist (cp. Eq. 1). We report $R_k$ for $k = \{20, 514\}$ with 20 items referring to a user-friendly shortlist of recommendations and 514 being the median number of taxa present per FLORKART grid tile, reflecting the average number of taxa occurring in a local region.

(2) Average list length $LL$ measures the average number of retrieved candidate taxa across all $N$ test records and is computed as

$$LL = \frac{1}{N} \sum_{n=1}^{N} |T_n|. \tag{2}$$

(3) Average list reduction $LR$ measures across all $N$ test records the number of retrieved candidate taxa in $T_n$ in relation to the number of all known taxa $T$. We introduce this metric to better understand to what extent the identification problem can be simplified by reducing the number of potential taxa. Based on the total amount of taxa $|T|$ and the number of taxa retrieved with the $n$th test query $|T_n|$, $LR$ is computed as

$$LR = \frac{1}{N} \sum_{n=1}^{N} \frac{|T_n|}{|T|}. \tag{3}$$

(4) Mean reciprocal rank $MRR$ measures the ranking quality of retrieved candidate lists for a set of test records. The reciprocal rank is the multiplicative inverse of rank $r_n$ of the correct taxon for the $n$th test query and $MRR$ is the average of reciprocal ranks for the whole test set of $N$ queries. A taxon's reciprocal rank equals 0 if it is not on the retrieved list $T_n$. $MRR$ is computed as:

$$MRR = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{r_n}, \quad \text{with } (t_n, r_n) \in T_n. \tag{4}$$

(5) Median rank $M$ measures the rank which at least half of selected taxa are ranked higher than and therefore provides an indication of the results' compactness. Similar to $MRR$, it aims to judge the quality of the ranking and where in the ranked list the correct taxa appear after ranking. It is computed as

$$M = \min \left\{ s \in \mathbb{N} : \sum_{r=1}^{s} \sum_{n=1}^{N} \left| \{(t_n, r)\} \cap T_n \right| \geq \frac{1}{2} \sum_{r=1}^{|T|} \sum_{n=1}^{N} \left| \{(t_n, r)\} \cap T_n \right| \right\} \tag{5}$$

We define five strategies for aggregating multiple grid tiles and records per taxon depending on their spatial and temporal characteristics.

Retrieval from grid-based presence-absence data

In a first set of experiments, we evaluate presence-absence data of the grid tile containing the test location $p \in P$ and, depending on a variable radius parameter, also those in its vicinity to compute a set of candidate taxa at a given test location. Since it is not clear how accurate and up-to-date the available data is, we study how sampling within a circle around a test point with four increasing radii (1 km, 5 km, 10 km, and 20 km) in addition to sampling at the test point's true location affects the quality of retrieved candidate taxa $T_{p,d}$. The underlying hypothesis is that taxa may extend their range over time and that in cases where a test point resides close to the border of a tile, its neighbor tile may be as relevant as the containing tile itself. We include additional tiles if their center location $\bar{p} \in \bar{P}$ falls within the sampling radius. The subset $\bar{P} \subseteq P$ contains tiles' center locations only.

When considering an area rather than a single point, it may be necessary to aggregate presence records from multiple tiles. We select four distinct aggregation strategies to study their effect on the quality of retrieved candidate taxa $T_{p,d}$. For each taxon $t_i \in T$, we compute a score $S_{t_i,p,d}$ based on one of these strategies and sort the list $T_{p,d}$ accordingly. These strategies either consider the relative frequency of a taxon's occurrences within those grid tiles covered by the sampling circle of radius $r$ or a normalized Euclidean distance $dist(p, p_j)$ between the test point and eligible tiles' centers defined as those falling within the sampling circle.

We let $P^r_{t_i,p}$ denote the set of locations within radius $r$ around $p$ at which taxon $t_i$ occurs

$$P^r_{t_i,p} = \{\, p_j \in P \mid counts(t_i, p_j) > 0 \wedge dist(p, p_j) \leq r \,\}. \tag{6}$$

The function $counts : T \times P \rightarrow \mathbb{R}$ yields the number of taxon occurrences at a location $p$. The following four strategies S1 ... S4 aggregate the individual contributions of occurrences in $P^r_{t_i,p}$ in order to compute a rank for all $t_i \in T_{p,d}$.

S1 Relative frequency of occurrence records: ranks taxa based on how often they occur within a radius of tiles being sampled:

$$S_{t_i,p,d} = \frac{1}{|P^r_{t_i,p}|} \sum_{p_j \in P^r_{t_i,p}} counts(t_i, p_j). \tag{7}$$

S2 Weighted relative frequency of occurrence records: ranks taxa based on how often they occur within a radius, with their proportion of contribution being reduced the farther away they occur from the center:

$$S_{t_i,p,d} = \frac{1}{|P^r_{t_i,p}|} \sum_{p_j \in P^r_{t_i,p}} \frac{1}{1 + dist(p, p_j)}\, counts(t_i, p_j). \tag{8}$$

S3 Minimum spatial distance to records' tile centers: ranks taxa within the sampling radius based on their closest spatial distance to the test location:

$$S_{t_i,p,d} = 1 - \frac{\min_{p_j \in P^r_{t_i,p}} dist(p, p_j)}{\max_{p_j \in P^r_{t_i,p}} dist(p, p_j)}. \tag{9}$$

S4 Average spatial distance to records' tile centers: ranks taxa within the sampling radius based on each taxon's mean spatial distance to the test location:

$$S_{t_i,p,d} = 1 - \frac{1}{|P^r_{t_i,p}|} \sum_{p_j \in P^r_{t_i,p}} \frac{dist(p, p_j)}{\max_{p_j \in P^r_{t_i,p}} dist(p, p_j)}. \tag{10}$$

In order to obtain the set of taxa $T_{p,d}$, we query the grid tiles across all taxa at a test record's location $p$ and within a radius $r$.

Retrieval from point-based taxon records

We evaluate estimation quality based on GBIF records using the same four aggregation strategies S1 ... S4 that we studied for grid-based presence-absence data and additionally introduce a strategy S5, which considers temporal distance between the date of a test observation and point-based occurrence records.

S5 Temporal distance to months with recorded occurrences: ranks taxa based on a Gaussian-weighted average monthly score centered at the test record's month:

$$S_{t_i,p,d} = \frac{1}{|P^r_{t_i,p}|} \sum_{m=1}^{12} \sum_{p_j \in P^r_{t_i,p}} countsInMonth(t_i, p_j, m) \times \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(m - month(d))^2}, \tag{11}$$

where the function $countsInMonth : T \times P \times \mathbb{N} \rightarrow \mathbb{R}$ yields a taxon's chance of occurring at a particular location during a particular month and $month : date \rightarrow \mathbb{N}$ provides the month of an observation date. S5 is only applicable for the 86% of point-based occurrence records with a valid timestamp. Considering the granularity in which blooming periods are usually specified, we discretize each record's observation date into either one or two out of twelve monthly bins proportionally to the observation day's distance to the middle of the month. We define the temporal distance between a test record's month of year $m \in \mathbb{N} : m \in [1, 12]$ and that taxon's occurrences as the weighted sum of a taxon's monthly scores having the maximal weight centered around the current month and decreasing both ways.

Although potentially being of high precision, GPS locations always suffer from certain spatial inaccuracies, often provided as an additional parameter along with the location. Over 35% of our GBIF records provide this additional value characterizing their spatial accuracy. For this reason and to mitigate the sparsity of GBIF point data, we consider each point of a recorded observation as having an influence on its surroundings. We treat coordinates of an occurrence record as center of a circle having a radius corresponding to its uncertainty with the expectation of a taxon's encounter being highest at the center while linearly decreasing concentrically. For the remaining records without any indication of spatial accuracy we assume a default accuracy of 500 m reflecting the average accuracy of GBIF records providing this information in our study. Similar to the process described before, we query all point-based records within a radius $r$ of a test record's location $p$ to sample occurrence frequencies and times for obtaining the taxa set $T_{p,d}$.

Retrieval from combined grid- and point-based data

In a final set of experiments, we investigate estimation quality based on merged grid-based presence-absence data and point-based taxa occurrence records. We apply the same five aggregation strategies S1 ... S5 introduced above and are interested in understanding whether the combination of both data sources allows for a more complete and precise estimation of a taxon's distribution. Figure 4 illustrates a possible configuration of a map segment aggregating both data sources for one taxon. Occurrence records with different accuracies as well as grid-based presence data at different scales contribute to an average value of how likely a taxon can be expected at a user's location and its surroundings.

Results

We assess the quality of taxa recommendations by measuring how accurately observations from the set of Flickr test data can be retrieved and report results of a series of experiments on grid-based presence-absence data, point-based occurrence records, and a combination of both. In addition, we elaborate on how we run the experiments computationally efficiently. Metrics reported throughout this section include average recall (R), average list length (LL), average list reduction (LR), mean reciprocal rank (MRR) and median rank (M) as defined in the previous section.

Ranked retrieval from grid-based presence-absence data

Table 1 summarizes the results of our first set of experiments retrieving ranked taxa lists from grid-based presence-absence data. From top to bottom, the table shows retrieval results at the exact location and for the four aggregation strategies S1 ... S4. Per strategy we aggregate presence-absence data at four radii 1 km, 5 km, 10 km, and 20 km. The columns of the table refer to our previously introduced evaluation metrics.

We observe a modest average recall of 82.31% when retrieving test observations from the grid cell at the exact position of a test record using solely presence-absence

Fig. 4 Grid section for a single taxon including area and point occurrences with different extents and uncertainties, respectively. The circle shows the sampling radius around the test position (red cross) being queried. The opacity of a tile is proportional to the taxon's likelihood of being encountered there
data. The recall increases up to 96.14% when aggregating data within radii of up to 20 km around a test location. R and LR depend only on the sampling radius and remain unaffected by the aggregation strategies S1 ... S4.

Table 1 Results of ranked taxon retrieval solely using FLORKART grid-based presence-absence data sampled at the exact location and aggregated for increasing radii around Flickr test observations

| Radius [km] | R [%] | $R_{20}$ [%] | $R_{514}$ [%] | MRR [%] | M | LL | LR |
|---|---|---|---|---|---|---|---|
| Retrieval at exact location | | | | | | | |
| 0 | 82.31 | 3.38 | 64.13 | 1.11 | 307 | 680 | 4.54 |
| S1: Relative frequency of occurrence records | | | | | | | |
| 1 | 85.40 | 2.42 | 68.15 | 0.79 | 300 | 787 | 3.79 |
| 5 | 92.35 | 4.62 | 74.94 | 1.52 | 237 | 1115 | 2.59 |
| 10 | 94.47 | 4.78 | 74.42 | 1.35 | 234 | 1286 | 2.23 |
| 20 | 96.14 | 5.65 | 72.39 | 1.81 | 237 | 1477 | 1.92 |
| S2: Weighted relative frequency of occurrence records | | | | | | | |
| 1 | 85.40 | 2.62 | 68.73 | 0.95 | 287 | 787 | 3.79 |
| 5 | 92.35 | 4.36 | 74.80 | 1.56 | 240 | 1115 | 2.59 |
| 10 | 94.47 | 4.74 | 74.70 | 1.55 | 232 | 1286 | 2.23 |
| 20 | 96.14 | 5.71 | 73.88 | 1.78 | 233 | 1477 | 1.92 |
| S3: Minimum spatial distance to records' tile centers | | | | | | | |
| 1 | 85.40 | 4.00 | 63.28 | 1.14 | 330 | 787 | 3.79 |
| 5 | 92.35 | 2.85 | 64.58 | 1.01 | 357 | 1115 | 2.59 |
| 10 | 94.47 | 2.13 | 64.25 | 0.80 | 375 | 1286 | 2.23 |
| 20 | 96.14 | 2.52 | 64.23 | 0.82 | 379 | 1477 | 1.92 |
| S4: Average spatial distance to records' tile centers | | | | | | | |
| 1 | 85.40 | 2.06 | 60.00 | 0.65 | 380 | 787 | 3.79 |
| 5 | 92.35 | 0.46 | 52.91 | 0.37 | 470 | 1115 | 2.59 |
| 10 | 94.47 | 0.68 | 46.32 | 0.37 | 520 | 1286 | 2.23 |
| 20 | 96.14 | 0.81 | 37.00 | 0.37 | 615 | 1477 | 1.92 |

While R is noticeably high meaning that an expected taxon likely appears somewhere on the retrieved list, its actual rank is rarely at the top as indicated by low MRR values. The same result is indicated by low median ranks, e.g., in merely half of the test cases the expected taxon ranks higher than 234th place using S1 and a radius of 10 km. In general, a higher recall of a larger sampling radius is achieved at the cost of an extended candidate list increasing from 680 taxa at the exact location to 1,477 taxa at a radius of 20 km (cp. Table 1). In consequence, we observe relatively poor ranking quality, illustrated by low values for $R_{20}$ and median ranks > 200 at all radii and across all aggregation strategies.

In terms of MRR, the methods relying on distances between test point and quadrant centers (S3 and S4) yield the poorest results. This can be attributed to a very small variety of unique distances, i.e., most taxa attaining the same score, which results from the comparatively coarse-grained FLORKART grid. The problem is less severe when relying on taxa frequency (S1 and S2). Since every FLORKART cell only documents the presence or absence of a particular taxon and not its frequency, these strategies are only applicable when the sampling radius spans multiple FLORKART cells. The weighted aggregation S2 additionally reduces the influence of records with increasing distance from the test location, which allows a finer gradation between center and neighborhood and thus more diverse score values. The effectiveness of this strategy is demonstrated by a 14.8% and 318.9% increase in MRR over S1 and S4 respectively as well as an improvement of the median rank M by 288 positions over S4 when sampling at a radius of 10 km.

Ranked retrieval from point-based occurrence records

Table 2 summarizes the results of our second set of experiments on retrieving ranked taxa lists from point-based occurrence records. Overall, we observe considerably lower recall values compared to the previous set of experiments. At the exact location (r = 0 km), we achieve an average recall of 36.36%. However, with an increasing sampling radius this recall grows to 85.51% at r = 20 km.
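The ranking metrics used in Tables 1 and 2 can be made concrete with a minimal Python sketch. It assumes that a test record counts as retrieved when its taxon appears anywhere in the candidate list, that $R_k$ is the share of records whose taxon ranks at position k or better, that unretrieved records contribute a reciprocal rank of zero, and that the median is taken over retrieved records only. These conventions and the names `evaluate` and `relative_increase` are illustrative assumptions, not the authors' implementation.

```python
from statistics import median
from typing import Dict, Optional, Sequence


def evaluate(ranks: Sequence[Optional[int]],
             list_lengths: Sequence[int],
             k: int = 20) -> Dict[str, float]:
    """Summarise retrieval quality over a set of test records.

    ranks[i] is the 1-based position of the correct taxon in the candidate
    list of test record i, or None if the taxon was not retrieved at all.
    list_lengths[i] is the length of that candidate list.
    Conventions (assumed, see lead-in): misses add 0 to the reciprocal-rank
    sum and are excluded from the median.
    """
    n = len(ranks)
    found = [r for r in ranks if r is not None]
    return {
        "R": 100.0 * len(found) / n,                        # recall in percent
        "R_k": 100.0 * sum(1 for r in found if r <= k) / n,  # recall within top k
        "MRR": 100.0 * sum(1.0 / r for r in found) / n,      # mean reciprocal rank
        "M": float(median(found)) if found else float("nan"),
        "LL": sum(list_lengths) / n,                        # average list length
    }


def relative_increase(new: float, old: float) -> float:
    """Relative change of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    # Hypothetical toy input: five test records, one of them not retrieved.
    print(evaluate([1, 4, None, 25, 2], [300, 280, 310, 295, 305]))
    # MRR values from Table 1 at a radius of 10 km: S2 = 1.55, S1 = 1.35, S4 = 0.37.
    print(round(relative_increase(1.55, 1.35), 1))
    print(round(relative_increase(1.55, 0.37), 1))
```

Applied to the MRR values of Table 1 at 10 km, `relative_increase` gives the 14.8% and 318.9% gains of S2 over S1 and S4 quoted in the text.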
Table 2 Results of ranked taxon retrieval solely using GBIF point-based occurrence records sampled at the exact location and aggregated for increasing radii around Flickr test observations

| Radius [km] | R [%] | $R_{20}$ [%] | $R_{514}$ [%] | MRR [%] | M | LL | LR |
|---|---|---|---|---|---|---|---|
| S1: Relative frequency of occurrence records | | | | | | | |
| 0 | 36.36 | 19.90 | 36.36 | 6.61 | 17 | 73 | 262.00 |
| 1 | 43.40 | 16.43 | 43.40 | 5.06 | 36 | 142 | 218.72 |
| 5 | 59.72 | 11.28 | 58.04 | 3.45 | 89 | 337 | 91.61 |
| 10 | 73.15 | 12.05 | 69.45 | 3.41 | 111 | 504 | 16.60 |
| 20 | 85.51 | 11.12 | 77.68 | 2.71 | 133 | 752 | 5.36 |
| S2: Weighted relative frequency of occurrence records | | | | | | | |
| 1 | 43.40 | 18.15 | 43.38 | 5.54 | 30 | 142 | 218.72 |
| 5 | 59.72 | 14.31 | 58.54 | 4.30 | 70 | 337 | 91.61 |
| 10 | 73.15 | 13.61 | 70.52 | 4.05 | 89 | 504 | 16.60 |
| 20 | 85.51 | 14.98 | 79.73 | 3.77 | 108 | 752 | 5.36 |
| S3: Minimum spatial distance to records' tile centers | | | | | | | |
| 1 | 43.40 | 12.84 | 43.44 | 3.46 | 51 | 142 | 218.72 |
| 5 | 59.72 | 14.87 | 58.46 | 4.12 | 66 | 337 | 91.61 |
| 10 | 73.15 | 16.00 | 71.09 | 4.59 | 77 | 504 | 16.60 |
| 20 | 85.51 | 16.46 | 80.62 | 4.54 | 92 | 752 | 5.36 |
| S4: Average spatial distance to records' tile centers | | | | | | | |
| 1 | 43.40 | 14.51 | 43.39 | 4.63 | 55 | 142 | 218.72 |
| 5 | 59.72 | 12.91 | 58.50 | 3.99 | 76 | 337 | 91.61 |
| 10 | 73.15 | 10.48 | 70.69 | 2.97 | 110 | 504 | 16.60 |
| 20 | 85.51 | 9.68 | 78.83 | 2.83 | 136 | 752 | 5.36 |
| S5: Temporal distance to months with recorded occurrences | | | | | | | |
| 0 | 36.35 | 23.12 | 36.35 | 7.36 | 13 | 73 | 261.10 |
| 1 | 43.39 | 19.81 | 43.39 | 5.81 | 24 | 141 | 218.97 |
| 5 | 59.71 | 12.47 | 58.84 | 3.60 | 77 | 337 | 91.78 |
| 10 | 73.15 | 11.21 | 69.95 | 3.08 | 108 | 503 | 16.67 |
| 20 | 85.50 | 7.25 | 77.88 | 1.96 | 168 | 751 | 5.37 |

We evaluated five ranking strategies for the retrieved taxa lists based on frequency, spatial distance, and temporal distance of occurrences. At a radius of 0 km, aggregation strategies S1 and S5 evaluate the exact computational grid cell of 0.33 km² a test record falls into, producing the highest MRR together with the lowest recall. The remaining strategies S2 ... S4 consider the spatial distance of records and can accordingly be applied only if the sampling radius spans multiple computational grid cells. Though yielding the same recall at respective radii, they differ in ranking quality as expressed by MRR and M. While S2 offers the highest MRR up to 5 km, S3 improves for larger radii, with results for S4 falling in between. Ranking based on temporal distance (S5) operates only on the 86% of GBIF records with an existing and valid observation timestamp. This reduced set of records explains the slightly differing figures in recall, list length, and list length reduction compared to S1. We found that MRR and median rank improve considerably over S1 when applying S5 at sampling radii of up to 5 km, making this strategy a promising option. Aggregating point-based records based on minimum spatial distance (S3) at a radius of 20 km was found to be the best performing strategy, yielding R = 85.51%, MRR = 4.54%, and M = 92.

Ranked retrieval from combined grid- and point-based data

Table 3 summarizes the results of our third set of experiments retrieving ranked taxa lists from a combination of grid-based presence-absence data and point-based occurrence records.

The combination of both data sources increases recall in the computed candidate lists for all sampling radii, e.g., at r = 20 km the individual recall values of 96.14% (FLORKART) and 85.51% (GBIF) increase to 97.4% on the combined data.
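The temporal strategy S5 evaluated above can be sketched in Python as follows. The sketch assumes a unit-variance Gaussian over the month difference, applied literally as in Eq. 11 (so December and January are treated as eleven months apart), and a linear rule for splitting an observation between two monthly bins. The function names and the exact splitting rule are hypothetical illustrations; the paper states only that the split is proportional to the day's distance from the middle of the month.

```python
import calendar
import math
from datetime import date
from typing import Dict, Sequence


def month_bins(day: date) -> Dict[int, float]:
    """Split one observation between its own month and the nearer neighbour.

    Assumption: the share moved to the neighbouring month grows linearly with
    the distance of the observation day from the middle of its month.
    """
    days_in_month = calendar.monthrange(day.year, day.month)[1]
    middle = (days_in_month + 1) / 2.0
    offset = (day.day - middle) / days_in_month  # lies strictly within (-0.5, 0.5)
    if offset == 0:
        return {day.month: 1.0}
    step = 1 if offset > 0 else -1
    neighbour = (day.month - 1 + step) % 12 + 1  # wraps December <-> January
    return {day.month: 1.0 - abs(offset), neighbour: abs(offset)}


def month_weight(m: int, test_month: int) -> float:
    """Unit-variance Gaussian weight of month m around the test month."""
    return math.exp(-0.5 * (m - test_month) ** 2) / math.sqrt(2.0 * math.pi)


def s5_score(monthly_counts: Sequence[Sequence[float]], test_month: int) -> float:
    """Average Gaussian-weighted monthly score over a taxon's locations.

    monthly_counts holds one 12-element sequence (January..December) per
    location within the sampling radius at which the taxon occurs.
    """
    if not monthly_counts:
        return 0.0
    total = 0.0
    for counts in monthly_counts:
        for m, c in enumerate(counts, start=1):
            total += c * month_weight(m, test_month)
    return total / len(monthly_counts)
```

Under these assumptions, a taxon whose records cluster in the test record's month receives a higher score than one recorded several months earlier or later, which is the behaviour the strategy is meant to capture.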
Table 3 Results of ranked taxon retrieval using FLORKART presence-absence data in combination with GBIF point-based occurrence records sampled at the exact location and aggregated for increasing radii around Flickr test observations

| Radius [km] | R [%] | $R_{20}$ [%] | $R_{514}$ [%] | MRR [%] | M | LL | LR |
|---|---|---|---|---|---|---|---|
| S1: Relative frequency of occurrence records | | | | | | | |
| 0 | 86.62 | 20.89 | 74.99 | 7.20 | 121 | 692 | 4.41 |
| 1 | 89.51 | 16.99 | 79.66 | 5.38 | 135 | 810 | 3.67 |
| 5 | 94.10 | 11.62 | 83.88 | 3.67 | 155 | 1142 | 2.54 |
| 10 | 95.98 | 12.19 | 83.03 | 3.55 | 160 | 1320 | 2.18 |
| 20 | 97.40 | 11.02 | 80.09 | 2.77 | 165 | 1525 | 1.86 |
| S2: Weighted relative frequency of occurrence records | | | | | | | |
| 1 | 89.51 | 19.67 | 80.35 | 6.00 | 116 | 810 | 3.67 |
| 5 | 94.10 | 15.38 | 84.33 | 4.60 | 131 | 1142 | 2.54 |
| 10 | 95.98 | 15.16 | 84.38 | 4.28 | 127 | 1320 | 2.18 |
| 20 | 97.40 | 15.00 | 84.08 | 3.83 | 128 | 1525 | 1.86 |
| S3: Minimum spatial distance to records' tile centers | | | | | | | |
| 1 | 89.51 | 2.48 | 68.07 | 0.94 | 330 | 810 | 3.67 |
| 5 | 94.10 | 3.25 | 66.14 | 1.05 | 364 | 1142 | 2.54 |
| 10 | 95.98 | 1.90 | 65.72 | 0.84 | 378 | 1320 | 2.18 |
| 20 | 97.40 | 2.82 | 67.13 | 1.04 | 359 | 1525 | 1.86 |
| S4: Average spatial distance to records' tile centers | | | | | | | |
| 1 | 89.51 | 3.70 | 63.18 | 1.81 | 374 | 810 | 3.67 |
| 5 | 94.10 | 1.09 | 52.77 | 0.66 | 478 | 1142 | 2.54 |
| 10 | 95.98 | 0.76 | 45.51 | 0.43 | 529 | 1320 | 2.18 |
| 20 | 97.40 | 1.05 | 36.70 | 0.42 | 624 | 1525 | 1.86 |
| S5: Temporal distance to months with recorded occurrences | | | | | | | |
| 0 | 36.35 | 23.15 | 36.35 | 7.37 | 13 | 73 | 261.10 |
| 1 | 43.39 | 19.86 | 43.39 | 5.76 | 25 | 141 | 218.97 |
| 5 | 59.71 | 12.52 | 58.82 | 3.60 | 77 | 337 | 91.78 |
| 10 | 73.15 | 11.06 | 69.95 | 3.04 | 108 | 503 | 16.67 |
| 20 | 85.50 | 7.22 | 77.87 | 1.98 | 167 | 751 | 5.37 |
| S2+S5: Combined weighted relative frequency and temporal distance | | | | | | | |
| 0 | 86.62 | 23.98 | 75.78 | 8.85 | 133 | 692 | 4.41 |
| 1 | 89.51 | 22.09 | 79.65 | 7.51 | 119 | 810 | 3.67 |
| 5 | 94.10 | 17.92 | 84.49 | 5.69 | 118 | 1142 | 2.54 |
| 10 | 95.98 | 18.14 | 85.25 | 5.12 | 112 | 1320 | 2.18 |
| 20 | 97.40 | 17.14 | 85.52 | 4.61 | 115 | 1525 | 1.86 |

Even more beneficial is the combination in terms of achieved ranking quality, resulting in significantly improved results. Improvements are, for example, reflected in higher mean reciprocal rank (1.81% vs. 5.69%) and improved median rank (237 vs. 118) (cp. Table 1, S1 at 20 km with Table 3, S2+S5 at 5 km).

In addition to evaluating the scoring methods by themselves, we also studied linear combinations of those and found weighted spatial frequency with temporal scoring (see S2+S5 in Table 3) to yield the highest impact on MRR and M. For S2+S5, the 10th percentile rank is 521, the 90th percentile 8 and the median rank is 118. Figure 5 shows the distribution of ranks for the correct taxon per test record across all individual ranking strategies (S1 ... S5) and the combination of spatio-temporal ranking (S2+S5) at an aggregation radius of 5 km. The figure shows that the correct taxon is ranked more frequently near the beginning of the list for S1, S2, S5, and S2+S5, with frequencies declining towards the end. The combination of S2+S5 shows additional benefits especially for the top ranks. S3 and S4 suffer from more evenly distributed frequencies over most ranks with a visible maximum around their respective median beyond the 350th rank.

Fig. 5 Relative and cumulative frequency per rank of correct taxon for recommending Flickr test records from FLORKART and GBIF datasets, using a search radius of 5 km and six different ranking strategies. The dashed vertical lines mark the median of each distribution

We also wanted to assess the influence that a richer set of point-based occurrence records could have on our result. Therefore, we selected the three sites of the Biodiversity Exploratories project [10]: (a) Schorfheide-Chorin, (b) Hainich-Dün and (c) Schwäbische Alb as test cases. The sites span areas from 422 km² to 1300 km² and have been intensively investigated for plant species occurrences during systematic observations performed since 2006. The data is available on GBIF. However, our Flickr test observations proved to be very sparse for these regions, with merely 13 records in the area of (a), 113 at (b), and 15 at (c). Given the very rich set of GBIF observations, we decided to perform a 10-fold cross-validation using 10% randomly selected GBIF occurrence records from the three areas ($N_a$ = 76,696; $N_b$ = 101,504; $N_c$ = 104,968) as test set and only the remaining 90% as occurrence records. Table 4 reports results for the best performing aggregation method so far (S2+S5) and the combined taxa information consisting of presence-absence data and the 90% occurrence records. Each figure in the table is an average across the ten cross-validation runs. The results show that recall R as well as $R_{514}$ are well above 99% in all three areas. Median ranks between 17 and 33 and an $R_{20}$ of 38% to 56% show the potential of predicting the sought-after taxon near the very top of a recommendation list.

Considerations on computational efficiency

Apart from the influencing factors presented above, the quality of the taxa list depends on the actual implementation. One important consideration is the resolution of the computational grid used for binning occurrence records within close distance. A trade-off has to be made between the required resources in terms of time and space and the potential for improving evaluation metrics. We therefore varied the computational grid resolution while utilizing the best performing combined aggregation strategy S2+S5 with a sampling radius of r = 10 km on joint FLORKART and GBIF data. Our implementation in C++ uses OpenMP to optimize for parallel execution where possible and was run on a state-of-the-art 10-core, 128 GB RAM workstation.
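Binning occurrence records into a computational grid can be illustrated with a short Python sketch. It assumes that a quarter MTB tile spans 5 arc minutes of longitude by 3 arc minutes of latitude and that finer cells are obtained by splitting each side into `subdivisions` equal parts, so that `subdivisions = 10` gives cells of 1/100 of a quarter tile. The function names, the tuple-based record format, and the flat-earth area approximation are hypothetical choices for illustration and are unrelated to the authors' C++ code.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

QUARTER_LON_ARCMIN = 5.0  # assumed east-west extent of a quarter MTB tile
QUARTER_LAT_ARCMIN = 3.0  # assumed north-south extent of a quarter MTB tile
KM_PER_ARCMIN = 1.852     # one arc minute of latitude is about one nautical mile


def cell_index(lat_deg: float, lon_deg: float, subdivisions: int) -> Tuple[int, int]:
    """Return the (row, column) of the grid cell containing a coordinate."""
    cell_lat = QUARTER_LAT_ARCMIN / 60.0 / subdivisions
    cell_lon = QUARTER_LON_ARCMIN / 60.0 / subdivisions
    return (math.floor(lat_deg / cell_lat), math.floor(lon_deg / cell_lon))


def bin_occurrences(records: Iterable[Tuple[str, float, float]],
                    subdivisions: int) -> Counter:
    """Count records per (taxon, cell); records are (taxon, lat, lon) tuples."""
    counts: Counter = Counter()
    for taxon, lat, lon in records:
        counts[(taxon, cell_index(lat, lon, subdivisions))] += 1
    return counts


def approx_cell_area_km2(lat_deg: float, subdivisions: int) -> float:
    """Approximate cell area, shrinking the east-west extent by cos(latitude)."""
    height = QUARTER_LAT_ARCMIN / subdivisions * KM_PER_ARCMIN
    width = (QUARTER_LON_ARCMIN / subdivisions * KM_PER_ARCMIN
             * math.cos(math.radians(lat_deg)))
    return height * width
```

The number of cells grows with the square of `subdivisions`, so memory use rises quickly at finer resolutions, which is the trade-off between resources and ranking quality examined in this section.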
Resolution, expressed in relation to the quarter MTB tiles originally used to record presence-absence data, gradually increases from top to bottom in Table 5. Results show R remaining around 96%, while $R_{20}$ and $R_{514}$ increase slightly and the median rank improves by up to 28 places at finer resolutions. We suspect that GBIF data is too sparse for a finer resolution to have a more pronounced impact. The discretization also introduces rounding errors which distort the results. Given the best tradeoff between R and M, we settled on a 0.33 km² tile size, which is 100 times finer than FLORKART quarter tiles. This granularity provides the lowest median rank of 112 and an overall recall of 95.98%; it has been used for all other computations in this paper.

Table 4 Results of ranked taxon retrieval in selected regions using FLORKART areal data combined with 10-fold cross-validation on GBIF point data

| Region | R [%] | $R_{20}$ [%] | $R_{514}$ [%] | MRR [%] | M | LL | LR |
|---|---|---|---|---|---|---|---|
| (a) Schorfheide-Chorin | 99.95 | 56.39 | 99.86 | 17.42 | 17 | 943 | 2.95 |
| (b) Hainich-Dün | 99.72 | 48.16 | 99.59 | 13.08 | 22 | 1058 | 2.65 |
| (c) Schwäbische Alb | 99.95 | 38.03 | 99.83 | 10.47 | 33 | 935 | 2.98 |

Table 5 Influence of grid resolution on evaluation metrics for S2+S5 and r = 10 km

| × Quarter MTB tile | Avg. area [km²] | Runtime | RAM [GB] | R [%] | $R_{20}$ [%] | $R_{514}$ [%] | MRR [%] | M | LL | LR |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 131.49 | 1.0× | 0.5 | 96.45 | 16.14 | 84.00 | 4.92 | 140 | 1,349 | 2.12 |
| 1 | 32.87 | 1.1× | 0.7 | 95.79 | 16.60 | 84.91 | 5.36 | 126 | 1,285 | 2.24 |
| 1/16 | 2.05 | 4.9× | 5.7 | 96.20 | 17.85 | 85.13 | 5.36 | 114 | 1,331 | 2.16 |
| 1/64 | 0.51 | 15.4× | 21.0 | 95.93 | 18.24 | 85.26 | 5.21 | 116 | 1,327 | 2.17 |
| 1/100 | 0.33 | 20.5× | 33.2 | 95.98 | 18.19 | 85.24 | 5.14 | 112 | 1,320 | 2.18 |
| 1/144 | 0.23 | 29.6× | 47.0 | 95.97 | 18.22 | 85.23 | 5.04 | 115 | 1,323 | 2.17 |

Discussion

Grid-based presence-absence data

Noticeably, recall does not reach 100% using grid-based FLORKART presence-absence data, but shows an increase when sampling a larger radius around the test location. While this may indicate that taxa extended their range since they were observed for FLORKART, it mainly suggests that our test data, being more representative of observations that an interested hobbyist rather than a botanist may acquire in the field, are not accurately captured by FLORKART information alone. Flickr test records come from a multitude of users and also include cultivated plants observed in urban environments, e.g., city parks and (botanical) gardens. Accordingly, the ten taxa most frequently failing correct prediction include ornamental and garden plants, such as Narcissus pseudonarcissus (Wild Daffodil), Helleborus niger (Christmas Rose), Eranthis hyemalis (Winter Aconite), Helianthus annuus (Common Sunflower), and Leucanthemum vulgare (Oxeye Daisy) as well as cultivated and medicinal plants, such as Brassica napus (Rapeseed), Cornus mas (Cornelian Cherry), Eschscholzia californica (California Poppy), and Prunus cerasifera (Cherry Plum). We should therefore seek to include taxa whose presence is not captured in wildlife presence-absence data. In addition to the mediocre retrieval performance, we also observe a relatively poor ranking quality as a direct result of using binary data without any notion of abundance. Using solely presence-absence data means that a rarely observed taxon will be ranked exactly the same as another, potentially very common one that occurs within the same grid tile.

Point-based occurrence records

GBIF point-based occurrence records are spatially sparse and irregularly spread across the study region. Contrary to the presence-absence data, they have not been systematically sampled. Accordingly, we observe considerably lower average recall at the location of a test record. Using a larger sampling radius leads to substantially higher recall. At the largest evaluated radius of 20 km, we achieve a recall of 86% and an average candidate list length of 752 taxa. This list length is comparable to that computed based on the systematically sampled FLORKART data at comparable recall, i.e., 787 at 85%. This result raises expectations towards future use of GBIF data with its continuously increasing number of records. GBIF data offers an insight that presence-absence data do not provide. Multiple records of the same taxon in close proximity can be aggregated into an observation frequency, allowing us to estimate which taxa a user would more likely try to identify. Using this information, we observe a substantially higher mean reciprocal rank and an improved median rank across all evaluated aggregation strategies S1 ... S5. We found the minimum spatial distance S3 between a test record and existing GBIF records to yield the best ranking results.

Combined grid- and point-based data

Occurrence records contributed to GBIF via citizen science projects are not limited to wildlife plant observations. Therefore, using both data sources in combination mitigates the missing predictions of taxa that are hard to estimate based on wildlife presence-absence data alone. We found that combining data sources yields the highest recall across all experiments with a maximum of 97.4% at a sampling radius of r = 20 km. This result demonstrates that the different data sources are in fact complementary. Taxa that gain the largest absolute improvement by combining data are Leucanthemum vulgare, Prunus cerasifera, Narcissus pseudonarcissus, Eranthis hyemalis, Cornus mas, Helleborus niger, and Brassica napus. Although the recall improves by combining data, it still does not reach 100%, i.e., retrieved taxa lists are still incomplete with respect to the test observations obtained from Flickr. This is in part due to some locations and taxa which yield exceedingly low recall, i.e., false negatives when evaluated on the test data. False negatives predominantly occur at urban land cover types [43], i.e., discontinuous urban fabric (32%) and sport and leisure facilities (13%). Taking a closer look at the results of S2+S5 at r = 5 km, the average recall is only 94.10% due to 345 individual taxa not being retrieved in the missing 5.90%. Among the top 66% of these 345 taxa, 90.7% are crop and garden plants. The top three are Brassica napus, Narcissus pseudonarcissus, and Cornus mas. These three taxa alone account for 13% of the missing recall.

In terms of candidate list ranking, we observed the best results by combining spatially weighted occurrence frequencies (S2) and temporal distance (S5), shown by consistently highest MRR values. Improved ranking allows for shorter candidate lists, which for instance is supported by $R_{514}$ reaching a plateau around 84.5% at r = 5 km, indicating a high chance of including the correct taxon before the 514th rank. An average list length of 1,142 at that distance shows that one would need to consider only 41% of all taxa of interest in Germany at a given location. Depending on the intended use case, a compromise between recall and mean reciprocal rank has to be made. For a list as complete as possible one would sample a larger area, whereas a greater list length reduction can be achieved by sampling smaller regions.

An additional evaluation only at the three Biodiversity Exploratory sites yielded recall close to 100% and a remarkable 56% chance of the correct taxon being among the top 20 positions of a retrieved list. This result is very promising and shows how results can be improved with more point-based observation records in the future.

Limitations

On average, our recommended list contains 1,142 taxa using a sampling radius of 5 km and the S2+S5 strategy on combined observation data, corresponding to a list reduction of 2.54. Despite being substantially reduced, the list is still long, prompting us to examine whether the retrieved length is plausible. Studies [11, 12] recording species richness with respect to land cover found a total of 623 and 546 vascular plant species on grassland and forest plots, respectively. Since we do not consider land cover types for our study and base our estimations on FLORKART data with a maximal resolution of 30 km² and a median number of 514 taxa per tile, we consider the resulting list lengths plausible. It is a future exercise to combine other data sources and to possibly increase resolution and precision of our estimations. To rule out the possibility of our own discretization having an adverse effect on data quality, we evaluated results across multiple resolutions as one aspect of our study.

Although being high, recall does not reach 100% in our experiments. One possible explanation is insufficient data quality since our datasets originate from manual acquisition processes. Revising maps with an extent such as FLORKART is an ongoing process that can never be expected to be complete. The range of species is highly dynamic as a consequence of, e.g., climatic differences and changes in land use. Some observations date back several decades while even the more current ones originate from mapping projects carried out in at least 47 federal project regions. GBIF's observation records have been collected in an even more irregular manner, e.g., including citizen-science projects. We were able to mitigate some problems by analyzing data quality and eliminating erroneous records based on a set of heuristics (e.g., implausible dates and locations).

We purposely chose Flickr observations as test data since they reflect potential users and resemble a use case in which a taxon recommendation system could be applied. For instance, some test records are taken in urban environments (cp. Fig. 3), such as city parks, botanical gardens and backyards. However, the data is neither curated nor verified by experts and is therefore expected to contain errors, although verification of user-provided tags through image classification may yield improvements. Flickr records may be imprecise in the labeled taxa as well as the recorded location. In extreme cases, images may not be taken at the place of the original taxon occurrence, e.g., images of Abies nordmanniana could show a Christmas tree in a living room. On the upside, this provides a chance of seeing results evaluated under a worst-case scenario. By conducting a cross-validation with GBIF records, we were able to show that our underlying method can yield results of much higher quality when operating on a richer and more fine-grained dataset.

Conclusions

Recommending a list of plant taxa most likely to be observed at a given geographical location and time is useful for species identification as well as biodiversity research. We studied achievable recommendation quality based on two fundamental types of information, individually and in combination: binary presence-absence data and individually collected occurrence records. Furthermore, we aggregated data with increasing sampling radii around test locations and according to five formally defined aggregation strategies. Additionally, we investigated the influence of data discretization granularity on recommendation quality as well as on computational efficiency.
When relying solely on presence-absence data, the current state-of-the-art when looking for taxa that occur at a certain location, we managed to retrieve merely 82.31% of the test records, recommending the correct one at a median rank of 307. By combining both data sources, increasing the sampling radius, and using a sophisticated aggregation strategy we were able to retrieve 95.98% of the test records, recommending the correct one at a median rank of 112. When focusing on regions heavily sampled in terms of occurrence records, we even retrieved more than 99% of the test records' taxa, with the sought-after one ranking on average at the 24th place. In conclusion, we found that both studied data sources are highly complementary for use in a recommendation system. We demonstrated that such a system can be highly efficient in reducing the search space for species identification tasks, with on average only 41% of all taxa needing to be considered at a given location. Our cross-validation results also suggest that, with the ongoing growth of species occurrence records in repositories like GBIF, these results are likely to improve further.

Acknowledgements
We acknowledge support for the Article Processing Charge by the Thuringian Ministry for Economic Affairs, Science and Digital Society and the Open Access Publication Fund of the Technische Universität Ilmenau.

Funding
We are funded by the German Ministry of Education and Research (BMBF) grants: 01LC1319A and 01LC1319B; the German Federal Ministry for the Environment, Nature Conservation, Building and Nuclear Safety (BMUB) grant: 3514 685C19; and the Stiftung Naturschutz Thüringen (SNT) grant: SNT-082-248-03/2014.

Availability of data and materials
The list of Flickr photo IDs comprising our test data set as well as the project's source code are available at https://sites.google.com/site/specrecbmc. The GBIF occurrence data set is available at https://doi.org/10.15468/dl.5zmlxt.

Authors' contributions
Funding acquisition: PM, JW; experiment design: MS, HCW, JW, PM, MR; data analysis: HCW, MS; data visualization: HCW; writing manuscript: HCW, MS, PM, JW, MR; all authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details
Institute for Computer and Systems Engineering, Technische Universität Ilmenau, Helmholtzplatz 5, 98693 Ilmenau, Germany. Department Biogeochemical Integration, Max-Planck-Institute for Biogeochemistry, Hans-Knöll-Str. 10, 07745 Jena, Germany.

Received: 15 December 2017 Accepted: 14 May 2018

References
1. Elphick CS. How you count counts: the importance of methods research in applied ecology. J Appl Ecol. 2008;45(5):1313–20.
2. Brach AR, Boufford DE. Why are we still producing paper floras? Ann Mo Bot Gard. 2011;98(3):297–300.
3. Farnsworth EJ, Chu M, Kress WJ, Neill AK, Best JH, Pickering J, Stevenson RD, Courtney GW, VanDyk JK, Ellison AM. Next-generation field guides. BioScience. 2013;63(11):891–9.
4. Austen GE, Bindemann M, Griffiths RA, Roberts DL. Species identification by experts and non-experts: comparing images from field guides. Sci Rep. 2016;6:.
5. Ceballos G, Ehrlich PR, Barnosky AD, García A, Pringle RM, Palmer TM. Accelerated modern human–induced species losses: Entering the sixth mass extinction. Sci Adv. 2015;1(5):. https://doi.org/10.1126/sciadv.1400253. http://advances.sciencemag.org/content/1/5/e1400253.full.pdf.
6. Hopkins G, Freckleton R. Declines in the numbers of amateur and professional taxonomists: implications for conservation. In: Animal Conservation Forum. Cambridge University Press; 2002. p. 245–9.
7. Wisskirchen R, Haeupler H. Standardliste der Farn- und Blütenpflanzen Deutschlands. Stuttgart: Eugen Ulmer; 1998.
8. Preston CD, Pearman D, Dines TD, et al. New Atlas of the British & Irish Flora. Oxford: Oxford University Press; 2002.
9. Flora of North America Editorial Committee. Flora of North America: North of Mexico. New York and Oxford: Oxford University Press; 1993.
10. Fischer M, Bossdorf O, Gockel S, Hänsel F, Hemp A, Hessenmöller D, Korte G, Nieschulze J, Pfeiffer S, Prati D, Renner S, Schöning I, Schumacher U, Wells K, Buscot F, Kalko EKV, Linsenmair KE, Schulze E-D, Weisser WW. Implementing large-scale and long-term functional biodiversity research: The biodiversity exploratories. Basic Appl Ecol. 2010;11(6):473–85. https://doi.org/10.1016/j.baae.2010.07.009.
11. Socher SA, Prati D, Boch S, Müller J, Baumbach H, Gockel S, Hemp A, Schöning I, Wells K, Buscot F, Kalko EKV, Linsenmair KE, Schulze E-D, Weisser WW, Fischer M. Interacting effects of fertilization, mowing and grazing on plant species diversity of 1500 grasslands in germany differ between regions. Basic Appl Ecol. 2013;14(2):126–36. https://doi.org/10.1016/j.baae.2012.12.003.
12. Boch S, Prati D, Müller J, Socher S, Baumbach H, Buscot F, Gockel S, Hemp A, Hessenmöller D, Kalko EKV, Linsenmair KE, Pfeiffer S, Pommer U, Schöning I, Schulze E-D, Seilwinder C, Weisser WW, Wells K, Fischer M. High plant species richness indicates management-related disturbances rather than the conservation status of forests. Basic Appl Ecol. 2013;14(6):496–505. https://doi.org/10.1016/j.baae.2013.06.001.
13. Wäldchen J, Rzanny M, Seeland M, Mäder P. Automated plant species identification – trends and future directions. PLoS Comput Biol. 2018;14(4):1005993. https://doi.org/10.1371/journal.pcbi.1005993.
14. GBIF: The Global Biodiversity Information Facility. What is GBIF? [12th October 2017]. 2017. Available from http://www.gbif.org/what-is-gbif.
15. Elith J, Leathwick JR. Species Distribution Models: Ecological Explanation and Prediction Across Space and Time. Annu Rev Ecol Evol Syst. 2009;40:677–97. https://doi.org/10.1146/annurev.ecolsys.110308.12015.
16. Cassini MH. Ecological principles of species distribution models: the habitat matching rule. J Biogeogr. 2011;38(11):2057–65. https://doi.org/10.1111/j.1365-2699.2011.02552.x.
17. Beck J, Böller M, Erhardt A, Schwanghart W. Spatial bias in the gbif database and its effect on modeling species' geographic distributions. Ecol Inform. 2014;19:10–15. https://doi.org/10.1016/j.ecoinf.2013.11.002.
18. Hernandez PA, Graham CH, Master LL, Albert DL. The effect of sample size and species characteristics on performance of different species distribution modeling methods. Ecography. 2006;29(5):773–85. https://doi.org/10.1111/j.0906-7590.2006.04700.x.
19. Wisz MS, Hijmans RJ, Li J, Peterson AT, Graham CH, Guisan A, Group NPSDW. Effects of sample size on the performance of species distribution models. Divers Distrib. 2008;14(5):763–73. https://doi.org/10.1111/j.1472-4642.2008.00482.x.
20. Graham CH, Ferrier S, Huettman F, Moritz C, Peterson AT. New developments in museum-based informatics and applications in biodiversity analysis. Trends Ecol Evol. 2004;19(9):497–503.
21. Araújo MB, Guisan A. Five (or so) challenges for species distribution modelling. J Biogeogr. 2006;33(10):1677–88. https://doi.org/10.1111/j.1365-2699.2006.01584.x.
22. Jiménez-Valverde A, Lobo JM, Hortal J. Not as good as they seem: the importance of concepts in species distribution modelling. Divers Distrib. 2008;14(6):885–90. https://doi.org/10.1111/j.1472-4642.2008.00496.x.
23. Elith J, Graham CH, Anderson RP, Dudík M, Ferrier S, Guisan A, Hijmans RJ, Huettmann F, Leathwick JR, Lehmann A, Li J, Lohmann LG, Loiselle BA, Manion G, Moritz C, Nakamura M, Nakazawa Y, Overton JM, Townsend Peterson A, Phillips SJ, Richardson K, Scachetti-Pereira R, Schapire RE, Soberón J, Williams S, Wisz MS, Zimmermann NE. Novel methods improve prediction of species' distributions from occurrence data. Ecography. 2006;29(2):129–51. https://doi.org/10.1111/j.2006.0906-7590.04596.x.
24. Jetz W, McPherson JM, Guralnick RP. Integrating biodiversity distribution knowledge: toward a global map of life. Trends Ecol Evol. 2012;27(3):151–9. https://doi.org/10.1016/j.tree.2011.09.007.
25. Goldsmith GR, Morueta-Holme N, Sandel B, Fitz ED, Fitz SD, Boyle B, Casler N, Engemann K, Jørgensen PM, Kraft NJB, McGill B, Peet RK, Piel WH, Spencer N, Svenning J-C, Thiers BM, Violle C, Wiser SK, Enquist BJ. Plant-o-matic: a dynamic and mobile guide to all plants of the americas. Methods Ecol Evol. 2016;7(8):960–5. https://doi.org/10.1111/2041-210X.
26. Phillips SJ, Dudík M. Modeling of species distributions with maxent: new extensions and a comprehensive evaluation. Ecography. 2008;31(2):161–75. https://doi.org/10.1111/j.0906-7590.2008.5203.x.
27. Wäldchen J, Mäder P. Plant species identification using computer vision techniques: A systematic literature review. Archiv Comput Methods Eng. 2017;1–37. https://doi.org/10.1007/s11831-016-9206-z.
28. Seeland M, Rzanny M, Alaqraa N, Wäldchen J, Mäder P. Plant species classification using flower images – a comparative study of local feature representations. PLOS ONE. 2017;12(2):1–29. https://doi.org/10.1371/journal.pone.0170629.
42. GBIF RESTful JSON-based API. http://api.gbif.org/v1. Accessed 3 May 2017.
43. Bossard M, Feranec J, Otahel J. CORINE land cover technical guide - Addendum 2000, Technical report No 40. Copenhagen: European Environment Agency; 2000.
44. Flickr Photo/video Hosting Service. https://www.flickr.com. Accessed 4 May 2017.
29. Rzanny M, Seeland M, Wäldchen J, Mäder P.
Acquiring and preprocessing leaf images for automated plant identification: understanding the tradeoff between effort and information gain. Plant Methods. 2017;13(1):97. https://doi.org/10.1186/s13007-017-0245-8. 30. Hofmann M, Seeland M, Mäder P. Efficiently annotating object images with absolute size information using mobile devices. Int J Comput Vis. 2018. https://doi.org/10.1007/s11263-018-1093-3. 31. Berg T, Liu J, Lee SW, Alexander ML, Jacobs DW, Belhumeur PN. Birdsnap: Large-scale fine-grained visual categorization of birds. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE; 2014. p. 2019–26. https://doi.org/10.1109/CVPR.2014.259. 32. Tang K, Paluri M, Fei-Fei L, Fergus R, Bourdev L. Improving image classification with location context. In: 2015 IEEE International Conference on Computer Vision (ICCV). Boston: IEEE; 2015. p. 1008–16. https://doi. org/10.1109/ICCV.2015.121. 33. Barry S, Elith J. Error and uncertainty in habitat models. J Appl Ecol. 2006;43(3):413–23. https://doi.org/10.1111/j.1365-2664.2006.01136.x. 34. Chandler M, See L, Copas K, Bonde AMZ, López BC, Danielsen F, Legind JK, Masinde S, Miller-Rushing AJ, Newman G, Rosemartin A, Turak E. Contribution of citizen science towards international biodiversity monitoring. Biol Conserv. 2016. https://doi.org/10.1016/j.biocon.2016.09. 35. EDIT Platform for Cybertaxonomy. http://api.cybertaxonomy.org/ rl_standardliste. Accessed 3 May 2017. 36. Müller F, Ritz CM, Welk E, Wesche K. Rothmaler-Exkursionsflora Von Deutschland: Gefäßpflanzen: Kritischer Ergänzungsband. Berlin, Heidelberg: Springer; 2016. 37. Netzwerk Phytodiversität Deutschland und Bundesamt für Naturschutz. Verbreitungsatlas der Farn- und Blütenpflanzen Deutschlands, LV-Buch. Münster: Landwirtschaftverlag; 2013. 38. FloraWeb. Daten und Informationen zu Wildpflanzen und zur Vegetation Deutschlands. http://www.floraweb.de. Accessed 3 May 2017. 39. Kühn I, Brandl R, Klotz S. 
The flora of german cities is naturally species rich. Evol Ecol Res. 2004;6(5):749–64. 40. Kühn I, Bierman SM, Durka W, Klotz S. Relating geographical variation in pollination types to environmental and spatial factors using novel statistical methods. New Phytologist. 2006;172(1):127–39. 41. GBIF.org (23rd August 2017) GBIF Occurrence Download. https://doi.org/ 10.15468/dl.5zmlxt. Accessed 23 Aug 2017. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Bioinformatics Springer Journals

Recommending plant taxa for supporting on-site species identification

Publisher
Springer Journals
Copyright
Copyright © 2018 by The Author(s)
Subject
Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Algorithms
eISSN
1471-2105
D.O.I.
10.1186/s12859-018-2201-7

Abstract

Background: Predicting a list of plant taxa most likely to be observed at a given geographical location and time is useful for many scenarios in biodiversity informatics. Since efficient plant species identification is impeded mainly by the large number of possible candidate species, providing a shortlist of likely candidates can help significantly expedite the task. Whereas species distribution models heavily rely on geo-referenced occurrence data, such information still remains largely unused for plant taxa identification tools.

Results: In this paper, we conduct a study on the feasibility of computing a ranked shortlist of plant taxa likely to be encountered by an observer in the field. We use the territory of Germany as case study with a total of 7.62M records of freely available plant presence-absence data and occurrence records for 2.7k plant taxa. We systematically study achievable recommendation quality based on two types of source data: binary presence-absence data and individual occurrence records. Furthermore, we study strategies for aggregating records into a taxa recommendation based on location and date of an observation.

Conclusion: We evaluate recommendations using 28k geo-referenced and taxa-labeled plant images hosted on the Flickr website as an independent test dataset. Relying on location information from presence-absence data alone results in an average recall of 82%. However, we find that occurrence records are complementary to presence-absence data and using both in combination yields considerably higher recall of 96% along with improved ranking metrics. Ultimately, by reducing the list of candidate taxa by an average of 62%, a spatio-temporal prior can substantially expedite the overall identification problem.

Keywords: Plant identification, Location-based, Classification, Spatio-temporal context, Recommender system, Occurrence prediction, Plant distribution

*Correspondence: hans-christian.wittich@tu-ilmenau.de; patrick.maeder@tu-ilmenau.de. Institute for Computer and Systems Engineering, Technische Universität Ilmenau, Helmholtzplatz 5, 98693 Ilmenau, Germany. Full list of author information is available at the end of the article.

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Wittich et al. BMC Bioinformatics (2018) 19:190

Background

Accurate plant species identification represents the basis for all aspects of plant related research and is an important component of workflows in plant ecological research [1]. Numerous activities, such as studying the biodiversity richness of a region, monitoring populations of endangered species, determining the impact of climate change on species distribution, and weed control actions depend on accurate identification skills. They are a necessity for physiologists, pharmacologists, conservation biologists, technical personnel of environmental agencies, or just fun for laypersons [2–4]. Expediting the task and making it feasible for non-experts is highly desirable, especially considering the continuous loss of plant biodiversity [5] as well as the continuous loss of plant taxonomists [6]. The principal challenge in plant identification arises from the vast number of potential species. Even when narrowing the focus to the flora of a single country, thousands of species need to be discriminated. The flora of Germany exhibits about 3800 indigenous species [7], the British & Irish flora comprises around 3000 [8], and the flora of Northern America exhibits about 20,000 species of vascular plants [9].

However, most species are not evenly distributed throughout a larger region as they require more or less specific combinations of biotic and abiotic factors and resources to be present for their development. Therefore, plant species can be encountered within their specific ranges. The German Biodiversity Exploratories project [10] studied sites spanning an area of 422 km² to 1300 km² and found that on grassland sites 318 to 365 vascular plant species occurred [11], while on forest sites merely 277 to 376 species were present [12]. These figures represent less than 10% of the entire German flora. Knowing where species occur has long been of interest, dating back to Linné and Humboldt with mapping projects evolving in terms of coverage and level of detail over time. A geographic range map represents the area throughout which a species occurs, referred to as 'extent of occurrence' by the International Union for Conservation of Nature (IUCN). Using range maps as they appear in field guides to support manual species identification has been state-of-the-art for quite some time. However, species identification is changing and the usability of field guides has often been debated. Taking a user's current position in the field to estimate which species could possibly be encountered nearby can simplify identification tasks and is highly suitable given today's prevalence of mobile devices with self-localization technology.

In this paper, we study whether previously recorded species occurrence information can be used to develop a recommendation system to significantly reduce the number of species for the identification task. Resulting recommendations could either be used on their own or be incorporated into species identification services to improve accuracy [13]. We conduct a systematic study on different data sources and aggregation strategies to evaluate how accurately taxa can be retrieved depending on location and time of a new observation. We select the territory of Germany as study region since its flora is particularly well described with curated, openly available databases. In particular, we use the following two sources of data. First, grid-based range maps published by the Federal Agency for Nature Conservation via the FLORKART project. Second, plant observations obtained from the Global Biodiversity Information Facility (GBIF), a service aiming to mobilize biodiversity data from museums, surveys, and other data sources by collating locally digitized and stored data in an online data search portal [14].

Previous research exists in two different research directions: species distribution modeling as well as automated species and object identification.

Species distribution modeling (SDM)

SDMs are associative models relating occurrence or abundance data of individual species at known locations to information on the environmental characteristics of those locations (modified from [15], [16]). Once trained, SDMs can predict suitable habitats for species based on the utilized environmental characteristics. While initial studies were mainly seeking insight into causal drivers of species distributions, recent studies focus on predicting distributions across landscapes to gain ecological and evolutionary insights that require extrapolation in space and time [15].

SDMs utilize occurrence data as answer set while training the model and identifying a characteristic set of predictor variables. This enables their application in areas that have not been intensively sampled or under hypothetically changing conditions, e.g., climate change. However, using a limited set of predictor variables often results in limited accuracy and spatial resolution. While these restrictions are acceptable for ecological and environmental research on larger scales, the problem we study requires spatially fine-grained estimations. Prediction results were found to strongly depend on sampling bias [17], sampling size [18, 19], and location uncertainty [20], decreasing the confidence in SDM results [21, 22]. Further challenges for SDMs include the improvement of methods for modeling presence-only data, model selection and evaluation as well as proper assessment of model uncertainty [23].

The Map of Life service uses SDM to provide certain species range maps for confined geographical areas. Different data sources such as expert species range maps, species occurrence records, and ecoregions are aggregated to describe species distributions worldwide [24]. However, the service is hardly of any use for the purpose of species identification since for example the whole area of Germany seems to be discretized into ≈ 25 tiles and the only retrieved plant species for this region are ten conifer species.

The Plant-O-Matic app utilizes SDM to predict a list of all plant species expected to occur at a user's location [25]. For its predictions, the approach uses a 100 × 100 km discretization grid and 3.6M observations of 89k non-cultivated plant species native in America. For rare species (30k) with only one or two observations the geographic range is defined as a 75,000 km² square area surrounding the occurrence locations. For 12k species with three to four observations, the range is defined as convex hull enveloping all occurrence points. For the remaining 45k species with more than five occurrences, range maps were predicted using the MaxEnt SDM [26]. MaxEnt uses 19 layers of world climate data and 19 spatial filters capturing the geometry of the studied areas as predictor variables. The approach predicts rather long and non-ranked species lists given the coarse-grained computational discretization and the sparse observation data.

Automated species and object identification

We found no study that utilizes the location of an observation to support the identification of unknown plant specimen despite intensive research and manifold studies in this area [27]. Previous studies largely focus on image recognition techniques for automated plant species identification [28], how those can be enhanced by careful selection of image types [29] and contextual information such as plant size [30]. However, there exists previous work on more general identification problems that utilizes location data.

Berg et al. used observation time and location of images for supporting automated bird species identification by computing spatio-temporal prior probabilities for the bird species' occurrences in North America [31]. Bird-sighting records are discretized into spatio-temporal cubes of 1° latitude-longitude and six days. The authors compute the prior at a given location and time as ratio of the estimated density of species observations and the estimated density of any observation at the same location and time. The authors used 75M bird-sighting records of 500 bird species originating from a citizen-science network. By combining image recognition and the spatio-temporal prior, top-5 accuracy of correctly identified bird specimen improved by 15% relatively (≈ 10% absolutely), indicating that the use of spatio-temporal priors can significantly support automated species identification.

Tang et al. studied the usage of location context for the problem of image classification for 100 location-sensitive classes such as 'Beach', 'Disneyland', and 'Mountain' [32]. They constructed high-dimensional (>80k) feature vectors representing contextual information about an image's location. These features are computed per image location and derived from five sources: (1) a 25×25 km grid-based discretization of the location (20k dim); (2) normalized pixel colors from 17×17 px patches of ten map types referring to average vegetation, congressional district, ecoregions, elevation, hazardous waste, land cover, precipitation, solar resource, total energy, and wind resource (9k dim); (3) regional statistics on age, sex, race, family and relationships, income, health insurance, education, veteran status, disabilities, work status, and living conditions (21k dim); (4) hashtag frequency on Instagram at 10 radii (2k dim); (5) visual context as probability of 594 common concepts appearing on social media website at 10 radii (30k dim). Following a dimensional reduction, these context features are concatenated with the visual features and incorporated into a Convolutional Neural Network before its softmax layer. The authors report a 19% relative gain in mean average precision (7% absolute) and a 6% relative improvement of top-5 accuracy (4.5% absolute). Both studies clearly suggest that analyzing location and temporal context of an identification can substantially improve identification accuracy.

Our approach is unique in that it relies on actual observation data directly rather than inferring species distribution by means of a model taking these data as input for training. Being subject to model reliability and data quality issues [33], SDMs are used to predict a potential range whereas we base our estimation entirely on factual observations. Previous studies on automated species identification have shown the benefit of using location information for improving identification results. They did however not investigate the accuracy of ranked taxa recommendations retrieved directly from occurrence data. As such observation records are becoming increasingly available via online services, providing comprehensive sets of presence-absence as well as presence-only occurrence records, we argue that a systematic study is required that evaluates how spatio-temporal context information can be exploited to inform on-site plant species identification.

Methods

Study region and taxa

We use the territory of Germany as evaluation area for our study. Besides giving us the opportunity to test our estimations on site, Germany is representative for countries with well-documented species populations in range maps and specimen collections. Moreover, active groups of passionate professionals constantly contribute observation data [34].

In search of a complete species list, we decided to take the widely accepted list of ferns and vascular plants of Germany [35] collected by Wisskirchen and Haeupler [7] as a basis. The list was revised addressing the following two issues. First, some taxa are known to be exceptionally difficult to distinguish from each other, their identification relying on very special characters and often being impossible to accomplish in the field without a reference collection, even for experts. We subsumed 858 species belonging to five of these critical taxa [36] under their respective parent taxa Ranunculus auricomus, Rubus, Sorbus, Taraxacum, and Hieracium. Secondly, we excluded 251 hybrid species expected to cause inconsistent and unreliable identifications. Thus, our list is composed of 2,771 plant taxa containing 2,766 taxa at species level as well as four at genera and one at aggregate level being treated as leaves of the taxonomic scheme in our study.

Grid-based presence-absence data

Grid-based presence-absence data stems from large-scale efforts to systematically map geographic regions. Being the most comprehensive data source for Germany and providing data for its entire area, we employ the FLORKART project. FLORKART is the result of cumulative mapping involving thousands of voluntary surveyors and literature reviews in several organizational subunits [37]. The data is freely accessible via the information system FloraWeb [38] run by the Federal Agency for Nature Conservation on behalf of the German Network for Phytodiversity (NetPhyD). In FLORKART, presence of a species is recorded on the basis of grid tiles, originally representing pages of 'Messtischblatt' (MTB) ordnance survey maps with a scale of 1:25,000. Each tile covers a section of 10' longitude × 6' latitude, corresponding to a surface area of approximately 118 km² in the north to 140 km² in the south of Germany. However, only 3.5% of FLORKART grid tiles are of this coarse-grained resolution, with many of them superseded. The majority of presence-absence information today is provided on the scale of quarter tiles, subdividing each MTB into four parts. In spite of the increased resolution each tile still only carries the binary information whether a species appears in it or not. Neither exact spatial coordinates of individual records nor frequency of a species' occurrence are known.

FLORKART has proven to be of significant value for biogeographical analyses and the quality of its data has been validated in numerous studies, e.g., [39, 40]. FLORKART contains records at all taxonomic levels, including subspecies and aggregates of species. For this study, records were revised in order to map them to our taxa list. In detail, records of child taxa, i.e., subspecies, forms and varieties of species, were included and subsumed under their respective parent taxon. In result, our FLORKART dataset contains presence-absence data for the 2771 vascular plant taxa in our species list. On May 3rd and 4th 2017, we acquired a total of 6.59M records for these taxa across the 13k (quarter-)MTB tiles entirely covering Germany. We discarded records that were marked as 'questionable' or 'false' (15k records). The remaining data were collected during three time periods: before 1950, between 1950 and 1980, and 1980 until today. In those cases where FLORKART provides records for a coarse-grained tile as well as for sub-quadrants within the same tile, we always consider the newer and higher-resolution information. This leads to a total of 6,020,296 records in our dataset, with only 0.54% of those accounting for coarse-grained tiles and 0.9% accounting for data from before 1950. A median of 514 taxa occurs per grid cell, with the 10th percentile being 257 and the 90th percentile being 758 taxa. Figure 1 displays the spatial density of the records mapped to the area of Germany as well as coverage metrics of the FLORKART dataset.

Fig. 1 Characteristics of the FLORKART dataset: (a) spatial density of occurrence records per grid cell across all taxa; (b) average distance to nearest neighbor occurrence per taxon, average over all taxa marked by red line; (c) frequency distribution of occurrence records per grid cell

Point-based occurrence records

We use the Global Biodiversity Information Facility (GBIF) as the most prominent and comprehensive data source for querying point-based occurrence records for Germany. Occurrence denotes one observation record of a certain plant and contains information on the taxonomic description, geographic location, observation type, and often also the observation time and date. The GBIF web service aggregates occurrence records of numerous types, from historic herbarium specimens to citizen science projects, e.g., hobbyists sharing geo-tagged species photos. The data differs considerably from the grid-based records described above in that it represents presence-only records being largely non-curated and collected unsystematically at arbitrary locations.

We queried GBIF via the website's occurrences search interface, restricting records to the area of Germany and the biological kingdom of Plantae. All queries [41] were executed on August 23, 2017. The point-based occurrence records of interest for our study stem from 1324 datasets coming from 484 institutions with the largest contributor 'Naturgucker' providing 27% of the records. We sanitized the data and filtered out invalid geographical locations, i.e., missing or implausible coordinates as well as entries with abnormally poor spatial accuracy. We mapped the taxa in our list to the GBIF taxonomic backbone using the 'species.search' method of the GBIF API [42]. For every taxon, the query contained the accepted scientific name as well as synonyms, both including the author(s) describing the taxon. Approximate string matching was applied if the author naming was following a different convention, e.g., abbreviations.

In result, this process led to a total of 1,598,550 occurrence records for 2,640 out of the 2771 taxa of interest in our study. The records contain a median number of 83 observations per taxon, with a 10th percentile of 4 and a 90th percentile of 1,817 observations per taxon. 86% of these records include plausible timestamps, e.g., they do not use default dates like January 1st 1970, and are distributed as visualized in Fig. 2(b) and (e). While single records date back to the year 1768 (i.e., herbarium specimen), 99% of the records with plausible timestamp are from 1950 and later.

In order to better understand how the retrieved GBIF records are distributed across Germany, we calculated per taxon the average distance between each observation and its closest neighbor (see Fig. 2(c)). Lower values indicate a spatial clustering of records, while higher values show dispersion of records. For comparison, we computed the same metric for the grid-based FLORKART data (see Fig. 1(b)). The average closest neighbor distance across all taxa in the GBIF dataset is 21.9 km, while the corresponding value is only 17.9 km for the FLORKART dataset. The figures and metrics illustrate the irregular distribution of records and gaps between records across the whole study region.

We discretized record locations into a regular computational grid with each cell spanning 30" longitude × 18" latitude. This discretization was chosen to provide a resolution 100 times higher than FLORKART's quarter tiles and results in cells of ≈ 0.33 km² each. We study the impact of the computational grid's resolution in its own subsection below. Only 20% of the grid's cells are occupied by GBIF records with a median of 4 occurrences, the 10th percentile being 1 and the 90th percentile being 56 records. The record frequency per occupied cell is heavily unbalanced with 50% of all occurrence records being concentrated in merely 0.8% of the occupied cells (cp. Fig. 2(d)). Figure 2(a) visualizes occurrences' spatial density on a map of Germany with a circle depicting each record and its given accuracy and each colored pixel representing a computational grid cell. The map shows that even though records are sparse and irregularly distributed, they are spread across all parts of Germany. When classifying record locations in terms of land cover [43], 23% are on non-irrigated arable land, 16% on pastures, 15% in broad-leaved forests, 14% in coniferous forests, and 10% on discontinuous urban fabric.

Fig. 2 Characteristics of the GBIF dataset: (a) spatial density of occurrence records per grid cell across all taxa; (b) record distribution per month of observation; (c) average distance to nearest neighbor occurrence per taxon; (d) frequency distribution of occurrence records per grid cell; (e) record distribution per year of observation

Independent test dataset

For obtaining an independent test set of occurrence data, we used the image hosting and social media website Flickr [44], a platform where users can upload and share personal photographs. We selected this service specifically because the uploaded images show what people actually 'see' and are interested in. We argue that this will to a large extent correlate with plant species people are interested in identifying and recording during their daily life. We used the Flickr API's 'photos.search' method to identify geotagged images labeled with the scientific name or an accepted synonym of the 2771 taxa considered in our study. From the images' metadata we extracted the timestamp and the location of acquisition. This process resulted in 28,226 records for 1271 of the 2771 studied taxa. The summarized statistics are displayed in Fig. 3. In terms of geographical coverage across Germany, the test data is very sparse. Merely 0.69% of the computational grid cells as defined above are occupied having a median of 1 and a maximum of 1,127 records each. The number of records per occupied grid cell is biased, concentrated mainly around major urban areas and points of interest, but resembles that of GBIF (cp. Fig. 3(d) with Fig. 2(d)). Regarding land cover, most record locations (24%) are on discontinuous urban fabric, 19% on non-irrigated arable land, 12% on pastures, 9% in broad-leaved forests, and 9% in coniferous forests. Another indication of this dataset's highly scattered geographical locations is given by the average nearest neighbor distances (see Fig. 3(c)) showing that data records exist on average only every 128.2 km. For a graphical overview of occurrences' spatial density and the amount of geographical coverage see Fig. 3(a).

Fig. 3 Characteristics of the Flickr test data: (a) spatial density of occurrence records per grid cell across all taxa; (b) record distribution per month of observation; (c) average distance to nearest neighbor occurrence per taxon; (d) frequency distribution of occurrence records per grid cell; (e) record distribution per year of observation

Problem formalization and aggregation strategies

Given an observer's location $p \in P$ as geographic coordinates and date of observation $d$, we determine the candidate subset $T_{p,d} \subseteq T$ of all known taxa $T$ that is most likely to be encountered by the observer. We hypothesize that spatial and temporal distance to registered occurrence records affect an observer's chance to encounter the same taxa at their current location in the field. Therefore, we assign each taxon $t \in T_{p,d}$ a score $S_{t,p,d}$ reflecting its chance of being encountered at $p$ and $d$.

$$T_{p,d} = \{\, t_i \in T \mid S_{t_i,p,d} > 0 \,\}$$

The result will be a list of taxa, ranked based on scores. Hence, we denote a taxon's rank by $r$ and define the resulting ranked list of candidates $T_{p,d}$ as:

$$T_{p,d} = \{(t, r) : t \in T_{p,d},\ r \in \mathbb{N} : r \in [1, |T_{p,d}|]\},$$
$$\forall t_i, t_j, r_i, r_j : (t_i, r_i) \in T_{p,d} \wedge t_i = t_j \rightarrow (t_j, r_j) \notin T_{p,d},$$
$$\forall (t_i, r_i), (t_j, r_j) \in T_{p,d} : S_{t_i,p,d} \geq S_{t_j,p,d} \rightarrow r_i < r_j.$$

For our test region of Germany we study the quality of ranked candidate lists $T_{p,d}$ by evaluating them based on the test data introduced above. Test records $n = 1 \ldots N$ are represented as a tuple containing the location $p_n$, the observation date $d_n$ and the labeled taxon $t_n$. We let $T_n = T_{p_n,d_n}$ for all $(p_n, d_n, t_n)$ in our set of test records with $n$ representing the index of the test query.

Evaluation metrics

We aim to assess computed candidate subsets $T_n$ in terms of completeness, compactness, and efficiency of the ranking and therefore introduce the following five metrics.

(1) Average recall $R$ measures the ratio of correctly retrieved test records in relation to all test records and is computed as

$$R = \frac{1}{N} \sum_{n=1}^{N} R_n, \quad \text{with } R_n = \begin{cases} 1, & \text{if } t_n \in T_n \\ 0, & \text{if } t_n \notin T_n \end{cases} \qquad (1)$$

Average recall is not only computed for the whole retrieved list but also for subsets thereof, assessing completeness up to specific list positions. $R_k$ refers to the average recall up to rank $k$ and is computed by cutting off the list of results after the $k$-th position and calculating the average recall on the remaining sublist (cp. Eq. 1). We report $R_k$ for $k = \{20, 514\}$ with 20 items referring to a user-friendly shortlist of recommendations and 514 being the median number of taxa present per FLORKART grid tile, reflecting the average number of taxa occurring in a local region.

(2) Average list length $LL$ measures the average number of retrieved candidate taxa across all $N$ test records and is computed as

$$LL = \frac{1}{N} \sum_{n=1}^{N} |T_n|. \qquad (2)$$

(3) Average list reduction $LR$ measures across all $N$ test records the number of retrieved candidate taxa in $T_n$ in relation to the number of all known taxa $T$. We introduce this metric to better understand to what extent the identification problem can be simplified by reducing the number of potential taxa. Based on the total amount of taxa $|T|$ and the number of taxa retrieved with the $n$th test query $|T_n|$, $LR$ is computed as

$$LR = \frac{1}{N} \sum_{n=1}^{N} \frac{|T_n|}{|T|}. \qquad (3)$$

(4) Mean reciprocal rank $MRR$ measures the ranking quality of retrieved candidate lists for a set of test records. The reciprocal rank is the multiplicative inverse of rank $r_n$ of the correct taxon for the $n$th test query and $MRR$ is the average of reciprocal ranks for the whole test set of $N$ queries. A taxon's reciprocal rank equals 0 if it is not on the retrieved list $T_n$. $MRR$ is computed as:

$$MRR = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{r_n}, \quad \text{with } (t_n, r_n) \in T_n. \qquad (4)$$

(5) Median rank $M$ measures the rank which at least half of selected taxa are ranked higher than and therefore provides an indication of the results' compactness. Similar to $MRR$, it aims to judge the quality of the ranking and where in the ranked list the correct taxa appear after ranking. It is computed as

$$M = \min \left\{ s \in \mathbb{N} : \sum_{r=1}^{s} \sum_{n=1}^{N} \left| \{(t_n, r)\} \cap T_n \right| \geq \frac{1}{2} \sum_{r=1}^{|T|} \sum_{n=1}^{N} \left| \{(t_n, r)\} \cap T_n \right| \right\} \qquad (5)$$

We define five strategies for aggregating multiple grid tiles and records per taxon depending on their spatial and temporal characteristics.

Retrieval from grid-based presence-absence data

In a first set of experiments, we evaluate presence-absence data of the grid tile containing the test location $p \in P$ and, depending on a variable radius parameter, also those in its vicinity to compute a set of candidate taxa at a given test location.

$$S_{t_i,p,d} = \frac{1}{|P^{r}_{t_i,p}|} \sum_{p_j \in P^{r}_{t_i,p}} \mathrm{counts}(t_i, p_j). \qquad (7)$$

S2 Weighted relative frequency of occurrence records ranks taxa based on how often they occur within a radius with their proportion of contribution being reduced the farther away they occur from the center:

$$S_{t_i,p,d} = \frac{1}{|P^{r}_{t_i,p}|} \sum_{p_j \in P^{r}_{t_i,p}} \frac{1}{1 + \mathrm{dist}(p, p_j)} \, \mathrm{counts}(t_i, p_j). \qquad (8)$$

S3 Minimum spatial distance to records' tile centers ranks taxa within the sampling radius based on their
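The step from per-taxon scores to a ranked candidate list, as described in the problem formalization, can be sketched as follows. This is only an illustration: the function name `ranked_candidates`, the dictionary input, and the alphabetical tie-breaking are assumptions of this sketch, since the text does not state how equal scores are ordered.

```python
def ranked_candidates(scores):
    """Build a ranked candidate list from a {taxon: score} mapping.

    Only taxa with a score greater than zero are kept. Ranks start at 1
    and follow descending score. Ties are broken alphabetically, which
    is an assumption of this sketch.
    """
    positive = [(taxon, s) for taxon, s in scores.items() if s > 0]
    positive.sort(key=lambda item: (-item[1], item[0]))
    return [(taxon, rank)
            for rank, (taxon, _score) in enumerate(positive, start=1)]


if __name__ == "__main__":
    example = {"Urtica dioica": 0.9, "Bellis perennis": 0.7,
               "Taxus baccata": 0.0}
    print(ranked_candidates(example))
```

The score values in the example are made up; in the study they would come from one of the aggregation strategies.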
Since it is not clear how accurate and up-to-date closest spatial distance to the test location: the available data is, we study how sampling within a cir- min dist(p, p ) p∈P j cle around a test point with four increasing radii (1 km, t ,p S = 1 − .(9) t ,p,d 5 km, 10 km, and 20 km) in addition to sampling at the test max dist(p, p ) p∈P j t ,p point’s true location affects the quality of retrieved candi- S4 Average spatial distance to records’ tile centers ranks date taxa T . The hypothesis being that taxa may extend p,d taxa within the sampling radius based on each taxon’s their range over time and that in cases where a test point mean spatial distance to the test location: resides close to the border of a tile, its neighbor tile may be as relevant as the containing tile itself. We include addi- r dist(p, p ) p ∈P 1 j t ,p ¯ i tional tiles if their center location p ¯ ∈ P falls within the S = 1 − . (10) t ,p,d i r |P | max dist(p, p ) ¯ j t ,p p ∈P sampling radius. The subset P ⊆ P contains tiles’ center j i t ,p locations only. In order to obtain the set of taxa T , we query the grid p,d When considering an area rather than a single point, it tiles across all taxa at a test record’s location p and within may be necessary to aggregate presence records from mul- aradius r for obtaining the taxa set T . p,d tiple tiles. We select four distinct aggregation strategies to study their effect on the quality of retrieved candidate Retrieval from point-based taxon records taxa T .For each taxon t ∈ T,wecompute ascore p,d We evaluate estimation quality based on GBIF records S based on one of these strategies and sort the list T t ,p,d p,d using the same four aggregation strategies S1 . . . S4 that accordingly. 
These strategies either consider the relative we studied for grid-based presence-absence data and frequency of a taxon’s occurrences within those grid tiles additionally introduce a strategy S5, which considers tem- covered by the sampling circle of radius r or a normal- poraldistancebetween thedateofatest observationand ized Euclidean distance dist(p , p ) between the test point point-based occurrence records. and eligible tiles’ centers defined as those falling within the sampling circle. S5 Temporal distance to months with recorded We let P denote the set of locations within radius r occurrences ranks taxa based on Gaussian-weighted t ,p around p at which taxon t occurs average monthly score centered at the current/test record’s month: P = p ∈ P | counts(t , p )> 0 ∧ dist(p, p ) ≤ r . 12 i i i i t ,p S = countsInMonth(t , p , m) t ,p,d i j (6) i |P | t ,p r m=1 p ∈P t ,p 1 1 2 The function counts : T × P → R yields the number − (m−month(d)) × √ e . of taxon occurrences at a location p. The following four 2π strategies S1 . . . S4 aggregate the individual contributions (11) of occurrences in P in order to compute a rank for all t ,p where the function countsInMonth : T ×P ×N → R yields t ∈ T . p,d a taxon’s chance of occurring at a particular location dur- S1 Relative frequency of occurrence records ranks taxa ing a particular month and month : date → N provides based on how often they occur within a radius of tiles the month of an observation date. S5 is only applicable being sampled: for the 86% point-based occurrence records with valid Wittich et al. BMC Bioinformatics (2018) 19:190 Page 9 of 17 timestamp. Considering the granularity in which bloom- above and are interested in understanding whether the ing periods are usually specified, we discretize records combination of both data sources allows for a more com- observation date into either one or two out of twelve plete and precise estimation of a taxon’s distribution. 
monthly bins proportionally to observation day’s distance Figure 4 illustrates a possible configuration of a map to the middle of the month. We define the temporal dis- segment aggregating both data sources for one taxon. tance between a test record’s month of year m ∈ N : Occurrence records with different accuracies as well as m ∈[ 1, 12] and that taxa’s occurrences as the weighted grid-based presence data at different scales contribute to sum of a taxon’s monthly scores having the maximal an average value of how likely a taxon can be expected at weight centered around the current month and decreasing a user’s location and its surroundings. both ways. Although potentially being of high precision, GPS loca- Results tions always suffer from certain spatial inaccuracies, often We assess the quality of taxa recommendations by mea- provided as an additional parameter along with the loca- suring how accurately observations from the set of Flickr tion. Over 35% of our GBIF records provide this additional test data can be retrieved and report results of a series value characterizing their spatial accuracy. For this rea- experiments on grid-based presence-absence data, point- son and to mitigate the sparsity of GBIF point data, we based occurrence records, and a combination of both. In consider each point of a recorded observation as having addition, we elaborate on how we run the experiments an influence on its surroundings. We treat coordinates of computationally efficiently. Metrics reported throughout an occurrence record as center of a circle having a radius this section include average recall (R), average list length corresponding to its uncertainty with the expectation of (LL), average list reduction (LR), mean reciprocal rank a taxon’s encounter being highest at the center while lin- (MRR) and median rank (M) as defined in the previous early decreasing concentrically. For the remaining records section. 
without any indication of spatial accuracy we assume a default accuracy of 500 m reflecting the average accuracy Ranked retrieval from grid-based presence-absence data of GBIF records providing this information in our study. Table 1 summarizes the results of our first set of Similar to the process described before, we query all point- experiments retrieving ranked taxa lists from grid-based based records within a radius r of a test record’s location p presence-absence data. From top to bottom, the table to sample occurrence frequencies and times for obtaining shows retrieval results at the exact location and for the taxa set T . the four aggregation strategies S1 . . . S4. Per strategy we p,d aggregate presence-absence data at four radii 1 km, 5 km, Retrieval from combined grid- and point-based data 10 km, and 20 km. The columns of the table refer to our In a final set of experiments, we investigate estimation previously introduced evaluation metrics. quality based on merged grid-based presence-absence We observe a modest average recall of 82.31% when data and point-based taxa occurrence records. We apply retrieving test observations from the grid cell at the exact thesamefiveaggregation strategies S1... S5 introduced position of a test record using solely presence-absence 01 2 arcmin Fig. 4 Grid section for a single taxon including area and point occurrences with different extents and uncertainties, respectively. The circle shows the sampling radius around the test position (red cross) being queried. The opacity of a tile is proportional to the taxon’s likelihood of being encountered there Wittich et al. 
data. The recall increases up to 96.14% when aggregating data within radii of up to 20 km around a test location. R, LL, and LR depend only on the sampling radius and remain unaffected by the aggregation strategies S1 . . . S4.

Table 1 Results of ranked taxon retrieval solely using FLORKART grid-based presence-absence data sampled at the exact location and aggregated for increasing radii around Flickr test observations

Radius [km]   R [%]   R_20 [%]   R_514 [%]   MRR [%]     M     LL     LR
Retrieval at exact location
          0   82.31       3.38       64.13      1.11   307    680   4.54
S1: Relative frequency of occurrence records
          1   85.40       2.42       68.15      0.79   300    787   3.79
          5   92.35       4.62       74.94      1.52   237   1115   2.59
         10   94.47       4.78       74.42      1.35   234   1286   2.23
         20   96.14       5.65       72.39      1.81   237   1477   1.92
S2: Weighted relative frequency of occurrence records
          1   85.40       2.62       68.73      0.95   287    787   3.79
          5   92.35       4.36       74.80      1.56   240   1115   2.59
         10   94.47       4.74       74.70      1.55   232   1286   2.23
         20   96.14       5.71       73.88      1.78   233   1477   1.92
S3: Minimum spatial distance to records' tile centers
          1   85.40       4.00       63.28      1.14   330    787   3.79
          5   92.35       2.85       64.58      1.01   357   1115   2.59
         10   94.47       2.13       64.25      0.80   375   1286   2.23
         20   96.14       2.52       64.23      0.82   379   1477   1.92
S4: Average spatial distance to records' tile centers
          1   85.40       2.06       60.00      0.65   380    787   3.79
          5   92.35       0.46       52.91      0.37   470   1115   2.59
         10   94.47       0.68       46.32      0.37   520   1286   2.23
         20   96.14       0.81       37.00      0.37   615   1477   1.92

While R is noticeably high, meaning that an expected taxon likely appears somewhere on the retrieved list, its actual rank is rarely at the top, as indicated by low MRR values. The same result is indicated by low median ranks, e.g., in merely half of the test cases the expected taxon ranks higher than 234th place using S1 and a radius of 10 km. In general, a higher recall of a larger sampling radius is achieved at the cost of an extended candidate list, increasing from 680 taxa at the exact location to 1,477 taxa at a radius of 20 km (cp. Table 1). In consequence, we observe relatively poor ranking quality, illustrated by low values for $R_{20}$ and median ranks > 200 at all radii and across all aggregation strategies.

In terms of MRR, the methods relying on distances between test point and quadrant centers (S3 and S4) yield the poorest results. This can be attributed to a very small variety of unique distances, i.e., most taxa attaining the same score, which results from the comparatively coarse-grained FLORKART grid. The problem is less severe when relying on taxa frequency (S1 and S2). Since every FLORKART cell only documents the presence or absence of a particular taxon and not its frequency, these strategies are only applicable when the sampling radius spans multiple FLORKART cells. The weighted aggregation S2 additionally reduces the influence of records with increasing distance from the test location, which allows a finer gradation between center and neighborhood and thus more diverse score values. The effectiveness of this strategy is demonstrated by a 14.8% and 318.9% increase in MRR over S1 and S4, respectively, as well as an improvement of the median rank M by 288 positions over S4 when sampling at a radius of 10 km.

Ranked retrieval from point-based occurrence records

Table 2 summarizes the results of our second set of experiments on retrieving ranked taxa lists from point-based occurrence records. Overall, we observe considerably lower recall values compared to the previous set of experiments. At the exact location (r = 0 km), we achieve an average recall of 36.36%. However, with an increasing sampling radius this recall grows to 85.51% at r = 20 km.
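The metrics reported in Tables 1 and 2 can all be derived from two per-query quantities: the rank of the correct taxon (if it was retrieved) and the length of the retrieved list. The following Python sketch illustrates Eqs. 1 to 5 under that reading. It is not the authors' implementation; the function names, the input layout, and the choice to leave empty candidate lists out of the LR average are assumptions made for illustration.

```python
from typing import Dict, Optional, Sequence


def median_rank(found_ranks: Sequence[int]) -> Optional[int]:
    """Smallest rank s such that at least half of the retrieved
    correct taxa sit at rank <= s (cp. Eq. 5)."""
    if not found_ranks:
        return None
    ordered = sorted(found_ranks)
    return ordered[(len(ordered) + 1) // 2 - 1]


def evaluate(
    ranks: Sequence[Optional[int]],
    list_lengths: Sequence[int],
    n_taxa: int,
    k_values: Sequence[int] = (20, 514),
) -> Dict[str, Optional[float]]:
    """ranks[n] is the 1-based rank of the correct taxon for test query n,
    or None if it is missing from the list; list_lengths[n] is |T_n|."""
    n = len(ranks)
    found = [r for r in ranks if r is not None]
    # ASSUMPTION: queries with an empty candidate list are left out of LR,
    # because |T| / |T_n| is undefined for |T_n| = 0.
    non_empty = [length for length in list_lengths if length > 0]
    metrics: Dict[str, Optional[float]] = {
        "R": len(found) / n,                      # Eq. 1
        "LL": sum(list_lengths) / n,              # Eq. 2
        "LR": (sum(n_taxa / length for length in non_empty) / len(non_empty)
               if non_empty else None),           # Eq. 3
        "MRR": sum(1.0 / r for r in found) / n,   # Eq. 4, missing taxa add 0
        "M": median_rank(found),                  # Eq. 5
    }
    for k in k_values:
        # Recall after cutting every list off at position k.
        metrics[f"R_{k}"] = sum(1 for r in found if r <= k) / n
    return metrics
```

Calling `evaluate(ranks, list_lengths, n_taxa=2771)` with the ranks obtained for the Flickr test queries would yield one row of such a table, as fractions rather than percentages.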
Table 2 Results of ranked taxon retrieval solely using GBIF point-based occurrence records sampled at the exact location and aggregated for increasing radii around Flickr test observations

Radius [km]   R [%]   R_20 [%]   R_514 [%]   MRR [%]     M     LL       LR
S1: Relative frequency of occurrence records
          0   36.36      19.90       36.36      6.61    17     73   262.00
          1   43.40      16.43       43.40      5.06    36    142   218.72
          5   59.72      11.28       58.04      3.45    89    337    91.61
         10   73.15      12.05       69.45      3.41   111    504    16.60
         20   85.51      11.12       77.68      2.71   133    752     5.36
S2: Weighted relative frequency of occurrence records
          1   43.40      18.15       43.38      5.54    30    142   218.72
          5   59.72      14.31       58.54      4.30    70    337    91.61
         10   73.15      13.61       70.52      4.05    89    504    16.60
         20   85.51      14.98       79.73      3.77   108    752     5.36
S3: Minimum spatial distance to records' tile centers
          1   43.40      12.84       43.44      3.46    51    142   218.72
          5   59.72      14.87       58.46      4.12    66    337    91.61
         10   73.15      16.00       71.09      4.59    77    504    16.60
         20   85.51      16.46       80.62      4.54    92    752     5.36
S4: Average spatial distance to records' tile centers
          1   43.40      14.51       43.39      4.63    55    142   218.72
          5   59.72      12.91       58.50      3.99    76    337    91.61
         10   73.15      10.48       70.69      2.97   110    504    16.60
         20   85.51       9.68       78.83      2.83   136    752     5.36
S5: Temporal distance to months with recorded occurrences
          0   36.35      23.12       36.35      7.36    13     73   261.10
          1   43.39      19.81       43.39      5.81    24    141   218.97
          5   59.71      12.47       58.84      3.60    77    337    91.78
         10   73.15      11.21       69.95      3.08   108    503    16.67
         20   85.50       7.25       77.88      1.96   168    751     5.37

We evaluated five ranking strategies for the retrieved taxa lists based on frequency, spatial distance, and temporal distance of occurrences. At a radius of 0 km, aggregation strategies S1 and S5 evaluate the exact computational grid cell of 0.33 km² a test record falls into, producing highest MRR associated with lowest recall. The remaining strategies S2 . . . S4 consider spatial distance of records and can accordingly be applied only if the sampling radius spans multiple computational grid cells. Though yielding the same recall at respective radii, they differ in ranking quality as expressed by MRR and M. While S2 offers highest MRR up to 5 km, S3 improves for larger radii with results for S4 falling in between. Ranking based on temporal distance (S5) operates on the 86% GBIF records with an existing and valid observation time stamp alone. This reduced set of records explains the slightly differing figures in recall, list length, and list length reduction compared to S1. We found that MRR and median rank improve considerably when applying S5, making this strategy a promising option. Aggregating point-based records based on minimum spatial distance (S3) at a radius of 20 km was found to be the best performing strategy, yielding R = 85.51%, MRR = 4.54%, and M = 92.

Ranked retrieval from combined grid- and point-based data

Table 3 summarizes the results of our third set of experiments retrieving ranked taxa lists from a combination of grid-based presence-absence data and point-based occurrence records.

The combination of both data sources increases recall in the computed candidate lists for all sampling radii, e.g., at r = 20 km the individual recall of 96.14% (FLORKART) and 85.51% (GBIF) increase to 97.4% on the combined data.
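To make the spatial and temporal scoring more concrete, the sketch below implements the weighted relative frequency score S2 (Eq. 8) and the Gaussian-weighted monthly score S5 (Eq. 11) and ranks taxa by a linear combination of the two. The paper states only that linear combinations of the scores were studied, so the equal default weights, the per-location dictionary layout (`dist`, `count`, `monthly`), and all function names are assumptions made for illustration, not the authors' C++ implementation.

```python
import math
from typing import Dict, List, Tuple

# Normalisation constant of the unit-variance Gaussian used in Eq. 11.
GAUSS_NORM = 1.0 / math.sqrt(2.0 * math.pi)

# One entry per location within the sampling radius at which the taxon occurs.
# HYPOTHETICAL layout: "dist" is the distance to the test point, "count" the
# number of occurrences there, "monthly" the occurrences per month (12 values).
Entry = Dict[str, object]


def s2_score(entries: List[Entry]) -> float:
    """Weighted relative frequency: nearer locations contribute more (Eq. 8)."""
    if not entries:
        return 0.0
    return sum(e["count"] / (1.0 + e["dist"]) for e in entries) / len(entries)


def s5_score(entries: List[Entry], month: int) -> float:
    """Gaussian-weighted monthly counts centred at the query month (Eq. 11)."""
    if not entries:
        return 0.0
    total = 0.0
    for m in range(1, 13):
        weight = GAUSS_NORM * math.exp(-0.5 * (m - month) ** 2)
        total += weight * sum(e["monthly"][m - 1] for e in entries)
    return total / len(entries)


def rank_taxa(
    entries_by_taxon: Dict[str, List[Entry]],
    month: int,
    w_spatial: float = 0.5,   # ASSUMPTION: weights are not given in the text
    w_temporal: float = 0.5,
) -> List[Tuple[str, int]]:
    """Return (taxon, rank) pairs for all taxa with a positive combined score."""
    scores = {
        taxon: w_spatial * s2_score(entries) + w_temporal * s5_score(entries, month)
        for taxon, entries in entries_by_taxon.items()
    }
    ranked = sorted((t for t, s in scores.items() if s > 0), key=lambda t: -scores[t])
    return [(taxon, rank) for rank, taxon in enumerate(ranked, start=1)]
```

Because the two scores live on different scales, the weights effectively decide how much the observation month can reorder taxa that are equally frequent nearby; any real use would need to tune them against held-out data.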
BMC Bioinformatics (2018) 19:190 Page 12 of 17 Table 3 Results of ranked taxon retrieval using FLORKART presence-absence data in combination with GBIF point-based occurrence records sampled at the exact location and aggregated for increasing radii around Flickr test observations Radius [km] R [%] R [%] R [%] MRR [%] MLL LR 20 514 S1: Relative frequency of occurrence records 0 86.62 20.89 74.99 7.20 121 692 4.41 1 89.51 16.99 79.66 5.38 135 810 3.67 5 94.10 11.62 83.88 3.67 155 1142 2.54 10 95.98 12.19 83.03 3.55 160 1320 2.18 20 97.40 11.02 80.09 2.77 165 1525 1.86 S2: Weighted relative frequency of occurrence records 1 89.51 19.67 80.35 6.00 116 810 3.67 5 94.10 15.38 84.33 4.60 131 1142 2.54 10 95.98 15.16 84.38 4.28 127 1320 2.18 20 97.40 15.00 84.08 3.83 128 1525 1.86 S3: Minimum spatial distance to records’ tile centers 1 89.51 2.48 68.07 0.94 330 810 3.67 5 94.10 3.25 66.14 1.05 364 1142 2.54 10 95.98 1.90 65.72 0.84 378 1320 2.18 20 97.40 2.82 67.13 1.04 359 1525 1.86 S4: Average spatial distance to records’ tile centers 1 89.51 3.70 63.18 1.81 374 810 3.67 5 94.10 1.09 52.77 0.66 478 1142 2.54 10 95.98 0.76 45.51 0.43 529 1320 2.18 20 97.40 1.05 36.70 0.42 624 1525 1.86 S5: Temporal distance to months with recorded occurrences 0 36.35 23.15 36.35 7.37 13 73 261.10 1 43.39 19.86 43.39 5.76 25 141 218.97 5 59.71 12.52 58.82 3.60 77 337 91.78 10 73.15 11.06 69.95 3.04 108 503 16.67 20 85.50 7.22 77.87 1.98 167 751 5.37 S2+S5: Combined weighted relative frequency and temporal distance 0 86.62 23.98 75.78 8.85 133 692 4.41 1 89.51 22.09 79.65 7.51 119 810 3.67 5 94.10 17.92 84.49 5.69 118 1142 2.54 10 95.98 18.14 85.25 5.12 112 1320 2.18 20 97.40 17.14 85.52 4.61 115 1525 1.86 Even more beneficial is the combination in terms and M. For S2+S5, the 10th percentile rank is 521, the 90th of achieved ranking quality resulting in significantly percentile 8 and the median rank is 118. Figure 5 shows improved results. 
Improvements are, for example, the distribution of ranks for the correct taxon per test reflected in higher mean reciprocal rank (1.81% vs. 5.69%) record across all individual ranking strategies (S1 . . . S5) and improved median rank (237 vs. 118) (cp. Table 1,S1 and the combination of spatio-temporal ranking (S2+S5) at an aggregation radius of 5 km. The figure shows that the at 20 km with Table 3,S2+S5 at5km). In addition to evaluating the scoring methods by them- correct taxon is ranked more frequently near the begin- selves, we also studied linear combinations of those and ning of the list for S1, S2, S5, and S2+S5 and declining found weighted spatial frequency with temporal scoring towards the end. The combination of S2+S5 shows addi- (see S2+S5 in Table 3) to yield the highest impact on MRR tional benefits especially for the top ranks. S3 and S4 Wittich et al. BMC Bioinformatics (2018) 19:190 Page 13 of 17 0.1 0.01 S1 S2 S3 S4 0.1 S5 S2+S5 0.01 0 1 2 3 10 10 10 10 Rank Fig. 5 Relative and cumulative frequency per rank of correct taxon for recommending Flickr test records from FLORKART and GBIF datasets, using a search radius of 5 km and six different ranking strategies. The dashed vertical lines mark the median of each distribution suffer from more evenly distributed frequencies over most 90% occurrence records. Each figure in the table is an aver- ranks with a visible maximum around their respective age across the ten cross-validation runs. The results show median beyond the 350th rank. that recall R as well as R are well above 99% in all three We also wanted to assess the influence that a richer areas. High median ranks of 33 up to 17 and a R of 38% set of point-based occurrence records could have on our to 56% show the potential of predicting the sought-after result. Therefore, we selected the three sites of the Biodi- taxon near the very top of a recommendation list. 
versity Exploratories project [10]: (a) Schorfheide-Chorin, (b) Hainich-Dün and (c) Schwäbische Alb as test cases. Considerations on computational efficiency 2 2 The sites span areas from 422 km to 1300 km and have Apart from the influencing factors presented above, the been intensively investigated for plant species occurrences quality of the taxa list depends on an actual implemen- during systematic observations performed since 2006. tation. One important consideration is the resolution The data is available on GBIF. However, our Flickr test of the computational grid used for binning occur- observations proved to be very sparse for these regions rence records within close distance. A trade-off between with merely 13 records in the area of (a), 113 at (b), and required resources in terms of time and space and poten- 15 at (c). Given the very rich set of GBIF observations, tial for improving evaluation metrics has to be made. we decided to perform a 10-fold cross-validation using We therefore varied the parameter of computational grid 10% randomly selected GBIF occurrence records from the resolution while utilizing the best performing combined three areas (N = 76, 696; N = 101, 504; N = 104, 968) aggregation strategy S2+S5 with a sampling radius of a b c as test set and only the remaining 90% as occurrence r = 10 km on joint FLORKART and GBIF data. Our records. Table 4 reports results for the best performing implementation in C++ uses OpenMP to optimize for aggregation method yet (S2+S5) and the combined taxa parallel execution where possible and was run on a state- information consisting of presence-absence data and the of-the-art 10-core, 128GB RAM workstation. 
Resolution, Table 4 Results of ranked taxon retrieval in selected regions using combined using FLORKART areal data with 10-fold cross-validation on GBIF point data Region R [%] R [%] R [%] MRR [%] MLL LR 20 514 (a) Schorfheide-Chorin 99.95 56.39 99.86 17.42 17 943 2.95 (b) Hainich-Dün 99.72 48.16 99.59 13.08 22 1058 2.65 (c) Schwäbische Alb 99.95 38.03 99.83 10.47 33 935 2.98 Cumulative Relative frequency [%] frequency [%] Wittich et al. BMC Bioinformatics (2018) 19:190 Page 14 of 17 expressed in relation to the quarter MTB tiles originally binary data without any notion of abundance. Using solely used to record presence-absence data, gradually increases presence-absence data means that a rarely observed taxon from top to bottom in Table 5. Results show R remain- will be ranked exactly the same as another, potentially very ing around 96%, while R and R increase slightly and common one that occurs within the same grid tile. 20 514 themedianrankimprovesupto28placesatfiner resolu- tions. We suspect that GBIF data is too sparse for a finer Point-based occurrence records resolution to have a more pronounced impact. The dis- GBIF point-based occurrence records are spatially sparse cretization also introduces rounding errors which distort and irregularly spread across the study region. Contrary to the results. Given the best tradeoff between R and M,we the presence-absence data, they have not been systemati- settled on a 0.33 km tile size being of 100 times finer cally sampled. Accordingly, we observe considerably lower granularity than FLORKART quarter tiles. This granu- average recall at the location of a test record. Using a larity provides the lowest median rank of 114 and an larger sampling radius leads to substantially higher recall. overall recall of 95.98%, it has been used for all other At the largest evaluated radius of 20 km, we achieve a computations in this paper. recall of 86% and an average candidate list length of 752 taxa. 
This list length is comparable to that computed Discussion based on the systematically sampled FLORKART data Grid-based presence-absence data at comparable recall, i.e., 787 at 85%. This result raises Noticeably, recall does not reach 100% using grid- expectations towards future use of GBIF data with its con- based FLORKART presence-absence data, but shows an tinuously increasing number of records. GBIF data offers increase when sampling a larger radius around the test an insight that presence-absence data do not provide. location. While this may indicate that taxa extended Multiple records of the same taxon in close proximity can their range since they were observed for FLORKART, be aggregated into an observation frequency allowing us it mainly suggests that our test data, being more rep- to estimate which taxa a user would more likely try to resentative of observations an interested hobbyist rather identify. Using this information, we observe a substantially than a botanist may acquire in the field, are not accu- higher mean reciprocal rank and an improved median rately captured by FLORKART information alone. Flickr rank across all evaluated aggregation strategies S1 . . . S5. test records come from a multitude of users and also We found the minimum spatial distance S3 between a consist of cultivated plants observed in urban environ- test record and existing GBIF records to yield the best ments, e.g., city parks and (botany) gardens. Accordingly, ranking results. the ten taxa most frequently failing correct prediction include ornamental and garden plants, such as Narcissus Combined grid- and point-based data pseudonarcissus (Easter Lily), Helleborus niger (Christmas Occurrence records contributed to GBIF via citizen Rose), Eranthis hyemalis (Winter Aconite), Helianthus science projects are not limited to wildlife plant observa- annuus (Common Sunflower), and Leucanthemum vul- tions. 
Therefore, using both data sources in combination gare (Common Daisy) as well as cultivated and medicinal mitigates the missing predictions of taxa that are hard plants, such as Brassica napus (Rapeseed), Cornus mas to estimate based on wildlife presence-absence data (Cornelian Cherry), Eschscholzia californica (California alone. We found that combining data sources yields poppy), and Prunus cerasifera (Cherry Plum). We should the highest recall across all experiments with a max- therefore seek to include taxa whose presence is not cap- imum of 97.4% at a sampling radius of r = 20 km. tured in wildlife presence-absence data. In addition to the This result demonstrates that the different data mediocre retrieval performance, we also observe a rel- sources are in fact complementary. Taxa that gain the atively poor ranking quality as a direct result of using largest absolute improvement by combining data are Table 5 Influence of grid resolution on evaluation metrics for S2+S5 and r = 10 km ×Quarter Avg. Area Run- RAM RR R MRR M LL LR 20 514 MTB tile [ km ] time [GB] [%] [%] [%] [%] 4 131.49 1.0× 0.5 96.45 16.14 84.00 4.92 140 1,349 2.12 1 32.87 1.1× 0.7 95.79 16.60 84.91 5.36 126 1,285 2.24 1/16 2.05 4.9× 5.7 96.20 17.85 85.13 5.36 114 1,331 2.16 1/64 0.51 15.4× 21.0 95.93 18.24 85.26 5.21 116 1,327 2.17 1/100 0.33 20.5× 33.2 95.98 18.19 85.24 5.14 112 1,320 2.18 1/144 0.23 29.6× 47.0 95.97 18.22 85.23 5.04 115 1,323 2.17 Wittich et al. BMC Bioinformatics (2018) 19:190 Page 15 of 17 Leucanthemum vulgare, Prunus cerasifera, Narcissus combine other data sources and to possibly increase res- pseudonarcissus, Eranthis hyemalis,Cornus mas, Helle- olution and precision of our estimations. To rule out the borus niger,and Brassica napus. 
Although recall improves when the data sources are combined, it still does not reach 100%, i.e., the retrieved taxa lists remain incomplete with respect to the test observations obtained from Flickr. This is due in part to some locations and taxa that yield exceedingly low recall, i.e., false negatives when evaluated on the test data. False negatives occur predominantly in urban land cover types [43], i.e., discontinuous urban fabric (32%) and sport and leisure facilities (13%). Taking a closer look at the results of S2+S5 at r = 5 km, the average recall is only 94.10% because 345 individual taxa are not retrieved, which accounts for the missing 5.90%. Among the top 66% of these 345 taxa, 90.7% are crop and garden plants. The top three are Brassica napus, Narcissus pseudonarcissus, and Cornus mas. These three taxa alone account for 13% of the missing recall.

In terms of candidate list ranking, we observed the best results when combining spatially weighted occurrence frequencies (S2) and temporal distance (S5), as shown by consistently highest MRR values. Improved ranking allows for shorter candidate lists, which is supported, for instance, by R reaching a plateau around 84.5% at r = 5 km, indicating a high chance of including the correct taxon before the 514th rank. An average list length of 1,142 at that distance shows that one would need to consider only 41% of all taxa of interest in Germany at a given location. Depending on the intended use case, a compromise between recall and mean reciprocal rank has to be made. For a list that is as complete as possible, one would sample a larger area, whereas a greater list length reduction can be achieved by sampling smaller regions.

An additional evaluation only at the three Biodiversity Exploratory sites yielded recall close to 100% and a remarkable 56% chance of the correct taxon being among the top 20 positions of a retrieved list. This result is very promising and shows how results can be improved with more point-based observation records in the future.

Limitations
On average, our recommended list contains 1,142 taxa when using a sampling radius of 5 km and the S2+S5 strategy on combined observation data, corresponding to a list reduction of 2.54. Despite being substantially reduced, the list is still long, prompting us to ask whether the retrieved length is plausible. Studies [11, 12] recording species richness with respect to land cover found a total of 623 and 546 vascular plant species on grassland and forest plots, respectively. Since we do not consider land cover types in our study and base our estimations on FLORKART data with a maximal resolution of 30 km and a median number of 514 taxa per tile, we consider the resulting list lengths plausible. It is a future exercise to [...] possibility of our own discretization having an adverse effect on data quality, we evaluated results across multiple resolutions as one aspect of our study.

Although high, recall does not reach 100% in our experiments. One possible explanation is insufficient data quality, since our datasets originate from manual acquisition processes. Revising maps with an extent such as that of FLORKART is an ongoing process that can never be expected to be complete. The range of species is highly dynamic as a consequence of, e.g., climatic differences and changes in land use. Some observations date back several decades, while even the more recent ones originate from mapping projects carried out in at least 47 federal project regions. GBIF’s observation records have been collected in an even more irregular manner, e.g., including citizen-science projects. We were able to mitigate some problems by analyzing data quality and eliminating erroneous records based on a set of heuristics (e.g., implausible dates and locations).

We purposely chose Flickr observations as test data since they reflect potential users and resemble a use case in which a taxon recommendation system could be applied. For instance, some test records were taken in urban environments (cp. Fig. 3), such as city parks, botanical gardens, and backyards. However, the data is neither curated nor verified by experts and is therefore expected to contain errors, although verification of user-provided tags through image classification may yield improvements. Flickr records may be imprecise in the labeled taxa as well as in the recorded location. In extreme cases, images may not have been taken at the place of the original taxon occurrence, e.g., images of Abies nordmanniana could show a Christmas tree in a living room. On the upside, this provides a chance of seeing results evaluated under a worst-case scenario. By conducting a cross-validation with GBIF records, we were able to show that our underlying method can yield results of much higher quality when operating on a richer and more fine-grained dataset.

Conclusions
Recommending a list of plant taxa most likely to be observed at a given geographical location and time is useful for species identification as well as biodiversity research. We studied achievable recommendation quality based on two fundamental types of information, individually and in combination: binary presence-absence data and individually collected occurrence records. Furthermore, we aggregated data with increasing sampling radii around test locations and according to five formally defined aggregation strategies. Additionally, we investigated the influence of data discretization granularity on recommendation quality as well as on computational efficiency.
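The two measures used to summarise recommendation quality in this study, recall and mean reciprocal rank (MRR), can be illustrated with a minimal Python sketch. It is not taken from the project's source code: the function name `recall_and_mrr`, the data layout, and the toy taxa lists are hypothetical, and counting a test record whose taxon is missing from the list as a reciprocal rank of zero is an assumption (one common convention) rather than the definition confirmed by the paper.

```python
from typing import Sequence, Tuple


def recall_and_mrr(
    ranked_lists: Sequence[Sequence[str]],
    true_taxa: Sequence[str],
) -> Tuple[float, float]:
    """Return (recall, MRR) over a set of test observations.

    ranked_lists[i] is the recommended taxa list for test observation i,
    best candidate first; true_taxa[i] is the taxon actually observed.
    Assumption: a taxon missing from the list contributes 0 to the MRR.
    """
    if len(ranked_lists) != len(true_taxa):
        raise ValueError("need exactly one ranked list per test observation")
    if not true_taxa:
        raise ValueError("need at least one test observation")

    hits = 0
    reciprocal_rank_sum = 0.0
    for candidates, truth in zip(ranked_lists, true_taxa):
        candidates = list(candidates)
        if truth in candidates:
            rank = candidates.index(truth) + 1  # ranks are 1-based
            hits += 1
            reciprocal_rank_sum += 1.0 / rank

    n = len(true_taxa)
    return hits / n, reciprocal_rank_sum / n


if __name__ == "__main__":
    # Toy data for illustration only, not results from the study.
    lists = [
        ["Bellis perennis", "Taraxacum officinale"],
        ["Urtica dioica", "Bellis perennis", "Cornus mas"],
        ["Urtica dioica"],
    ]
    truths = ["Bellis perennis", "Cornus mas", "Brassica napus"]
    recall, mrr = recall_and_mrr(lists, truths)
    print(f"recall = {recall:.3f}, MRR = {mrr:.3f}")
```

Under these assumptions, sampling a larger area tends to lengthen the candidate lists, which can raise recall while pushing the correct taxon further down the list; this is the compromise between the two measures noted in the discussion.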
When relying solely on presence-absence data, the current state-of-the-art when looking for taxa that occur at a certain location, we managed to retrieve merely 82.31% of the test records, recommending the correct one at the 307th place in the list on average. By combining both data sources, increasing the sampling radius, and using a sophisticated aggregation strategy, we were able to retrieve 95.98% of the test records, recommending the correct one on average at the 112th place in the list. When focusing on regions heavily sampled in terms of occurrence records, we even retrieved more than 99% of the test records’ taxa with the sought-after one ranking on average at the 24th place. In conclusion, we found that both studied data sources are highly complementary for use in a recommendation system. We demonstrated that such a system can be highly efficient in reducing the search space for species identification tasks, with on average only 41% of all taxa needing to be considered at a given location. We also demonstrated that with the ongoing growth of species occurrence records in repositories like GBIF these results will constantly improve even further.

Acknowledgements
We acknowledge support for the Article Processing Charge by the Thuringian Ministry for Economic Affairs, Science and Digital Society and the Open Access Publication Fund of the Technische Universität Ilmenau.

Funding
We are funded by the German Ministry of Education and Research (BMBF) grants: 01LC1319A and 01LC1319B; the German Federal Ministry for the Environment, Nature Conservation, Building and Nuclear Safety (BMUB) grant: 3514 685C19; and the Stiftung Naturschutz Thüringen (SNT) grant: SNT-082-248-03/2014.

Availability of data and materials
The list of Flickr photo IDs comprising our test data set as well as the project’s source code are available at https://sites.google.com/site/specrecbmc. The GBIF occurrence data set is available at https://doi.org/10.15468/dl.5zmlxt.

Authors’ contributions
Funding acquisition: PM, JW; experiment design: MS, HCW, JW, PM, MR; data analysis: HCW, MS; data visualization: HCW; writing manuscript: HCW, MS, PM, JW, MR; all authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details
Institute for Computer and Systems Engineering, Technische Universität Ilmenau, Helmholtzplatz 5, 98693 Ilmenau, Germany. Department Biogeochemical Integration, Max-Planck-Institute for Biogeochemistry, Hans-Knöll-Str. 10, 07745 Jena, Germany.

Received: 15 December 2017 Accepted: 14 May 2018

References
1. Elphick CS. How you count counts: the importance of methods research in applied ecology. J Appl Ecol. 2008;45(5):1313–20.
2. Brach AR, Boufford DE. Why are we still producing paper floras? Ann Mo Bot Gard. 2011;98(3):297–300.
3. Farnsworth EJ, Chu M, Kress WJ, Neill AK, Best JH, Pickering J, Stevenson RD, Courtney GW, VanDyk JK, Ellison AM. Next-generation field guides. BioScience. 2013;63(11):891–9.
4. Austen GE, Bindemann M, Griffiths RA, Roberts DL. Species identification by experts and non-experts: comparing images from field guides. Sci Rep. 2016;6.
5. Ceballos G, Ehrlich PR, Barnosky AD, García A, Pringle RM, Palmer TM. Accelerated modern human–induced species losses: Entering the sixth mass extinction. Sci Adv. 2015;1(5). https://doi.org/10.1126/sciadv.1400253. http://advances.sciencemag.org/content/1/5/e1400253.full.pdf.
6. Hopkins G, Freckleton R. Declines in the numbers of amateur and professional taxonomists: implications for conservation. In: Animal Conservation Forum. Cambridge University Press; 2002. p. 245–9.
7. Wisskirchen R, Haeupler H. Standardliste der Farn- und Blütenpflanzen Deutschlands. Stuttgart: Eugen Ulmer; 1998.
8. Preston CD, Pearman D, Dines TD, et al. New Atlas of the British & Irish Flora. Oxford: Oxford University Press; 2002.
9. Flora of North America Editorial Committee. Flora of North America: North of Mexico. New York and Oxford: Oxford University Press; 1993.
10. Fischer M, Bossdorf O, Gockel S, Hänsel F, Hemp A, Hessenmöller D, Korte G, Nieschulze J, Pfeiffer S, Prati D, Renner S, Schöning I, Schumacher U, Wells K, Buscot F, Kalko EKV, Linsenmair KE, Schulze E-D, Weisser WW. Implementing large-scale and long-term functional biodiversity research: The biodiversity exploratories. Basic Appl Ecol. 2010;11(6):473–85. https://doi.org/10.1016/j.baae.2010.07.009.
11. Socher SA, Prati D, Boch S, Müller J, Baumbach H, Gockel S, Hemp A, Schöning I, Wells K, Buscot F, Kalko EKV, Linsenmair KE, Schulze E-D, Weisser WW, Fischer M. Interacting effects of fertilization, mowing and grazing on plant species diversity of 1500 grasslands in Germany differ between regions. Basic Appl Ecol. 2013;14(2):126–36. https://doi.org/10.1016/j.baae.2012.12.003.
12. Boch S, Prati D, Müller J, Socher S, Baumbach H, Buscot F, Gockel S, Hemp A, Hessenmöller D, Kalko EKV, Linsenmair KE, Pfeiffer S, Pommer U, Schöning I, Schulze E-D, Seilwinder C, Weisser WW, Wells K, Fischer M. High plant species richness indicates management-related disturbances rather than the conservation status of forests. Basic Appl Ecol. 2013;14(6):496–505. https://doi.org/10.1016/j.baae.2013.06.001.
13. Wäldchen J, Rzanny M, Seeland M, Mäder P. Automated plant species identification – trends and future directions. PLoS Comput Biol. 2018;14(4):1005993. https://doi.org/10.1371/journal.pcbi.1005993.
14. GBIF: The Global Biodiversity Information Facility. What is GBIF? [12th October 2017]. 2017. Available from http://www.gbif.org/what-is-gbif.
15. Elith J, Leathwick JR. Species Distribution Models: Ecological Explanation and Prediction Across Space and Time. Annu Rev Ecol Evol Syst. 2009;40:677–97. https://doi.org/10.1146/annurev.ecolsys.110308.12015.
16. Cassini MH. Ecological principles of species distribution models: the habitat matching rule. J Biogeogr. 2011;38(11):2057–65. https://doi.org/10.1111/j.1365-2699.2011.02552.x.
17. Beck J, Böller M, Erhardt A, Schwanghart W. Spatial bias in the GBIF database and its effect on modeling species’ geographic distributions. Ecol Inform. 2014;19:10–15. https://doi.org/10.1016/j.ecoinf.2013.11.002.
18. Hernandez PA, Graham CH, Master LL, Albert DL. The effect of sample size and species characteristics on performance of different species distribution modeling methods. Ecography. 2006;29(5):773–85. https://doi.org/10.1111/j.0906-7590.2006.04700.x.
19. Wisz MS, Hijmans RJ, Li J, Peterson AT, Graham CH, Guisan A, Group NPSDW. Effects of sample size on the performance of species distribution models. Divers Distrib. 2008;14(5):763–73. https://doi.org/10.1111/j.1472-4642.2008.00482.x.
20. Graham CH, Ferrier S, Huettman F, Moritz C, Peterson AT. New developments in museum-based informatics and applications in biodiversity analysis. Trends Ecol Evol. 2004;19(9):497–503.
21. Araújo MB, Guisan A. Five (or so) challenges for species distribution modelling. J Biogeogr. 2006;33(10):1677–88. https://doi.org/10.1111/j.1365-2699.2006.01584.x.
22. Jiménez-Valverde A, Lobo JM, Hortal J. Not as good as they seem: the importance of concepts in species distribution modelling. Divers Distrib. 2008;14(6):885–90. https://doi.org/10.1111/j.1472-4642.2008.00496.x.
23. Elith J, Graham CH, Anderson RP, Dudík M, Ferrier S, Guisan A, Hijmans RJ, Huettmann F, Leathwick JR, Lehmann A, Li J, Lohmann LG, Loiselle BA, Manion G, Moritz C, Nakamura M, Nakazawa Y, Overton JM, Townsend Peterson A, Phillips SJ, Richardson K, Scachetti-Pereira R, Schapire RE, Soberón J, Williams S, Wisz MS, Zimmermann NE. Novel methods improve prediction of species’ distributions from occurrence data. Ecography. 2006;29(2):129–51. https://doi.org/10.1111/j.2006.0906-7590.04596.x.
24. Jetz W, McPherson JM, Guralnick RP. Integrating biodiversity distribution knowledge: toward a global map of life. Trends Ecol Evol. 2012;27(3):151–9. https://doi.org/10.1016/j.tree.2011.09.007.
25. Goldsmith GR, Morueta-Holme N, Sandel B, Fitz ED, Fitz SD, Boyle B, Casler N, Engemann K, Jørgensen PM, Kraft NJB, McGill B, Peet RK, Piel WH, Spencer N, Svenning J-C, Thiers BM, Violle C, Wiser SK, Enquist BJ. Plant-o-matic: a dynamic and mobile guide to all plants of the Americas. Methods Ecol Evol. 2016;7(8):960–5. https://doi.org/10.1111/2041-210X.
26. Phillips SJ, Dudík M. Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography. 2008;31(2):161–75. https://doi.org/10.1111/j.0906-7590.2008.5203.x.
27. Wäldchen J, Mäder P. Plant species identification using computer vision techniques: A systematic literature review. Archiv Comput Methods Eng. 2017;1–37. https://doi.org/10.1007/s11831-016-9206-z.
28. Seeland M, Rzanny M, Alaqraa N, Wäldchen J, Mäder P. Plant species classification using flower images – a comparative study of local feature representations. PLOS ONE. 2017;12(2):1–29. https://doi.org/10.1371/journal.pone.0170629.
29. Rzanny M, Seeland M, Wäldchen J, Mäder P. Acquiring and preprocessing leaf images for automated plant identification: understanding the tradeoff between effort and information gain. Plant Methods. 2017;13(1):97. https://doi.org/10.1186/s13007-017-0245-8.
30. Hofmann M, Seeland M, Mäder P. Efficiently annotating object images with absolute size information using mobile devices. Int J Comput Vis. 2018. https://doi.org/10.1007/s11263-018-1093-3.
31. Berg T, Liu J, Lee SW, Alexander ML, Jacobs DW, Belhumeur PN. Birdsnap: Large-scale fine-grained visual categorization of birds. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE; 2014. p. 2019–26. https://doi.org/10.1109/CVPR.2014.259.
32. Tang K, Paluri M, Fei-Fei L, Fergus R, Bourdev L. Improving image classification with location context. In: 2015 IEEE International Conference on Computer Vision (ICCV). Boston: IEEE; 2015. p. 1008–16. https://doi.org/10.1109/ICCV.2015.121.
33. Barry S, Elith J. Error and uncertainty in habitat models. J Appl Ecol. 2006;43(3):413–23. https://doi.org/10.1111/j.1365-2664.2006.01136.x.
34. Chandler M, See L, Copas K, Bonde AMZ, López BC, Danielsen F, Legind JK, Masinde S, Miller-Rushing AJ, Newman G, Rosemartin A, Turak E. Contribution of citizen science towards international biodiversity monitoring. Biol Conserv. 2016. https://doi.org/10.1016/j.biocon.2016.09.
35. EDIT Platform for Cybertaxonomy. http://api.cybertaxonomy.org/rl_standardliste. Accessed 3 May 2017.
36. Müller F, Ritz CM, Welk E, Wesche K. Rothmaler-Exkursionsflora Von Deutschland: Gefäßpflanzen: Kritischer Ergänzungsband. Berlin, Heidelberg: Springer; 2016.
37. Netzwerk Phytodiversität Deutschland und Bundesamt für Naturschutz. Verbreitungsatlas der Farn- und Blütenpflanzen Deutschlands, LV-Buch. Münster: Landwirtschaftverlag; 2013.
38. FloraWeb. Daten und Informationen zu Wildpflanzen und zur Vegetation Deutschlands. http://www.floraweb.de. Accessed 3 May 2017.
39. Kühn I, Brandl R, Klotz S. The flora of German cities is naturally species rich. Evol Ecol Res. 2004;6(5):749–64.
40. Kühn I, Bierman SM, Durka W, Klotz S. Relating geographical variation in pollination types to environmental and spatial factors using novel statistical methods. New Phytologist. 2006;172(1):127–39.
41. GBIF.org (23rd August 2017) GBIF Occurrence Download. https://doi.org/10.15468/dl.5zmlxt. Accessed 23 Aug 2017.
42. GBIF RESTful JSON-based API. http://api.gbif.org/v1. Accessed 3 May 2017.
43. Bossard M, Feranec J, Otahel J. CORINE land cover technical guide - Addendum 2000, Technical report No 40. Copenhagen: European Environment Agency; 2000.
44. Flickr Photo/video Hosting Service. https://www.flickr.com. Accessed 4 May 2017.

Journal

BMC Bioinformatics, Springer Journals

Published: May 30, 2018
