1. Introduction
Frequent and accurate assessments of vegetation are critical for managing forest biomass, designing effective mitigation strategies in anticipation of wildland fires, and managing vegetation in the wildland urban interface (WUI) [1,2,3,4]. Moreover, quality information about forest structure and biomass helps analysts develop smarter logging strategies, predict economic yield, and analyze the effects of vegetation in socio-technical systems. Of all the common geographic information systems (GIS) data layers, vegetation is arguably the most dynamic due to its constantly changing shape and density, a trend that appears to be accelerating with climate change [5]. An accurate model of vegetation height, which we refer to as a vegetation surface model (VSM), provides important insights for environmental planning, risk management, and economic analysis.
Current methods of remote sensing for detailed vegetation information include manual ground surveys, aerial photogrammetry, synthetic aperture radar (SAR), and light detection and ranging (LiDAR) [6,7,8,9]. While these methods are widely used, they are all subject to multiple limitations. Ground surveys are time-consuming and slow, while aerial photogrammetry requires premeditated collection of multiple images at specific angles, which is not only time-consuming but also difficult to scale across large swaths of land [10]. X-band SAR has been used to generate VSMs using radio waves, although the process requires special antennas installed on aircraft or spacecraft. The accuracy and resolution of SAR depend heavily on the relationship between stem biomass and characteristics of the vegetation; this relationship is unique to specific vegetation species, preventing SAR techniques from being truly general [8,11,12,13,14]. Both airborne and terrestrial LiDAR actively measure and record complex three-dimensional (3D) vegetation structure, from which an accurate VSM can be generated from the resulting point cloud (Figure 1a). LiDAR is the current state of the art for VSMs, although it has the trade-off of being limited to a range of hundreds of meters. This relatively short range requires low-altitude aircraft and imposes restrictions similar to those of photogrammetry in terms of scalability and recurrence [15,16]. Although the cost of LiDAR scanners is bound to decrease in the future, collection will remain time-consuming and costly, continuing to limit the recurrence of widespread airborne LiDAR. Furthermore, LiDAR data contain artifacts from bird strikes, power lines, and water-body absorption, which do not represent the true landscape. While algorithms exist to mitigate known sources of error, post-processing can introduce other sources of noise into the LiDAR point cloud. Only a fraction of the continental United States has been scanned by LiDAR due to the immense cost and collection burden, and most areas are scanned only once with no projected re-scan rate [17].
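A VSM derived from LiDAR is commonly computed as a normalized digital surface model (nDSM): the bare-earth DEM subtracted from the first-return surface. The following NumPy sketch illustrates that step (the example arrays and the clamp-at-zero choice are our illustration, not the exact pipeline):

```python
import numpy as np

def ndsm_from_lidar(dsm: np.ndarray, dem: np.ndarray) -> np.ndarray:
    """Vegetation height as first-return surface minus bare earth.

    dsm: first-return digital surface model (meters)
    dem: bare-earth digital elevation model (meters), on the same grid
    """
    heights = dsm - dem
    # Negative heights are artifacts (e.g., water-body absorption); clamp to ground.
    return np.clip(heights, 0.0, None)

dsm = np.array([[10.0, 12.5], [11.0, 10.2]])
dem = np.array([[10.0, 10.0], [10.5, 10.5]])
print(ndsm_from_lidar(dsm, dem))  # [[0.  2.5] [0.5 0. ]]
```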
These limitations, particularly the low recurrence of accurate VSMs, have restricted the potential for applications. With the rise of deep machine learning, we postulate that new intelligent methods of remote sensing can overcome current time and cost limitations. We propose extracting detailed information from frequent and relatively inexpensive sources of data, which will further broaden the scope of applications that require accurate and frequent remote sensing [18].
Multispectral aerial imagery is relatively inexpensive and well suited for recording two-dimensional (2D) data of the Earth's surface at different spatial resolutions (n-m resolution maps to an n × n m square on the ground). Satellites and planes have captured images of the Earth for decades, and today companies like DigitalGlobe [19] and Planet [20] collect high-resolution imagery of Earth with high recurrence. In addition, elevation layers such as digital elevation models (DEMs) are publicly available at various spatial resolutions through the United States Geological Survey (USGS) [21] and other organizations worldwide. We refer to DEMs and other raster layers derived from elevation analysis with the term terrain data.
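Terrain layers such as slope and aspect can be derived from a DEM with finite differences. As a minimal sketch, assuming a north-up raster with square cells (the convention shown is one common choice, not necessarily the USGS implementation):

```python
import numpy as np

def slope_aspect(dem: np.ndarray, cell_size: float = 1.0):
    """Slope (degrees) and aspect (compass degrees) from a north-up DEM."""
    # Elevation change per meter along rows (southward) and columns (eastward).
    dz_dy, dz_dx = np.gradient(dem, cell_size)
    slope = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
    # Aspect: compass bearing of steepest descent (atan2 of east and north parts).
    aspect = np.degrees(np.arctan2(-dz_dx, dz_dy)) % 360.0
    return slope, aspect

# A plane rising 1 m per 1-m cell toward the east: 45-degree slope, west-facing.
dem = np.tile(np.arange(4.0), (4, 1))
slope, aspect = slope_aspect(dem)
print(slope[1, 1], aspect[1, 1])  # slope ~45 degrees, aspect ~270 degrees
```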
Previous work has explored the possibility of extracting a VSM from low-resolution multispectral images (Figure 1c) [22,23,24]; however, current practical applications often require higher-resolution data. For example, wildland fire suppression is most effective when direct attack methods enable firefighters to work close to the fire perimeter. Current state-of-the-art fire models use 30-m spatial resolution imagery to predict fire spread [25,26], causing a spatial disconnect between models and fire-fighting strategies on the ground. Furthermore, since current methods have limited range and recurrence, up-to-date information would require premeditated scanning shortly before the fire, when smoke occlusion is not yet an issue. This would require accurately predicting where fires will happen, which is not feasible. In contrast, highly recurrent aerial imagery would likely include a recent image from just prior to the fire.
We developed Y-NET, a novel deep learning model, to overcome the scanning recurrence and range limitations of current methods and to improve the spatial resolution of previous related work. Y-NET generates a VSM with 1-m resolution given multispectral 2D aerial imagery (Figure 1b) and terrain data. Y-NET's architecture is designed with a priori knowledge of the input data to learn from imagery and terrain inputs separately, and combines the compressed data in latent space for accurate vegetation height estimation. To validate the efficacy of our approach, we use aerial imagery obtained from the United States National Agriculture Imagery Program (NAIP) [27], and DEMs and LiDAR scans from the USGS [21], all of which are freely available to the public. We evaluated Y-NET in the East San Francisco Bay Area, a highly researched area at high risk of wildland fire, where mitigation strategists are actively seeking high-resolution modeling through multi-million-dollar government-funded research grants [28,29,30].
Y-NET uses supervised learning, meaning the model trains on landscape A, for which LiDAR data are included as a ground truth measure. Y-NET is then evaluated on landscape B, where the model only receives imagery and terrain data to recreate the 3D structure of B. Regions in landscape A are separated into the commonly named, disjoint training and validation datasets for training. Landscape B composes the testing dataset, which is completely spatially separate from landscape A. To evaluate the performance of Y-NET, we compare the generated VSM for landscape B with the corresponding LiDAR and calculate the pixel-wise error in height, consistent with other work [14,31,32]. We find empirically that Y-NET achieves high accuracy and a high R² value with 1-m spatial resolution data by formalizing the problem as being similar to semantic image segmentation (similar to past work [33,34,35]).
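The pixel-wise comparison against LiDAR can be expressed with standard error metrics. A small sketch (the metric definitions follow common practice; the exact formulas used in any given study may differ):

```python
import numpy as np

def evaluate_vsm(pred: np.ndarray, lidar: np.ndarray):
    """Pixel-wise error between a generated VSM and the LiDAR ground truth."""
    err = pred - lidar
    mae = np.mean(np.abs(err))                       # mean absolute error (m)
    rmse = np.sqrt(np.mean(err ** 2))                # root mean squared error (m)
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((lidar - lidar.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination
    return mae, rmse, r2

# Tiny 2x2 example: heights in meters.
lidar = np.array([[0.0, 2.0], [4.0, 6.0]])
pred = np.array([[0.5, 2.0], [3.5, 6.0]])
mae, rmse, r2 = evaluate_vsm(pred, lidar)
print(mae, r2)  # 0.25 0.975
```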
While we evaluated Y-NET for spatial generalizability (training on A, testing on B), we expect that temporal generalizability may also be possible given available data (training on data from time t, testing on time t + i, where i is a time offset). However, a proper evaluation of temporal generalizability requires high-resolution multispectral imagery and LiDAR from two distinct times for the same location. We observed two LiDAR scanning missions within our study site, one from 2007 and one from 2018. Since four-band multispectral NAIP imagery was not collected in our study site in 2007, we were unable to perform a proper temporal generalizability evaluation. However, in Section 4 we visually compare Y-NET's VSM, generated from 2016 four-band NAIP imagery, with the 2007 LiDAR in our study site, and show that Y-NET can effectively identify vegetation mitigation and growth given temporal differences. We leave a full in-depth analysis of temporal generalizability to future work reliant on available data.
The contributions of our work are three-fold: (i) we developed Y-NET, a novel deep learning model capable of highly accurate VSM generation given four-band multispectral imagery and terrain data, (ii) we evaluated Y-NET on real data from a highly studied region at high risk of wildland fire, both statistically and visually, for spatial generalizability, and (iii) we visually evaluated the potential for temporal generalizability of Y-NET and show it can identify vegetation mitigation and growth.
4. Discussion
Although we envision Y-NET operating on frequently collected multispectral imagery from satellites, we validate our model using NAIP imagery collected by aircraft because it is in the public domain. As NAIP imagery is obtained from lower-elevation flights, there is expected relief displacement [82] for tree canopies, which is radial from the center of the image, an imperfection not as prevalent in satellite imagery [83]. With relief displacement, tree canopies are projected slightly away from the tree center, and the displacement increases with steeper topography. Although NAIP uses orthoimagery to account for this, imperfections in the USGS's algorithm for correcting relief displacement may cause slight misalignment between NAIP and LiDAR in the same projection. We hypothesize this may be the cause of the error seen around the circumference of tree canopies shown in Figure 8 for tiles a and f, and that it could be minimized when using privately collected satellite imagery.
Furthermore, it is important to note that while the LiDAR in our study was recorded from December 2017 to May 2018, the NAIP imagery was taken in July 2018. According to the Moraga-Orinda Fire District webpage, the government-mandated vegetation mitigation deadline within our study site is 15 June [81]. This means that the grasslands are expected to be taller in the LiDAR than in the NAIP image; therefore, we suspect the 80th-percentile error of 0.24 m for the 0–0.6 m height range is actually slightly exaggerated. While collecting LiDAR and imagery on the same day would allow for an ideal performance analysis, the time and cost of collecting LiDAR make this infeasible for large swaths of land.
4.1. Long CDF Tails
Perhaps more beneficial to our evaluation than observing overall performance metrics is to analyze the instances where Y-NET's predictions are the most statistically inaccurate. As shown by the long tails of the CDFs in Figure 7, a small number of estimations are extremely incorrect, skewing our statistical results in Table 2. We emphasize statistically inaccurate because we observe a small set of examples in our testing dataset with high error, and find that they are likely due to sources of noise that were not removed by our DBSCAN algorithm.
We identify two main instances of noise across the 2018 LiDAR dataset for our study site, shown in Figure 10: LiDAR processing noise and power lines. We use a height color map ranging from blue (short) to yellow (tall) to highlight the anomalous variations more clearly than the overlaid NAIP image. On the left of Figure 10, we observe random irregularities of extremely high values, which are not characteristic of the landscape and cause high performance error for our model.
We identify the second source of high estimation error as power lines, shown on the right of Figure 10. As power lines are physical objects above the Earth's surface, they are recorded as the first object the LiDAR laser contacted, and they have enough neighboring pixels at around the same height to avoid being identified as noise. Furthermore, power lines are too small in the NAIP imagery to be removed by our footprint masking layer, causing the nDSM height for these pixels to be extremely high. The calculated error for Y-NET in instances with power lines is enormous, despite Y-NET modeling the vegetation surface below, as shown in Figure 10.
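The noise-removal idea can be sketched with scikit-learn's DBSCAN, treating isolated high returns as outliers. The eps and min_samples values below are hypothetical tuning choices for the toy example, not those used in our pipeline:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def filter_lidar_noise(points: np.ndarray, eps: float = 1.5, min_samples: int = 4):
    """Drop isolated LiDAR returns (e.g., processing spikes) via DBSCAN.

    points: (N, 3) array of x, y, height in meters.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    return points[labels != -1]  # label -1 marks noise

# A dense 3x3 patch of ~0.5 m grass plus one 60 m spike (processing noise).
grid = np.array([[x, y, 0.5] for x in range(3) for y in range(3)], dtype=float)
spike = np.array([[1.0, 1.0, 60.0]])
cleaned = filter_lidar_noise(np.vstack([grid, spike]))
print(len(cleaned))  # 9: the ground points survive, the spike is removed
```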
4.2. Vegetation Mitigation and Growth
While our evaluation shows the ability of Y-NET to generalize to spatially disjoint locations, we note that Y-NET could be used in practice to identify temporal changes in vegetation, such as mitigation efforts to reduce fire risk and vegetation growth. For example, temporal generalizability involves training on a landscape at one time t, then generating a VSM for the same landscape given future 1-m resolution multispectral images from time t + i, where i is a difference in time. In order to properly evaluate temporal generalizability, LiDAR and multispectral imagery must be available for the same location at two different times. As these data do not exist for our study site, we are unable to statistically evaluate the efficacy of Y-NET for temporal generalizability.
As a result, we perform a secondary evaluation of Y-NET with 2016 NAIP imagery and visually compare the generated VSM with LiDAR (nDSM) from 2007 to draw interesting temporal conclusions. The 2016 NAIP data also shows Y-NET can generalize to images collected with different environmental and atmospheric conditions present during different imaging missions.
Figure 11 contains three tile locations from this evaluation, including the red band of NAIP, as it enhances tree canopies. Columns a and b show two instances of mitigation, highlighted by red boxes, where trees were removed between 2007 and 2016. We clearly see large differences in row 4 between Y-NET and the 2007 nDSM; however, the red band in row 1 shows no tree present in that location in 2016. Column c shows an instance of vegetation growth, where row 1 shows a mature tree that casts a significant shadow, yet the 2007 nDSM shows vegetation with a height of only about 3 m. We suspect the tree grew significantly from 2007 to 2016; Y-NET identifies the tree with a height of 8 m in 2016, visually more representative of the vegetation in the image.
4.3. Impact
We expect Y-NET to have a significant impact on how 3D data are obtained and to support extensive applications that benefit from highly recurrent and accurate 3D data. While we evaluated the spatial generalizability of Y-NET, we also acknowledge that Y-NET can be used for temporal generalizability and validate this visually, although the data for a proper statistical evaluation are not currently available. While LiDAR data are sparse in time, we have shown that once a Y-NET model is trained, it can generate an accurate VSM given spatially disjoint data sources and identify mitigation and growth given different imagery, despite new shadowing and environmental conditions.
While image-based methods such as photogrammetry allow good 3D models to be generated more easily than with LiDAR, they still require multiple photos with precise angular specifications, leading to higher collection costs compared to Y-NET's input data. Furthermore, in the event of a natural disaster such as a wildland fire, photogrammetry would only be effective if the region had been mapped in advance, since smoke occludes imaging at the time of a fire, whereas the aerial imagery Y-NET uses is constantly collected over large swaths of land. Y-NET also allows for VSM recurrence at the timescale of multispectral imagery collection, enabling spatially widespread 3D modeling of vegetation with ever-increasing recurrence as more satellites are deployed. Applications that would directly benefit from constantly updated VSMs include wildfire mitigation and modeling, biomass and ecosystem health assessment, and forest economic yield forecasting. Beyond providing higher-frequency measurements at lower cost than existing methods, we expect new, previously unforeseen applications to emerge as well.
4.4. Input Feature Ablation Study
Figure 12 details the results of an input feature ablation study with Y-NET, in which we trained and evaluated Y-NET while removing a single input feature each time. Doing so reveals the contribution of each input layer to Y-NET's performance. We label each bar such that "!Red" represents NAIP's red band being removed from the input. All values recorded are the difference from the performance of Y-NET with all layers included; therefore, downward-facing bars mean the model yielded less error for that height range, suggesting those layers hinder performance when included and are less important to the output.
We note that the red imagery band and slope terrain layer are most important when determining the height of short vegetation under 1.8 m, though they adversely affect Y-NET for tall vegetation. The NIR imagery band is most important for medium-height vegetation, while the DEM, aspect, green, NIR, and NDVI layers are all significant for the tallest vegetation. With regard to the overall statistical metrics not shown in Figure 12, removing the NIR band causes the highest increase in mean error and RMSE, and also decreases the R² value the most. Consistent with Figure 12, the red imagery and slope terrain layers have the greatest impact on the median error, since this value is consistently below 1.8 m.
While removing layers from Y-NET occasionally leads to better outcomes for specific height ranges, removing an input feature never increases Y-NET’s performance across all heights or evaluation metrics. Thus, we conclude that all input layers are beneficial for generating accurate VSMs, and the original Y-NET model with all layers is superior.
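The leave-one-input-out ablation procedure can be sketched as a simple loop. Here, train_and_eval is a hypothetical stand-in for the full training pipeline and returns a dummy error value purely so the sketch runs:

```python
# Leave-one-input-out ablation: retrain with each layer removed, then record
# the change in error relative to the full model ("!red" = red band removed).
FEATURES = ["red", "green", "blue", "nir", "ndvi", "dem", "slope", "aspect"]

def train_and_eval(features):
    """Hypothetical stand-in: train on `features`, return mean error in meters."""
    return 1.0 + 0.05 * (len(FEATURES) - len(features))  # dummy value

baseline = train_and_eval(FEATURES)
deltas = {}
for feat in FEATURES:
    ablated = [f for f in FEATURES if f != feat]
    # Positive delta: error rose without `feat`, so that layer helps the model.
    deltas[f"!{feat}"] = train_and_eval(ablated) - baseline
```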
4.5. Future Work
While Y-NET is an important first step, we believe there is an abundance of future work with highly recurrent VSMs and at the intersection of artificial intelligence and remote sensing. Although Y-NET achieves good performance at high resolution, wildfire modeling software at high resolution has been lacking [17], causing a spatial disconnect with actual fire-fighting efforts. Perhaps existing fire modeling software such as Farsite [26] or FireCast [25] could be re-purposed to model fires at 1-m resolution and leverage Y-NET's VSM for improved accuracy where recent LiDAR is not available.
In this paper, we have only discussed supervised machine learning, although we believe WUI fire modeling and mitigation strategies would also benefit from reinforcement learning given an up-to-date VSM from Y-NET. The intersection of reinforcement learning and remote sensing is almost entirely unexplored, and we envision an abundance of interesting applications emerging from combining the generalizability of Y-NET with reinforcement learning algorithms. We refer the reader to [84] for more information on reinforcement learning. With data-driven approaches emerging as a dominant force in research and development, we hope a proper evaluation of Y-NET's temporal generalizability will be possible as future ground-truth data become available.
5. Conclusions
We propose Y-NET, a novel deep learning model to generate a high resolution VSM from readily available visual data and terrain data. The motivation behind our work stems from cost, complexity, and periodicity limitations of LiDAR for widespread remote sensing, and the need for current and future modeling software to transition towards higher resolution. Among many domains, wildland fire fighting and modeling is one that would directly benefit from up-to-date high resolution vegetation data, especially since Y-NET can generate a VSM given aerial imagery just before the time of a fire. We also expect the VSMs Y-NET generates to unlock a class of unprecedented applications which rely on up-to-date vegetation modeling.
Our method improves on past attempts to use deep learning for remote sensing by moving to 1-m resolution and achieving a high R² value with low error relative to LiDAR. We empirically show that grouping similar input features into separate encoder branches and including skip connections in Y-NET drastically improves estimation performance. By visually evaluating Y-NET, we validate the model's robustness given tiles with varying heights and densities of vegetation, ranging from single trees, to forests and grasslands, to the WUI. Furthermore, we identify instances of noise in existing LiDAR, deploy a DBSCAN clustering algorithm to mitigate that noise, and show that Y-NET can effectively model vegetation height and identify instances of vegetation growth and mitigation. Finally, we assert that Y-NET is a first step toward using deep learning for VSMs. The abundance of available spatial data is attractive for data-driven techniques such as neural networks, and we argue the intersection between deep learning and GIS is currently underutilized.