data science for
nighttime lights

part 1: data processing


Use R + Plotly to transform satellite imagery
into social insight



Credits
Technical Guidance

  • Chris Elvidge (NOAA)
Visualizations
  • Star Ying (Commerce Data Service)
  • Jeff Chen (Commerce Data Service)

Every day, NOAA satellites collect terabytes of raw data, and every satellite pass is an opportunity to improve our understanding of society in near real time. The imagery captures different bandwidths of light, enabling a wide range of scientific and operational capabilities. In particular, nighttime satellite data can be transformed into approximations of economic activity as well as demographic patterns.

In collaboration with the National Oceanic and Atmospheric Administration (NOAA), the Commerce Data Service has conducted a number of research and development sprints to identify new applications of satellite imagery that improve economic monitoring.

Among the many satellite missions is the Suomi National Polar-orbiting Partnership (Suomi NPP), which carries five state-of-the-art instruments that help us understand the Earth in unprecedented detail. Among these instruments is the Visible Infrared Imaging Radiometer Suite (VIIRS), which is designed to provide moderate-resolution, radiometrically accurate images of the entire Earth twice daily. The VIIRS instrument collects a variety of data corresponding to different bandwidths of light. These in turn can be combined through well-tuned remote sensing algorithms to monitor and analyze a robust set of environmental conditions, including aerosols, clouds, land (fires, temperature), and ocean (ocean color and sea temperature). The immediate scientific contributions have been many, and the applications in fields beyond earth science remain largely untapped.

VIIRS

The VIIRS Day/Night Band (DNB) collects levels of nighttime light across the globe at 750-meter resolution, which can be used to proxy population among other demographic and economic indicators. The data is available in a number of forms, including daily imagery as well as monthly and annual composites.

For ease of analysis, we’ll focus on the monthly composites. To illustrate the value, the 35 largest US cities by population are mapped and colored in order to reveal the intricate detail and gradations in each local urban environment. Within each city, the relative intensity of light is indicative of differences in human activity: brighter yellow indicates relatively more activity and darker blue indicates relatively less. The development patterns are distinct in each city, contoured to the waterways and roadways, and the central business districts appear as brilliant yellow clusters.



In the map above, light intensity is scaled relative to each city's own light distribution. Placing all cities on the same light intensity scale, as in the map below, allows for between-city comparisons that reveal relative differences in human activity across the country. Whereas the yellow areas were relatively contained in the previous map, the new map shows yellow areas that are far more widespread in large cities like New York and Los Angeles, indicating that neighborhoods within those cities are among the brightest in the United States. Occasional specks of black indicate areas where the emitted light is "off the charts" and was not well sampled in the interval calculation due to its rarity.


Seeing is just the beginning

Beyond visual examination, the radiance (light) distribution for a given geographic area provides clues into the size and density of the local population. To help illustrate this, the radiance distribution is plotted for five Metropolitan Statistical Areas (MSAs), a common geographic unit of analysis, selected to represent different strata of the population distribution.

In the histograms below, the x-axis represents the radiance level captured in the nighttime imagery, logarithmically scaled for ease of interpretation, where larger values indicate exponential increases in brightness. The y-axis represents the total number of pixels within the geographic footprint that correspond to each level of brightness. The Total Nighttime Light (TNL) values, which correspond roughly to radiance multiplied by pixel count, summed across all radiance levels, are provided in the labels. It is clear that MSAs with greater population have larger TNL values as well as fuller radiance distributions that span both low and high radiance values.
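
To make the TNL arithmetic concrete, here is a minimal sketch, where the vector rad is a hypothetical stand-in for the per-pixel radiances of one MSA (the masq() function defined later produces exactly this kind of vector):

##Hypothetical per-pixel radiances for one geographic area
  rad <- c(0.4, 0.4, 1.2, 5.0, 5.0, 5.0)

##TNL is the sum of radiance over all pixels...
  sum(rad)

##...which equals radiance level times pixel count, summed over levels
  sum(as.numeric(names(table(rad))) * table(rad))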



In direct comparison, TNL and population are positively correlated, with a relatively high correlation coefficient of 0.78. TNL therefore shows promise as a proxy, and the approximation can be tightened by incorporating the shape of the radiance distribution to better capture density.
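
Once the joined data frame built in Sec 4 is in hand, the comparison can be reproduced in one line. This is a sketch; whether the 0.78 figure was computed on levels or on logged values is not stated here, so both are shown:

##Correlation of TNL and population (assumes `joined` from Sec 4)
  cor(joined$TNL, joined$Population)            ##on levels
  cor(log(joined$TNL), log(joined$Population))  ##on logged values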

Correlating with population is just the beginning. Because the data is collected and disseminated daily in great spatial detail, there is an opportunity to use these approximations to improve the timeliness and spatial granularity of socioeconomic and demographic measures.


Getting Started

In this tutorial, we will illustrate how to manipulate raster data using the R statistical programming language. As a refresher, raster data is essentially a matrix of pixels organized into a grid (rows and columns) where each grid cell contains a value representing some measurement, such as light, temperature, or concentration. Photographs and satellite imagery are considered raster data. By the end of this notebook, you will be able to process raster data into a form that can be joined with basic demographic data (population estimates).
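
If rasters are unfamiliar, the minimal sketch below (with made-up values) builds a toy 3 x 3 raster to show that, under the hood, it is just a gridded matrix:

##A toy raster: a 3 x 3 grid of hypothetical radiance-like values
  library(raster)
  toy <- raster(nrows = 3, ncols = 3, xmn = 0, xmx = 3, ymn = 0, ymx = 3)
  values(toy) <- c(0.5, 1.2, 0.8,
                   2.4, 9.9, 3.1,
                   0.7, 1.0, 0.6) ##values fill row by row from the top left
  as.matrix(toy)                  ##recover the underlying grid of values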

The processing steps are as follows: (1) load the required libraries, (2) load the satellite raster along with MSA shapefiles and population estimates, (3) map nighttime lights for the most populous cities, and (4) process the raster into vectors that can be compared against population through visualizations.

To get started quickly, the code for this tutorial can be found in the accompanying GitHub repo.

Sec 1: Leading off

Start off by specifying the working directory (used later when unzipping shapefiles) as well as loading seven libraries:

library(doParallel) ##parallel backend for foreach
library(foreach)    ##looping construct with parallel support
library(raster)     ##reading and manipulating raster data
library(sp)         ##spatial data classes for shapefiles
library(rgdal)      ##geospatial data input/output
library(ggmap)      ##geocoding city names
library(plotly)     ##interactive graphics

Sec 2: Loading Data

VIIRS Day/Night Band: Satellite Data

Satellite data is complex and processed in a number of ways to correct for a variety of environmental conditions, such as stray light. For analyzing population demographics, we use the VIIRS DNB monthly composites that omit records affected by stray light. To obtain the data:

  • Raster files can be downloaded from http://ngdc.noaa.gov/eog/viirs/download_monthly.html;
  • Under each month’s folder, select Tile1_75N180W, which contains data for North America;
  • Within this folder, download the file labeled VCMCFG containing data that excludes any stray light.
  • There are two files. The file whose name ends in avg_rade9 contains average radiances; this is the one to set aside for use. The other, ending in cf_cvg, contains the number of cloud-free observations that went into each pixel’s average.

Create a folder labeled “imagery” and save the avg_rade9 .tif file into that directory. Below, we then specify the path for future use, open the first and only raster in the list of .tif files, then assign WGS84 as the spatial projection.

##Set directory path
  imagery = "[path]/imagery"

##Obtain a list of TIF files, load in the first file in list
  tifs = list.files(imagery,pattern = "\\.tif")
  rast <- raster(paste(imagery,"/",tifs[1],sep=""))

##Specify WGS84 as the projection of the raster file
  wgs84 <- "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"
  projection(rast) <- CRS(wgs84)
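
As an optional sanity check, print the raster object; exact dimensions vary by tile and release, so treat this as a quick sketch rather than expected output:

##Inspect the loaded raster (assumes `rast` from above)
  rast           ##prints class, dimensions, resolution, extent, and CRS
  inMemory(rast) ##large tiles typically stay on disk until cropped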


Metropolitan Statistical Areas (MSAs)

In addition, we use MSA shapefiles and population data from the US Census Bureau. To draw down the shapefile, we write a function that downloads and unzips it from the US Census Bureau website, then we assign the same WGS84 projection used for the raster. Working in a common spatial projection lets us combine shapefiles and raster files.

   ##Draw down MSA Shapefile
  shape_direct <- function(url, shp) {
    library(rgdal)
    temp = tempfile()
    download.file(url, temp) ##download the URL target to the temp file
    unzip(temp,exdir=getwd()) ##unzip that file
    return(readOGR(paste(shp,".shp",sep=""),shp))
  }

  msa <- shape_direct(url="http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_cbsa_20m.zip", 
  shp= "cb_2014_us_cbsa_20m")
  projection(msa) <- CRS(wgs84)
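
A quick way to verify that the two layers will line up is to compare their projection strings; this check assumes rast and msa from the steps above:

##Both layers should now report the same CRS
  projection(rast) == projection(msa) ##expect TRUE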

We load in a CSV containing population data by MSA, as found in the Census Bureau Population Estimates section. As we only need MSAs, we’ll subset the data using the LSAD field, re-sort the data frame by 2014 population estimates, and convert the MSA names to character strings as opposed to factors. This will save some trouble down the line.

  msa_pop <- read.csv("http://www.census.gov/popest/data/metro/totals/2014/files/CBSA-EST2014-alldata.csv")
  msa_pop <- msa_pop[msa_pop$LSAD=="Metropolitan Statistical Area",]
  msa_pop <- msa_pop[order(msa_pop$POPESTIMATE2014),]
  msa_pop$NAME <- as.character(msa_pop$NAME) 
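
To confirm the subset and sort behaved as expected, peek at the tail of the ascending sort, where the most populous MSAs should now sit:

##Five most populous MSAs (assumes `msa_pop` from above)
  tail(msa_pop[,c("CBSA","NAME","POPESTIMATE2014")], 5)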

Sec 3: Mapping

At this point, we’ll begin to map the top 35 most populous cities, scaling the color palette to reveal local detail within each city. The top 35 cities are listed below:

  cities <- c("New York, NY", "Los Angeles, CA","Chicago, IL", "Houston, TX",
              "Philadelphia, PA", "Phoenix, AZ", "San Antonio, TX", "San Diego, CA",     
              "Dallas, TX", "San Jose, CA", "Austin, TX", "Jacksonville, FL",
              "San Francisco, CA", "Indianapolis, IN", "Columbus, OH", "Fort Worth, TX",
              "Charlotte, NC", "Detroit, MI", "El Paso, TX", "Seattle, WA",
              "Denver, CO","Washington, DC", "Memphis, TN", "Boston, MA",
              "Nashville, TN", "Baltimore, MD", "Oklahoma City, OK", "Portland, OR",
              "Las Vegas, NV", "Louisville, KY","Milwaukee, WI","Albuquerque, NM",
              "Tucson, AZ","Fresno, CA","Sacramento, CA")

Next, we’ll set up the graph layout with no margins (mai), with 7 rows and 5 columns (mfrow), with a navy blue background (bg).

##Set graph layout
  par(mai=c(0,0,0,0),mfrow = c(7,5),bg='#001a4d', bty='n')

Now let’s get looping to create a montage of city maps. The process for each city is as follows: geocode the city center, crop the raster to a rectangle around that point, derive color breaks by running k-means on the cropped radiances, and plot the crop with the city name as a label.

##Loop through data
  coords <- data.frame() ##place holder
  
  for(i in 1:length(cities)){
    
    ##Coords
    temp_coord <- geocode(cities[i], source = "google")
    coords <- rbind(coords,temp_coord)
    
      e <- extent(temp_coord$lon - 1, temp_coord$lon + 1,
                  temp_coord$lat - 0.25, temp_coord$lat + 0.25)
      rc <- crop(rast, e)    
      
    ##Rescale brackets
      sampled <- as.vector(rc)
      clusters <- 15
      clust <- kmeans(sampled,clusters)$cluster
      combined <- as.data.frame(cbind(sampled,clust))
      brk <- sort(aggregate(combined[,1], list(combined[,2]), max)[,2])
      
    #Plots
      plot(rc, breaks=brk, col=colorRampPalette(c("#001a4d","#0066FF", "yellow"))(clusters), 
           legend=F,yaxt='n',xaxt='n',frame = F, asp=1.5)
      text(temp_coord$lon ,temp_coord$lat + 0.15,
           substr(cities[i],1,regexpr(",",cities[i])-1), 
           col="white", cex=1.25)
       
    rm(combined)
  }
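
The k-means step above is just one way of choosing breaks that adapt to each city's radiance distribution. A simpler alternative worth trying is quantile breaks, sketched here for a cropped raster rc such as the last crop from the loop:

##Hypothetical alternative: quantile breaks instead of k-means
  brk_q <- unique(quantile(as.vector(rc), probs = seq(0, 1, length.out = 16),
                           na.rm = TRUE)) ##16 edges bound up to 15 color bins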

Whereas the map above shows city-specific patterns, the second map focuses on comparing patterns across cities using the same color coding. Rather than running a k-means algorithm for each city in order to generate the radiance intervals, one set of intervals is generated based on a random sample of 20,000 pixels from across the country.

#Set layout
  par(mai=c(0,0,0,0),mfrow = c(7,5),bg='#001a4d', bty='n')

#Run clustering
  set.seed(123) #set seed for reproducibility
  sampled <- sample(rast, 20000) #sample 20,000 pixels
  clusters <- 15 ##15 clusters
  clust <- kmeans(sampled,clusters)$cluster
  combined <- as.data.frame(cbind(sampled,clust))
  brk <- sort(aggregate(combined[,1], list(combined[,2]), max)[,2])

##Loop through each city
  for(i in 1:length(cities)){
    
    temp_coord <- coords[i,] ##re-use the coordinates 
      e <- extent(temp_coord$lon - 1, temp_coord$lon + 1,
                  temp_coord$lat - 0.25, temp_coord$lat + 0.25)
      rc <- crop(rast, e)    
    
    #Plots
      plot(rc, breaks=brk, col=colorRampPalette(c("#001a4d","#0066FF", "yellow"))(clusters), 
           legend=F,yaxt='n',xaxt='n',frame = F, asp=1.5)
      text(temp_coord$lon ,temp_coord$lat + 0.15,
           substr(cities[i],1,regexpr(",",cities[i])-1), 
           col="white", cex=1.25)
       
  }


Processing raster data into matrices and vectors

Moving on to more advanced functionality, we’ll now want to go beyond the visuals and analyze the underlying radiance distribution. To do so, we first write a function that allows parallel processing. By writing the code in a manner that isolates processing to each polygon, we can process multiple polygons simultaneously if need be.

The function accepts three parameters: shp, the shapefile containing the polygons; rast, the raster to extract from; and i, the index of the polygon to process.

  masq <- function(shp,rast,i){
    
    #Extract one polygon based on index value i
      polygon <- shp[i,] #extract one polygon
      extent <- extent(polygon) #extract the polygon extent 
    
    #Raster extract
      outer <- crop(rast, extent) #extract raster by polygon extent
      inner <- mask(outer,polygon) #keeps values from raster extract that are within polygon
      
    #Convert cropped raster into a vector
      #Get cell-center coordinates in the same row-major order (top-left
      #first) as the values returned by as.vector() below
        coords <- as.data.frame(xyFromCell(inner, 1:ncell(inner)))
      #Convert raster into vector
        data <- as.vector(inner)
      
      #package data in neat dataframe
        data <- cbind(as.character(shp@data$CBSAFP[i]), coords, data) 
        colnames(data) <- c("GEOID","lon","lat","avg_rad")
        data <- data[!is.na(data$avg_rad),] #keep non-NA (in-polygon) values only
    
    return(data)
  }
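
Before scaling up, it helps to test the function on a single polygon; this usage sketch assumes msa and rast from Sec 2:

##Test the function on one MSA polygon
  one <- masq(msa, rast, 1)
  head(one) ##GEOID, lon, lat, avg_rad for each pixel inside the polygon
  nrow(one) ##number of pixels that fall within the polygon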

Sec 4: Comparing measures through visualizations

Histogram by select MSA

For a quick example, we’ve selected five MSAs that are serially processed into histograms of logarithmically transformed radiance. The data is processed first for each of the five MSAs, then turned into a series of histograms using a combination of ggplot and plotly.

##MSAs by GEOID
  msa_list <- c(16180,19140,45820,42540,35620)

##Placeholder
  radiances <- data.frame() 
  
##Loop MSA file
  for(i in msa_list){
  
    print(i)
    
    #Extract MSA i polygon
      shp_temp <- msa[msa@data$GEOID==i,]
      
    #Extract MSA abbreviated name
      if(regexpr("-",as.character(shp_temp@data$NAME)[1])[1]==-1){
        loc = as.character(substr(as.character(shp_temp@data$NAME)[1],1,regexpr(",",as.character(shp_temp@data$NAME)[1])-1))
      } else{
        loc = as.character(substr(as.character(shp_temp@data$NAME)[1],1,regexpr("-",as.character(shp_temp@data$NAME)[1])-1))
      }
    
    #Extract the radiances, append to radiances placeholder
      rad <- masq(shp_temp,rast,1)$avg_rad 
      temp <- data.frame(loc = as.character(paste(loc,"(TNL = ",round(sum(rad),0),")",sep="")), avg_rad = rad) 
      radiances <- rbind(radiances,temp)
  }
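
As a quick check on the loop output, the TNL figures embedded in the labels can also be tabulated directly from the radiances data frame:

##Sum radiance by MSA label (assumes `radiances` from the loop above)
  aggregate(avg_rad ~ loc, data = radiances, FUN = sum)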

#Use ggplot to create histograms by MSA group; plotly will convert this below
  ggplot(radiances, aes(x=log(avg_rad))) +
    geom_histogram(position="identity", alpha=0.4) +
    facet_grid(. ~ loc)

#Remove all axes labels for style
    x <- list(
        zeroline = FALSE,
        showline = FALSE,
        showticklabels = FALSE,
        showgrid = FALSE
    )
    y <- list(
        zeroline = FALSE,
        showline = FALSE,
        showticklabels = FALSE,
        showgrid = FALSE
    ) 
    
#Initiate a plotly graph without axes
  ggplotly()  %>% layout(xaxis=x, yaxis=y)

Scatterplot of TNL vs. Population

Extracting radiance data from five MSAs takes a few seconds, but a few hundred MSAs takes a while longer. We rely on the doParallel and foreach libraries to parallelize the data extraction. Below, two processor cores are used to extract the TNL for each MSA and record the results in a consolidated data frame of MSAs and TNL values. This data frame is then joined with the MSA population file to allow direct comparison between population and TNL. The resulting interactive graph illustrates a clear positive correlation between the two quantities.

##Set up comparisons
    registerDoParallel(cores=2)
    extract <- foreach(i=1:nrow(msa@data), .combine=rbind) %dopar% {
        data <- masq(msa,rast,i)
        data.frame(GEOID = data$GEOID[1],sum = sum(data$avg_rad))
    }
    extract$GEOID <- as.numeric(as.character(extract$GEOID))
    
  ##Join in data
    joined<-merge(extract, msa_pop[,c("CBSA","NAME","POPESTIMATE2014")],by.x="GEOID",by.y="CBSA")
   
    colnames(joined) <- c("GEOID","TNL","MSA","Population")
    
    plot_ly(joined, 
            x = log(TNL), 
            y = log(Population), 
            text = paste("MSA: ", MSA),
            mode = "markers", 
            color = TNL,colors="PuOr")  %>% 
            layout(title="Total Nighttime Light vs. Population", showlegend = F)
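
To put a number on the visual trend, a simple log-log regression is a natural next step; this sketch assumes the joined data frame from above:

##Quantify the TNL-population relationship on the log scale
  fit <- lm(log(Population) ~ log(TNL), data = joined)
  summary(fit)$r.squared ##share of variance in log population explained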

Basic joining is just the beginning. In an upcoming tutorial, we will demonstrate how to mine county-level data and identify which indicators can be accurately predicted using VIIRS data.