Before data can move through a pipeline in Azure Data Factory (ADF), ADF needs connection information that lets it access the source and destination data stores. Linked Services are objects that can do the following:
Represent the data store.
Linked Services support a wide variety of data sources, including on-premises databases, cloud data stores and applications, and various file systems, as shown in the following graphic. Data sources are the initial data stores that ADF connects to in order to extract the source data. Data sinks are target data stores, which can be either the final destination of the data or an intermediate data store that is used as a source for further downstream processing. Note that for on-premises servers or Infrastructure as a Service (IaaS) Azure-based virtual machines (VMs), you will need to configure the Data Management Gateway on these servers. This is covered later in the article.
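As a minimal sketch of what such a connection object can look like, the Python below builds a linked service definition for an Azure Storage account and prints it as JSON. The name `AzureStorageLinkedService`, the helper function, and the `<accountname>` and `<accountkey>` placeholders are assumptions made for illustration, and the layout assumes the name / properties / type / typeProperties shape used by ADF JSON definitions.

```python
import json


def storage_linked_service(name: str, account_name: str, account_key: str) -> dict:
    """Build a linked service definition that represents an Azure Storage data store.

    Hypothetical helper: the names and placeholder values are illustrative only.
    """
    connection_string = (
        "DefaultEndpointsProtocol=https;"
        f"AccountName={account_name};AccountKey={account_key}"
    )
    return {
        "name": name,
        "properties": {
            # The type tells ADF which kind of data store this object represents.
            "type": "AzureStorage",
            "typeProperties": {"connectionString": connection_string},
        },
    }


linked_service = storage_linked_service(
    "AzureStorageLinkedService", "<accountname>", "<accountkey>"
)
print(json.dumps(linked_service, indent=2))
```

The same linked service can then be referenced by name from any dataset that reads from or writes to that storage account.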
Represent a compute resource.
Linked Services are also used to represent compute resources that are called to execute activities within ADF. These activities typically transform data using a range of technologies, including HDInsight and SQL Server technologies such as on-premises or IaaS SQL Servers, or Azure SQL Data Warehouse, using stored procedures to perform the transformations.
As well as transformations, you may also use a Linked Service to call a compute resource that performs advanced analytics, such as U-SQL calls to Data Lake Analytics that analyze data stored in a Data Lake. Alternatively, you may wish to create pipelines that use a published Azure Machine Learning web service for predictive analytics. For example, by using the Batch Execution Activity in an Azure Data Factory pipeline, you can invoke an Azure ML web service to make predictions on the data in batch.
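To make the compute case more concrete, the following sketch pairs a linked service for an Azure ML web service with a batch execution activity that refers to it by name. The property names (`mlEndpoint`, `apiKey`, `AzureMLBatchExecution`) are given as an assumption about the JSON layout, and the linked service, activity, and dataset names, the endpoint placeholder, and the API key placeholder are all hypothetical.

```python
import json

# Assumed layout for a compute linked service that points at a published
# Azure ML web service; endpoint and key are placeholders, not real values.
ml_linked_service = {
    "name": "AzureMLLinkedService",
    "properties": {
        "type": "AzureML",
        "typeProperties": {
            "mlEndpoint": "<batch scoring endpoint>",
            "apiKey": "<apikey>",
        },
    },
}

# A pipeline activity names the compute linked service it should run on.
# Input and output dataset names are hypothetical.
scoring_activity = {
    "name": "ScoreCustomers",
    "type": "AzureMLBatchExecution",
    "linkedServiceName": ml_linked_service["name"],
    "inputs": [{"name": "CustomerFeaturesBlob"}],
    "outputs": [{"name": "CustomerScoresBlob"}],
}

print(json.dumps({"linkedService": ml_linked_service, "activity": scoring_activity}, indent=2))
```

Keeping the endpoint and key in the linked service, rather than in the activity, means several activities can reuse the same compute resource without repeating its credentials.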
The full list of activities that a Linked Service compute resource can run is summarised in the following graphic:
Data Management Gateway
As outlined earlier in the article, using on-premises or IaaS-based VMs as a linked service requires a Data Management Gateway to be installed on the server that ADF will access. This installs a client application on the server that enables the movement of data between on-premises and cloud-based locations.
The Data Management Gateway ensures the efficient and secure movement of data between your premises and the cloud. Efficiency is achieved by transferring data in parallel, with automatic retry logic providing resilience to network issues.
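A linked service for an on-premises store differs from a cloud one mainly in that it names the gateway through which ADF reaches the server. The sketch below illustrates this for an on-premises SQL Server. The `OnPremisesSqlServer` type and `gatewayName` property are stated as an assumption about the JSON layout, and the server, database, and gateway names are placeholders.

```python
import json


def on_prem_sql_linked_service(name: str, server: str, database: str, gateway_name: str) -> dict:
    """Build a linked service for an on-premises SQL Server reached through a gateway.

    Hypothetical helper: all argument values used below are placeholders.
    """
    connection_string = (
        f"Data Source={server};Initial Catalog={database};Integrated Security=True;"
    )
    return {
        "name": name,
        "properties": {
            "type": "OnPremisesSqlServer",
            "typeProperties": {
                "connectionString": connection_string,
                # Names the Data Management Gateway that brokers the connection.
                "gatewayName": gateway_name,
            },
        },
    }


on_prem_service = on_prem_sql_linked_service(
    "SqlServerLinkedService", "<servername>", "<databasename>", "<gatewayname>"
)
print(json.dumps(on_prem_service, indent=2))
```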
Secure transfer is achieved by securing credentials with a certificate and by using the HTTP protocol to communicate with the cloud-based services, so that the corporate firewall does not need changing. As part of setting up the linked service, you should use the Setting Credentials application to specify authentication types and credentials. The Setting Credentials dialog communicates with the data store to test the connection and with the gateway to save the credentials. The gateway encrypts the credentials with the certificate associated with the gateway before saving them in the cloud. The gateway then manages the encryption and decryption of the credentials as data movement occurs.
In addition, the gateway can be monitored in the same place as other ADF objects within the Azure portal. This provides a single location in which the data movement processes can be monitored.
Finally, there are other points to consider with the Data Management Gateway, including:
- A single gateway instance is tied to only one Azure data factory and cannot be shared with another data factory.
- A single instance of Data Management Gateway can be used for multiple on-premises data sources.
- You can have only one instance of Data Management Gateway installed on a single machine.
- The gateway does not need to be on the same machine as the data source.
- You can have multiple gateways on different machines connecting to the same on-premises data source.
- If you already have a gateway installed on your computer serving a Power BI scenario, install a separate gateway for Azure Data Factory on another machine.
- A gateway must be used even when you use ExpressRoute.
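Two of the rules above (one gateway per machine, and one data factory per gateway) lend themselves to a quick check when planning a deployment. The following is only an illustrative sketch: the `GatewayRegistration` record, the `find_violations` function, and the machine, gateway, and factory names are all hypothetical and are not part of any ADF tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatewayRegistration:
    gateway: str       # gateway instance name
    machine: str       # host the gateway is installed on
    data_factory: str  # data factory the gateway is registered with


def find_violations(registrations):
    """Return a description of each planned registration that breaks a rule."""
    violations = []
    gateway_on_machine = {}
    factory_of_gateway = {}
    for reg in registrations:
        # Rule: only one gateway instance can be installed on a single machine.
        seen_gateway = gateway_on_machine.setdefault(reg.machine, reg.gateway)
        if seen_gateway != reg.gateway:
            violations.append(f"{reg.machine}: more than one gateway installed")
        # Rule: a gateway instance is tied to exactly one data factory.
        seen_factory = factory_of_gateway.setdefault(reg.gateway, reg.data_factory)
        if seen_factory != reg.data_factory:
            violations.append(f"{reg.gateway}: shared between data factories")
    return violations


planned = [
    GatewayRegistration("gw-sales", "server-01", "SalesFactory"),
    GatewayRegistration("gw-finance", "server-02", "FinanceFactory"),
    # Reuses gw-sales for a second factory, which the rules do not allow.
    GatewayRegistration("gw-sales", "server-01", "FinanceFactory"),
]
violations = find_violations(planned)
print(violations)
```

The remaining rules, such as one gateway serving several data sources or several gateways reaching the same source, are permissions rather than restrictions, so the sketch does not need to check them.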
Datasets
An activity takes zero or more datasets as inputs and one or more datasets as outputs. Datasets represent data structures within the data stores; they simply point to, or reference, the data you want to use in your activities as inputs or outputs. For example, an Azure Blob dataset specifies the blob container and folder in Azure Blob Storage from which the pipeline should read the data. Similarly, an Azure SQL Table dataset specifies the table to which the activity writes its output data.
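The two examples above can be sketched as dataset definitions that each point at a linked service and then narrow down to a folder or a table. The dataset names, linked service names, folder path, and table name below are hypothetical, the `AzureBlob` and `AzureSqlTable` layouts are given as an assumption, and the sketch omits other properties a real definition would also need.

```python
import json


def blob_dataset(name: str, linked_service: str, folder_path: str) -> dict:
    """Dataset that points at a container/folder in Azure Blob Storage."""
    return {
        "name": name,
        "properties": {
            "type": "AzureBlob",
            "linkedServiceName": linked_service,
            "typeProperties": {"folderPath": folder_path},
        },
    }


def sql_table_dataset(name: str, linked_service: str, table_name: str) -> dict:
    """Dataset that points at a table in an Azure SQL database."""
    return {
        "name": name,
        "properties": {
            "type": "AzureSqlTable",
            "linkedServiceName": linked_service,
            "typeProperties": {"tableName": table_name},
        },
    }


# Hypothetical input and output for a single copy-style activity.
input_dataset = blob_dataset("SalesBlobInput", "AzureStorageLinkedService", "mycontainer/sales")
output_dataset = sql_table_dataset("SalesSqlOutput", "AzureSqlLinkedService", "Sales")
print(json.dumps([input_dataset, output_dataset], indent=2))
```

Note that neither dataset holds credentials: each only names its linked service, which is where the connection information lives.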