Azure Data Lake Storage Gen2 Introduction - Azure Storage (2024)

Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob Storage.

Data Lake Storage Gen2 converges the capabilities of Azure Data Lake Storage Gen1 with Azure Blob Storage. For example, Data Lake Storage Gen2 provides file system semantics, file-level security, and scale. Because these capabilities are built on Blob storage, you also get low-cost, tiered storage, with high availability/disaster recovery capabilities.

Data Lake Storage Gen2 makes Azure Storage the foundation for building enterprise data lakes on Azure. Designed from the start to service multiple petabytes of information while sustaining hundreds of gigabits of throughput, Data Lake Storage Gen2 allows you to easily manage massive amounts of data.

What is a Data Lake?

A data lake is a single, centralized repository where you can store all your data, both structured and unstructured. A data lake enables your organization to quickly and more easily store, access, and analyze a wide variety of data in a single location. With a data lake, you don't need to conform your data to fit an existing structure. Instead, you can store your data in its raw or native format, usually as files or as binary large objects (blobs).

Azure Data Lake Storage is a cloud-based, enterprise data lake solution. It's engineered to store massive amounts of data in any format, and to facilitate big data analytical workloads. You use it to capture data of any type and ingestion speed in a single location for easy access and analysis using various frameworks.

Data Lake Storage Gen2

Azure Data Lake Storage Gen2 refers to the current implementation of Azure's Data Lake Storage solution. The previous implementation, Azure Data Lake Storage Gen1, will be retired on February 29, 2024.

Unlike Data Lake Storage Gen1, Data Lake Storage Gen2 isn't a dedicated service or account type. Instead, it's implemented as a set of capabilities that you use with the Blob Storage service of your Azure Storage account. You can unlock these capabilities by enabling the hierarchical namespace setting.

Data Lake Storage Gen2 includes the following capabilities.

✓ Hadoop-compatible access

✓ Hierarchical directory structure

✓ Optimized cost and performance

✓ Finer grain security model

✓ Massive scalability

Hadoop-compatible access

Azure Data Lake Storage Gen2 is primarily designed to work with Hadoop and all frameworks that use the Apache Hadoop Distributed File System (HDFS) as their data access layer. Hadoop distributions include the Azure Blob File System (ABFS) driver, which enables many applications and frameworks to access Azure Blob Storage data directly. The ABFS driver is optimized specifically for big data analytics. The corresponding REST APIs are surfaced through the endpoint dfs.core.windows.net.

Data analysis frameworks that use HDFS as their data access layer can directly access Azure Data Lake Storage Gen2 data through ABFS. The Apache Spark analytics engine and the Presto SQL query engine are examples of such frameworks.
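
As a rough illustration of how a framework addresses data through the ABFS driver, the sketch below builds and parses URIs of the form `abfss://<file system>@<account>.dfs.core.windows.net/<path>`. The helper names and the account and container names (`contosolake`, `raw`) are hypothetical and used only for illustration; the helpers do no I/O and do not call any Azure SDK.

```python
from urllib.parse import urlparse

# Endpoint suffix named in the text above.
DFS_SUFFIX = "dfs.core.windows.net"


def build_abfss_uri(account: str, file_system: str, path: str = "") -> str:
    """Return an abfss:// URI for a path inside a container (file system)."""
    return f"abfss://{file_system}@{account}.{DFS_SUFFIX}/{path.lstrip('/')}"


def parse_abfss_uri(uri: str) -> dict:
    """Split an abfs:// or abfss:// URI into account, file system, and path."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS URI: {uri!r}")
    # The authority part has the shape <file system>@<account>.<suffix>.
    file_system, _, host = parsed.netloc.partition("@")
    if not file_system or not host.endswith("." + DFS_SUFFIX):
        raise ValueError(f"unexpected authority in {uri!r}")
    account = host[: -len("." + DFS_SUFFIX)]
    return {
        "account": account,
        "file_system": file_system,
        "path": parsed.path.lstrip("/"),
    }


uri = build_abfss_uri("contosolake", "raw", "/sales/2024/orders.parquet")
print(uri)
print(parse_abfss_uri(uri))
```

A framework such as Apache Spark would typically receive a URI like this as the path to read from or write to, provided the ABFS driver and credentials are configured.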

For more information about supported services and platforms, see Azure services that support Azure Data Lake Storage Gen2 and Open source platforms that support Azure Data Lake Storage Gen2.

Hierarchical directory structure

The hierarchical namespace is a key feature that enables Azure Data Lake Storage Gen2 to provide high-performance data access at object storage scale and price. You can use this feature to organize all the objects and files within your storage account into a hierarchy of directories and nested subdirectories. In other words, your Azure Data Lake Storage Gen2 data is organized in much the same way that files are organized on your computer.

Operations such as renaming or deleting a directory become single atomic metadata operations on the directory. There's no need to enumerate and process all objects that share the name prefix of the directory.
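
The difference can be sketched with a toy in-memory model. This is not how the service is implemented; it only contrasts a flat namespace, where a "directory" is just a shared key prefix, with a hierarchical one, where a directory is a node that can be re-linked in one step. Both function names and the sample data are hypothetical.

```python
def rename_flat(objects: dict, old_prefix: str, new_prefix: str) -> int:
    """Flat namespace: rewrite every key that starts with the prefix.

    Returns the number of objects touched, which grows with directory size.
    """
    matching = [key for key in objects if key.startswith(old_prefix)]
    for key in matching:
        objects[new_prefix + key[len(old_prefix):]] = objects.pop(key)
    return len(matching)


def rename_hierarchical(root: dict, parent_path: list, old_name: str, new_name: str) -> int:
    """Hierarchical namespace: re-link one directory entry in its parent.

    Returns the number of entries touched, which is always 1.
    """
    parent = root
    for part in parent_path:
        parent = parent[part]
    parent[new_name] = parent.pop(old_name)
    return 1


flat = {
    "logs/2024/a.json": b"a",
    "logs/2024/b.json": b"b",
    "logs/2023/c.json": b"c",
    "other/d.json": b"d",
}
tree = {
    "logs": {"2024": {"a.json": b"a", "b.json": b"b"}, "2023": {"c.json": b"c"}},
    "other": {"d.json": b"d"},
}

print(rename_flat(flat, "logs/", "archive/"))            # one update per object under logs/
print(rename_hierarchical(tree, [], "logs", "archive"))  # a single update
```

In the flat model, a failure partway through the loop would leave the rename half done, which is why the single metadata operation matters for atomicity as well as speed.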

Optimized cost and performance

Azure Data Lake Storage Gen2 is priced at Azure Blob Storage levels. It builds on Azure Blob Storage capabilities such as automated lifecycle policy management and object level tiering to manage big data storage costs.
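
To make the idea of age-based tiering concrete, here is a minimal sketch of the kind of rule a lifecycle policy expresses. It is not the policy format that Azure Storage uses, and the thresholds are hypothetical values chosen for illustration; only the tier names (hot, cool, archive) correspond to Blob Storage access tiers.

```python
from datetime import date

# Hypothetical thresholds, chosen only for this example.
COOL_AFTER_DAYS = 30
ARCHIVE_AFTER_DAYS = 180


def choose_tier(last_modified: date, today: date) -> str:
    """Pick an access tier from the number of days since last modification."""
    age_days = (today - last_modified).days
    if age_days >= ARCHIVE_AFTER_DAYS:
        return "archive"
    if age_days >= COOL_AFTER_DAYS:
        return "cool"
    return "hot"


today = date(2024, 6, 30)
for modified in (date(2024, 6, 20), date(2024, 5, 1), date(2023, 6, 30)):
    print(modified, "->", choose_tier(modified, today))
```

In practice, a lifecycle management policy applies rules like this automatically, so no application code has to track object age.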

Performance is optimized because you don't need to copy or transform data as a prerequisite for analysis. The hierarchical namespace capability of Azure Data Lake Storage allows for efficient access and navigation. This architecture means that data processing requires fewer computational resources, reducing both the time and cost of accessing data.

Finer grain security model

The Azure Data Lake Storage Gen2 access control model supports both Azure role-based access control (Azure RBAC) and Portable Operating System Interface for UNIX (POSIX) access control lists (ACLs). There are also a few extra security settings that are specific to Azure Data Lake Storage Gen2. You can set permissions either at the directory level or at the file level. All stored data is encrypted at rest by using either Microsoft-managed or customer-managed encryption keys.
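
The following sketch shows how a POSIX-style ACL in short form (for example `user::rwx,group::r-x,other::---`) might be evaluated for a single file or directory. It is a simplified model under stated assumptions: it ignores Azure RBAC, default ACLs, super-user access, and the traversal checks on parent directories, and the principal names (`alice`, `analysts`, and so on) are hypothetical stand-ins for the identifiers a real ACL would contain.

```python
def parse_acl(acl: str) -> dict:
    """Parse 'scope:qualifier:perms' entries into {(scope, qualifier): set of perms}."""
    entries = {}
    for entry in acl.split(","):
        scope, qualifier, perms = entry.strip().split(":")
        entries[(scope, qualifier)] = {p for p in perms if p != "-"}
    return entries


def is_allowed(acl, principal, owner, wanted, groups=frozenset(), owning_group=""):
    """Return True if `principal` holds every permission in `wanted` (e.g. 'rx')."""
    entries = parse_acl(acl)
    mask = entries.get(("mask", ""), set("rwx"))
    need = set(wanted)
    # 1. The owning user is checked against user:: and is not limited by the mask.
    if principal == owner:
        return need <= entries.get(("user", ""), set())
    # 2. A named user entry applies next, limited by the mask.
    if ("user", principal) in entries:
        return need <= (entries[("user", principal)] & mask)
    # 3. Group entries (owning group and named groups), limited by the mask.
    group_perms = []
    if owning_group in groups and ("group", "") in entries:
        group_perms.append(entries[("group", "")])
    group_perms += [entries[("group", g)] for g in groups if ("group", g) in entries]
    if group_perms:
        return any(need <= (perms & mask) for perms in group_perms)
    # 4. Everyone else falls through to other::.
    return need <= entries.get(("other", ""), set())


ACL = "user::rwx,user:alice:rw-,group::r-x,group:analysts:r--,mask::r-x,other::---"
print(is_allowed(ACL, "alice", owner="bob", wanted="r"))   # named user, read
print(is_allowed(ACL, "alice", owner="bob", wanted="w"))   # write is removed by the mask
print(is_allowed(ACL, "carol", owner="bob", wanted="r", groups={"analysts"}))
```

Note how the mask caps what named users and groups can do: the entry for `alice` lists write permission, but the mask removes it.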

Massive scalability

Azure Data Lake Storage Gen2 offers massive storage and accepts numerous data types for analytics. It doesn't impose any limits on account sizes, file sizes, or the amount of data that can be stored in the data lake. Individual files can have sizes that range from a few kilobytes (KBs) to a few petabytes (PBs). Processing is executed at near-constant per-request latencies that are measured at the service, account, and file levels.

This design means that Azure Data Lake Storage Gen2 can easily and quickly scale up to meet the most demanding workloads. It can also just as easily scale back down when demand drops.

Built on Azure Blob Storage

The data that you ingest persists as blobs in the storage account. The service that manages blobs is the Azure Blob Storage service. Data Lake Storage Gen2 describes the capabilities or "enhancements" to this service that cater to the demands of big data analytic workloads.

Because these capabilities are built on Blob Storage, features such as diagnostic logging, access tiers, and lifecycle management policies are available to your account. Most Blob Storage features are fully supported, but some are supported only at the preview level and a handful aren't yet supported. For a complete list of support statements, see Blob Storage feature support in Azure Storage accounts. The status of each listed feature will change over time as support continues to expand.

Documentation and terminology

The Azure Blob Storage table of contents features two sections of content. The Data Lake Storage Gen2 section of content provides best practices and guidance for using Data Lake Storage Gen2 capabilities. The Blob Storage section of content provides guidance for account features not specific to Data Lake Storage Gen2.

As you move between sections, you might notice some slight terminology differences. For example, content featured in the Blob Storage documentation will use the term blob instead of file. Technically, the files that you ingest to your storage account become blobs in your account. Therefore, the term is correct. However, the term blob can cause confusion if you're used to the term file. You'll also see the term container used to refer to a file system. Consider these terms as synonymous.

See also

  • Introduction to Azure Data Lake Storage Gen2 (Training module)
  • Best practices for using Azure Data Lake Storage Gen2
  • Known issues with Azure Data Lake Storage Gen2
  • Multi-protocol access on Azure Data Lake Storage

FAQs

What is the difference between Azure Blob Storage and Azure Data Lake Storage Gen2?

In short, Azure Blob Storage is a simple and cost-effective storage solution for unstructured data. ADLS Gen2, by contrast, is a more advanced storage solution for big data analytics workloads. The choice between the two depends on your specific requirements and use case.

What is Azure Data Lake Storage Gen2 used for?

Azure Data Lake Storage Gen2 (ADLS Gen2) is a cloud-based repository for both structured and unstructured data. For example, you could use it to store everything from documents to images to social media streams. Data Lake Storage Gen2 is built on top of Blob Storage.

What are the key capabilities of Data Lake Storage Gen2?

Data Lake Storage Gen2 provides file system semantics, file-level security, and scale. Because these capabilities are built on Blob Storage, you also get low-cost, tiered storage, with high availability/disaster recovery capabilities.

Which storage service is Azure Data Lake Storage Gen2 built on?

Azure Data Lake Storage Gen2 (ADLS Gen2) is a set of capabilities dedicated to big data analytics built into Azure Blob storage. You can use it to interface with your data by using both file system and object storage paradigms.

What is the difference between Azure Data Lake Storage and Blob Storage?

Azure Blob Storage is one of the most common Azure storage types. It's an object storage service for workloads that need high-capacity storage. Azure Data Lake is a storage service intended primarily for big data analytics workloads.

What is the difference between ADLS Gen1, ADLS Gen2, and Blob Storage?

Difference Between ADLS Gen1 and Gen2

Hadoop compatible: ADLS Gen1 is based on Hadoop Distributed File System (HDFS), while ADLS Gen2 is built on top of Azure Blob Storage. This makes ADLS Gen2 more scalable and cost-effective, as it can leverage the capabilities of Blob Storage.

What are the key components of Azure Data Lake?

Azure Data Lake consists of three main components that provide storage, analytics service, and cluster capabilities.

What is the advantage of Azure Data Lake?

Data lakes provide organizations with a single repository for all their data, both structured and unstructured. Organizations can replicate, move, and store their data from multiple sources in a data lake, data warehouse, or database using data integration.

What is the benefit of Azure Data Lake?

Scalable storage tools like Azure Data Lake Storage can hold and protect data in one central place, eliminating silos at an optimal cost. This lays the foundation for users to perform a wide variety of workload categories, such as big data processing, SQL queries, text mining, streaming analytics, and machine learning.

What is the difference between data store and data lake?

While data warehouses store structured data, a lake is a centralized repository that allows you to store any data at any scale. A data lake offers more storage options, has more complexity, and has different use cases compared to a data warehouse.

Do you need to create a Data Lake Storage Gen2 account before creating an Azure Synapse Analytics workspace?

Azure Data Lake Storage Gen2: You must have an Azure Data Lake Storage Gen2 account and Owner and Storage Blob Data Contributor role access. Your storage account must enable Hierarchical namespace for both initial setup and delta sync. Allow storage account key access is required only for the initial setup.

What is the recommended file size for Data Lake Storage Gen2?

In general, organize your data into larger sized files for better performance (256 MB to 100 GB in size). Some engines and applications might have trouble efficiently processing files that are greater than 100 GB in size. Increasing file size can also reduce transaction costs.
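
As a rough sketch of how this guidance might be applied, the helper below estimates how many output files a compaction job should write so that each file lands near a chosen target size within the 256 MB to 100 GB range. The function name, the 1 GiB default target, and the use of binary units are assumptions made for this example.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3

# Range taken from the guidance above, interpreted here in binary units.
MIN_FILE_SIZE = 256 * MIB
MAX_FILE_SIZE = 100 * GIB


def plan_output_files(input_sizes: list, target_size: int = 1 * GIB) -> int:
    """Return how many files to write so each is roughly `target_size` bytes."""
    if not MIN_FILE_SIZE <= target_size <= MAX_FILE_SIZE:
        raise ValueError("target_size is outside the recommended range")
    total_bytes = sum(input_sizes)
    # Always write at least one file, even for a very small data set.
    return max(1, math.ceil(total_bytes / target_size))


# 10,000 small files of 1 MiB each, compacted toward 1 GiB outputs.
small_files = [1 * MIB] * 10_000
print(plan_output_files(small_files))
```

Fewer, larger files mean fewer read and write requests for the same volume of data, which is where the transaction cost saving comes from.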

How do I create a container in Azure Data Lake Storage Gen2?

Create a container

A container holds directories and files. To create one, expand the storage account you created in the preceding step. Select Blob Containers, right-click, and select Create Blob Container. Alternatively, you can select Blob Containers, then select Create Blob Container in the Actions pane.

How do I create Gen2 storage in Azure?

Creating a Storage Account to use with Microsoft Azure Data Lake Storage Gen2
  1. Under Azure Services, click Storage accounts.
  2. On the Storage accounts page, click ...
  3. On the Basics ...
  4. On the Advanced ...
  5. Click Review + Create ...
  6. Click the newly created storage account name.
  7. Click Access control (IAM) ...
  8. On the Add role assignment ...

How do I create a folder in Azure Data Lake Storage Gen2?

Once our ADLS Gen2 storage account is created, go to the ADLS Gen2 storage, click on containers/folders under the data storage tab, then click on + Create to create a container/folder.

What is Blob Storage and data lake?

Blob Storage is accessible through HTTP or HTTPS. Data Lake can be accessed through various big data processing tools and technologies.

Use case: Blob Storage is used for storing and retrieving large files, such as images, videos, and backups. Data Lake is used for IoT, big data analytics, and machine learning purposes.

What is the difference between Azure Blob Storage and Azure Data Factory?

Microsoft's Blob Storage system on Azure is designed to make unstructured data available to customers anywhere through REST-based object storage. Microsoft's Azure Data Factory is a service built for all data integration needs and skill levels.

What is the difference between Azure Data Lake and Azure data warehouse?

Data lakes store all types of raw data, which data scientists may then use for a variety of projects. Data warehouses store cleaned and processed data, which can then be used to source analytic or operational reporting, as well as specific BI use cases.

What is the difference between Azure Data Hub and data lake?

In summary, a data hub is about sharing and exchanging curated and managed data between systems, services, or parties. A data lake is about creating a vast pool of data in many different formats which can feed analytics, AI or data science services to create value.

Article information

Author: Carmelo Roob
