The storage layer is responsible for providing durable, scalable, secure, and cost-effective components to store vast quantities of data. To ingest data from partner and third-party APIs, organizations build or purchase custom applications that connect to the APIs, fetch data, and create S3 objects in the landing zone by using AWS SDKs. This architecture builds on the one shown in Basic web application. The processing layer also provides the ability to build and orchestrate multi-step data processing pipelines that use purpose-built components for each step. Our architecture uses Amazon Virtual Private Cloud (Amazon VPC) to provision a logically isolated section of the AWS Cloud (called a VPC) that is isolated from the internet and other AWS customers. IAM policies control granular zone-level and dataset-level access for various users and roles. FTP is the most common method for exchanging data files with partners. Your organization can gain a business edge by combining your internal data with third-party datasets such as historical demographics, weather data, and consumer behavior data. After the data is ingested into the data lake, components in the processing layer can define schema on top of S3 datasets and register them in the cataloging layer. When the entire Citrix virtualization system is deployed from scratch, the resulting system on AWS closely matches the following reference architecture diagrams. Diagram 3: Deployed system architecture detail using the CVADS on AWS QuickStart template and default parameters. Data sources that QuickSight can connect to include SaaS applications such as Salesforce, Square, ServiceNow, Twitter, GitHub, and JIRA; third-party databases such as Teradata, MySQL, Postgres, and SQL Server; native AWS services such as Amazon Redshift, Athena, Amazon S3, Amazon Relational Database Service (Amazon RDS), and Amazon Aurora; and private VPC subnets.
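To make the zone-level and dataset-level access control concrete, here is a minimal sketch of an IAM policy document that grants read-only access to one zone prefix, or to a single dataset inside it. The bucket name `example-data-lake`, the prefix layout `<zone>/<dataset>/`, and the helper name `zone_read_policy` are assumptions made for illustration and do not come from the architecture described above; the policy grammar and the `s3:ListBucket` and `s3:GetObject` actions are standard IAM.

```python
import json
from typing import Optional


def zone_read_policy(bucket: str, zone: str, dataset: Optional[str] = None) -> dict:
    """Build an IAM policy document for read-only access to a zone or dataset.

    Assumes (hypothetically) that objects are laid out as <zone>/<dataset>/...
    inside a single data lake bucket.
    """
    prefix = f"{zone}/{dataset}/" if dataset else f"{zone}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it is scoped with a
                # condition on the requested prefix.
                "Sid": "ListZonePrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Reading is an object-level action, so it is scoped by the
                # object ARN pattern.
                "Sid": "ReadZoneObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


if __name__ == "__main__":
    # Dataset-level grant: only the "sales" dataset in the curated zone.
    print(json.dumps(zone_read_policy("example-data-lake", "curated", "sales"), indent=2))
```

Omitting the `dataset` argument widens the grant to the whole zone, which is one simple way to express the two levels of granularity with a single helper.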
AWS Glue provides more than a dozen built-in classifiers that can parse a variety of data structures stored in open-source formats. Datasets stored in Amazon S3 are often partitioned to enable efficient filtering by services in the processing and consumption layers. You can choose from multiple EC2 instance types and attach cost-effective GPU-powered inference acceleration. Lake Formation gives the data lake administrator a central place to set up granular table- and column-level permissions for databases and tables hosted in the data lake. Figure 2: High-Level Data Lake Technical Reference Architecture. Amazon S3 is at the core of a data lake on AWS. With AWS DMS, you can first perform a one-time import of the source data into the data lake and then replicate ongoing changes happening in the source database. AWS Glue automatically generates the code to accelerate your data transformations and loading processes. The consumption layer is responsible for providing scalable and performant tools to gain insights from the vast amount of data in the data lake. These applications and their dependencies can be packaged into Docker containers and hosted on AWS Fargate. Some applications may not require every component listed here. Components from all other layers provide easy and native integration with the storage layer. DataSync automatically handles scripting of copy jobs, scheduling and monitoring transfers, validating data integrity, and optimizing network utilization. In his spare time, Changbin enjoys reading, running, and traveling. Your flows can connect to SaaS applications (such as Salesforce, Marketo, and Google Analytics), ingest data, and store it in the data lake. Analyzing SaaS and partner data in combination with internal operational application data is critical to gaining 360-degree business insights. As the number of datasets in the data lake grows, the cataloging layer makes those datasets discoverable by providing search capabilities.
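The point about partitioned datasets enabling efficient filtering can be illustrated with a small, self-contained sketch. It assumes a Hive-style `name=value` prefix layout partitioned by year, month, and day; that layout, the zone and dataset names, and the helper names are illustrative assumptions rather than something the text prescribes.

```python
from datetime import date
from typing import Dict, Iterable, List


def partitioned_key(zone: str, dataset: str, day: date, filename: str) -> str:
    """Build an S3 object key with Hive-style date partitions."""
    return (
        f"{zone}/{dataset}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


def parse_partitions(key: str) -> Dict[str, str]:
    """Extract name=value partition segments from a key, ignoring the file name."""
    partitions: Dict[str, str] = {}
    for segment in key.split("/")[:-1]:
        if "=" in segment:
            name, value = segment.split("=", 1)
            partitions[name] = value
    return partitions


def prune(keys: Iterable[str], **wanted: str) -> List[str]:
    """Keep only keys whose partition values match every requested filter."""
    return [
        key
        for key in keys
        if all(parse_partitions(key).get(name) == value for name, value in wanted.items())
    ]


if __name__ == "__main__":
    keys = [
        partitioned_key("curated", "orders", date(2020, 3, 1), "part-0000.parquet"),
        partitioned_key("curated", "orders", date(2020, 4, 1), "part-0000.parquet"),
    ]
    print(prune(keys, year="2020", month="03"))
```

A query engine applies the same idea at much larger scale: a filter on the partition columns lets it skip whole prefixes instead of reading every object.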
Participating partners hold designations from the AWS Competency Program, demonstrating technical proficiency. Amazon SageMaker Debugger provides full visibility into model training jobs. Amazon S3 provides virtually unlimited scalability at low cost for our serverless data lake. The ingestion layer uses Amazon Kinesis Data Firehose to receive streaming data from internal and external sources. With a few clicks, you can configure a Kinesis Data Firehose API endpoint where sources can send streaming data such as clickstreams, application and infrastructure logs and monitoring metrics, and IoT data such as device telemetry and sensor readings. Check the AWS Architecture Center to visualize how your environment will look in AWS. You can ingest a full third-party dataset and then automate detecting and ingesting revisions to that dataset. You can schedule AWS Glue jobs and workflows or run them on demand. In Lake Formation, you can grant or revoke database-, table-, or column-level access for IAM users, groups, or roles defined in the same account hosting the Lake Formation catalog or in another AWS account. To achieve fast dashboard performance, QuickSight provides an in-memory caching and calculation engine called SPICE. In the following sections, we look at the key responsibilities, capabilities, and integrations of each logical layer. Amazon S3 encrypts data using keys managed in AWS KMS. QuickSight allows you to directly connect to and import data from a wide variety of cloud and on-premises data sources. In addition, you can use CloudTrail to detect unusual activity in your AWS accounts.
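As a rough illustration of how a source might prepare streaming events for the Kinesis Data Firehose endpoint, the sketch below serializes events as newline-delimited JSON and groups them into batches. It only builds the payloads and makes no AWS calls. The 500-record ceiling reflects the documented per-call limit of the `PutRecordBatch` API; the event shape and helper names are assumptions, and the sketch does not enforce the separate per-call size limit.

```python
import json
from typing import Dict, Iterable, Iterator, List

# PutRecordBatch accepts at most 500 records per call.
MAX_RECORDS_PER_BATCH = 500


def to_record(event: dict) -> Dict[str, bytes]:
    """Serialize one event as a newline-terminated JSON record."""
    line = json.dumps(event, separators=(",", ":")) + "\n"
    return {"Data": line.encode("utf-8")}


def batches(
    events: Iterable[dict], size: int = MAX_RECORDS_PER_BATCH
) -> Iterator[List[Dict[str, bytes]]]:
    """Yield lists of records, each holding at most `size` entries."""
    batch: List[Dict[str, bytes]] = []
    for event in events:
        batch.append(to_record(event))
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final, partially filled batch
        yield batch


if __name__ == "__main__":
    # Hypothetical sensor readings standing in for IoT telemetry.
    readings = ({"device": "sensor-1", "seq": i} for i in range(1200))
    for group in batches(readings):
        print(len(group))
```

Each yielded list has the shape expected by the `Records` parameter of the boto3 Firehose client's `put_record_batch` call. The trailing newline keeps records separable once Firehose concatenates them into S3 objects.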
Many organizations today use SaaS and partner applications such as Salesforce and Marketo, and they typically store their operational data in various relational and NoSQL databases. Some file data is hosted on network-attached storage (NAS) arrays, and analyzing data from these file sources can provide valuable business insights. AWS Data Exchange is serverless and lets you find and ingest third-party datasets with a few clicks. The storage layer is organized into landing, raw, and curated zones, and it can hold source data as-is without first needing to structure it to conform to a target schema or format. To automate cost optimizations, Amazon S3 provides lifecycle and tiering options for moving data to colder tiers, including Amazon S3 Glacier Deep Archive. Hundreds of third-party vendor and open-source products and services provide the ability to read and write S3 objects. Lake Formation provides APIs to enable metadata registration and management, and components in the processing and consumption layers can then use schema-on-read to apply structure to data read from S3 objects. AWS Step Functions provides visual representations of complex workflows and their dependencies. You can submit queries using Athena JDBC or ODBC endpoints, and Amazon Redshift Spectrum can spin up thousands of query-specific temporary nodes to scan exabytes of data. QuickSight lets you create and publish rich, interactive dashboards and offers a cost-effective, pay-per-session pricing model. Amazon SageMaker provides managed Jupyter notebooks and can monitor key model metrics for inference accuracy and detect any concept drift. The security and governance layer covers authentication, authorization, encryption, network protection, usage monitoring, and auditing. Access to the encryption keys is controlled using IAM and is monitored through detailed audit trails in CloudTrail. Amazon CloudWatch provides the ability to analyze logs, visualize monitored metrics, define monitoring thresholds, and send alerts when thresholds are crossed. Amazon VPC provides the ability to choose your own IP address range and create subnets. A layered, component-oriented architecture promotes separation of concerns, decoupling of tasks, and flexibility. A typical modern application might include both a website and one or more RESTful web APIs. © 2020, Amazon Web Services, Inc. or its affiliates.