What is Glue?
▶ Fully Managed ETL Service
▶ Consists of a Central Metadata Repository - Glue Data Catalog
▶ A Spark ETL Engine
▶ Flexible Scheduler
Why Glue?
AWS Glue offers a fully managed, serverless ETL tool. This removes the overhead (extra cost, time, and resources) and the barriers to entry when an ETL service is required in AWS.
Set Up
▶ S3 bucket
Create a bucket of your choice and upload a practice CSV file to a path in it.
▶ IAM role
Create a role - Glue - Admin (full access); create it this way for now.
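The two setup steps above can also be scripted. The following is a minimal sketch using boto3, not the only way to do it. The bucket name, file path, and role name passed in are hypothetical placeholders; the trust policy is what allows the Glue service to assume the role, and the AdministratorAccess policy mirrors the "full access for now" choice above.

```python
import json


def glue_trust_policy() -> dict:
    """Trust policy that lets the AWS Glue service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def set_up(bucket: str, local_csv: str, key: str, role_name: str) -> None:
    """Upload the practice CSV and create the Glue role (needs AWS credentials)."""
    import boto3  # imported lazily so the policy helper works without boto3

    # Assumes the bucket already exists (created in the console, for example).
    boto3.client("s3").upload_file(local_csv, bucket, key)

    iam = boto3.client("iam")
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(glue_trust_policy()),
    )
    # Full admin access, as in the notes above; far too broad for production.
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/AdministratorAccess",
    )
```

For anything beyond practice, a narrower policy such as the AWS managed AWSGlueServiceRole plus access to the specific bucket is likely a better fit.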
AWS Glue Data Catalog
Persistent Metadata Store
▶ It is a managed service that lets you store, annotate, and share metadata, which can then be used to query and transform data
▶ One AWS Glue Data Catalog per AWS region
▶ Identity and Access Management (IAM) policies control access
▶ Can be used for data governance
AWS Glue Database
A set of associated Data Catalog table definitions organized into a logical group.
AWS Glue Tables
The metadata definition that represents your data. The data resides in its original store. This is just a representation of the schema.
Partitions in AWS
Folders on S3 where the data is stored are physical entities. They are mapped to partitions, which are logical entities, i.e. columns in the Glue table.
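To make the folder-to-column mapping concrete, here is a small illustrative sketch (plain Python, no AWS calls). It assumes Hive-style `name=value` folder names, a common layout in which each folder name becomes a partition column; the object key shown is a made-up example.

```python
def partition_values(s3_key: str) -> dict:
    """Extract Hive-style partition columns (name=value folders) from an S3 key."""
    partitions = {}
    # The last path segment is the file name, so it is skipped.
    for segment in s3_key.split("/")[:-1]:
        if "=" in segment:
            name, value = segment.split("=", 1)
            partitions[name] = value
    return partitions


# Hypothetical object key: two folders become two partition columns.
print(partition_values("sales/year=2023/month=01/data.csv"))
# prints {'year': '2023', 'month': '01'}
```

Folders that do not follow the `name=value` pattern (such as `sales/` here) carry no column name of their own, so this sketch ignores them.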
AWS Crawlers
A program that connects to a data store (source or target), progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in the AWS Glue Data Catalog.
(1) Creating a table
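A table can also be defined manually through the API rather than the console. The sketch below builds a `TableInput` for a comma-separated file and passes it to boto3's `create_table`. The database name, table name, S3 location, and columns are assumptions to be replaced with your own; the header-skipping parameter assumes the CSV has one header row.

```python
def csv_table_input(table_name: str, s3_location: str, columns: list) -> dict:
    """Build a Glue TableInput describing CSV data; columns is [(name, type), ...]."""
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        # Assumes the file has a single header row.
        "Parameters": {"classification": "csv", "skip.header.line.count": "1"},
        "StorageDescriptor": {
            "Columns": [{"Name": name, "Type": col_type} for name, col_type in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }


def create_database_and_table(database: str, table_input: dict) -> None:
    """Create the database and table in the Data Catalog (needs AWS credentials)."""
    import boto3  # lazy import: the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_database(DatabaseInput={"Name": database})
    glue.create_table(DatabaseName=database, TableInput=table_input)
```

Only metadata is written here; the CSV itself stays where it is in S3.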
(2) Creating a table using crawlers
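The crawler route can be sketched with boto3 as well. The crawler name, role, database, and S3 path are hypothetical placeholders; the crawler infers the schema from the files under the path and writes the resulting table into the given database.

```python
def crawler_params(name: str, role: str, database: str, s3_path: str) -> dict:
    """Build the arguments for glue.create_crawler with a single S3 target."""
    return {
        "Name": name,
        "Role": role,  # IAM role name or ARN that the crawler assumes
        "DatabaseName": database,  # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_run_crawler(params: dict) -> None:
    """Create the crawler and start one run (needs AWS credentials)."""
    import boto3  # lazy import: the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**params)
    glue.start_crawler(Name=params["Name"])
```

`start_crawler` returns immediately; the tables appear in the Data Catalog once the run finishes.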
Next we will run queries with Athena, but first we need to create an S3 folder where the query results will be stored.
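As a rough sketch of that step in code: Athena needs an output location in S3 for every query, which is why the results folder has to exist first. The SQL text, database name, and results path are assumptions for illustration.

```python
import time


def athena_query_params(sql: str, database: str, output_location: str) -> dict:
    """Build the arguments for athena.start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # S3 folder where Athena writes the result files.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql: str, database: str, output_location: str, poll_seconds: float = 1.0):
    """Start a query and wait until it reaches a final state (needs AWS credentials)."""
    import boto3  # lazy import: the builder above works without boto3

    athena = boto3.client("athena")
    params = athena_query_params(sql, database, output_location)
    query_id = athena.start_query_execution(**params)["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)
```

A call might look like `run_query("SELECT * FROM orders LIMIT 10", "practice_db", "s3://example-bucket/athena-results/")`, with all three values being placeholders.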
AWS Glue Connections
A Data Catalog object that contains the properties that are required to connect to a particular data store.
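For illustration, a JDBC connection could be registered roughly as follows. The connection name, JDBC URL, and credentials are hypothetical; in practice the password would come from a secret store rather than source code.

```python
def jdbc_connection_input(name: str, jdbc_url: str, username: str, password: str) -> dict:
    """Build a Glue ConnectionInput for a JDBC data store."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,  # placeholder only; avoid hard-coding real secrets
        },
    }


def create_connection(connection_input: dict) -> None:
    """Store the connection in the Data Catalog (needs AWS credentials)."""
    import boto3  # lazy import: the builder above works without boto3

    boto3.client("glue").create_connection(ConnectionInput=connection_input)
```

Data stores inside a VPC typically also need subnet and security group details, which this sketch leaves out.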
AWS Glue Jobs
The business logic that is required to perform ETL work. It is composed of a transformation script, data sources, and data targets. Job runs are initiated by triggers that can be scheduled or triggered by events.
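A job definition ties a script in S3 to a role and compute settings. The sketch below shows one way to define and start a Spark ETL job with boto3. The job name, role, and script location are placeholders, and the Glue version and worker settings are illustrative assumptions rather than recommendations.

```python
def spark_job_params(name: str, role: str, script_location: str) -> dict:
    """Build the arguments for glue.create_job for a Spark ETL job."""
    return {
        "Name": name,
        "Role": role,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,  # S3 path of the transformation script
            "PythonVersion": "3",
        },
        # Illustrative sizing; adjust to the workload and the region's offerings.
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,
    }


def create_and_start_job(params: dict) -> str:
    """Create the job and start one run; returns the run id (needs AWS credentials)."""
    import boto3  # lazy import: the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_job(**params)
    return glue.start_job_run(JobName=params["Name"])["JobRunId"]
```

The data sources and targets themselves are named inside the script, not in the job definition.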
AWS Glue Triggers
Initiates an ETL job. Triggers can be defined based on a scheduled time or an event.
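Both kinds of trigger can be sketched with boto3's `create_trigger`. Here the event-based case is shown as a conditional trigger that fires when another job succeeds. Trigger and job names and the cron expression are hypothetical examples.

```python
def scheduled_trigger_params(name: str, job_name: str, cron: str) -> dict:
    """Trigger that starts job_name on a schedule, e.g. 'cron(0 2 * * ? *)'."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def conditional_trigger_params(name: str, job_name: str, upstream_job: str) -> dict:
    """Trigger that starts job_name after upstream_job finishes successfully."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def create_trigger(params: dict) -> None:
    """Register the trigger (needs AWS credentials)."""
    import boto3  # lazy import: the builders above work without boto3

    boto3.client("glue").create_trigger(**params)
```

For example, `scheduled_trigger_params("nightly", "practice-job", "cron(0 2 * * ? *)")` would describe a daily run at 02:00 UTC.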
AWS Glue Dev Endpoints
A development endpoint is an environment that you can use to develop and test your AWS Glue scripts. It is essentially an abstracted cluster. NB: the cost can add up, since you are billed for as long as the endpoint is running.