Glue

What is Glue?

▶ Fully Managed ETL Service

▶ A central metadata repository: the Glue Data Catalog

▶ A Spark ETL engine

▶ Flexible Scheduler

Why Glue?

AWS Glue offers a fully managed, serverless ETL tool. This removes the overhead (extra cost, time, and resources) and the barriers to entry when an ETL service is needed in AWS.

Set Up

▶ S3 bucket

Create a bucket of your choice and upload a practice CSV file to a path in it.

▶ IAM role

Create role > Glue > create it with Admin (full access) for now.
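
A narrower alternative to an admin role can be sketched as follows. This is only a sketch under assumptions: the role name `glue-practice-role` is a made-up placeholder, `AWSGlueServiceRole` is the AWS managed policy for Glue, and read/write access to your own practice bucket would still have to be granted separately. boto3 is imported only inside the function that actually calls AWS.

```python
import json

# Hypothetical role name, used for illustration only.
ROLE_NAME = "glue-practice-role"

# AWS managed policy with the baseline Glue permissions.
GLUE_SERVICE_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def glue_trust_policy() -> dict:
    """Trust policy that lets the Glue service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_glue_role(role_name: str = ROLE_NAME) -> None:
    """Create the role and attach the managed policy.

    Needs boto3 and AWS credentials, so it is not called here.
    """
    import boto3

    iam = boto3.client("iam")
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(glue_trust_policy()),
    )
    iam.attach_role_policy(RoleName=role_name, PolicyArn=GLUE_SERVICE_POLICY_ARN)
```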

AWS Glue Data Catalog

Persistent Metadata Store

▶ A managed service that lets you store, annotate, and share metadata, which can then be used to query and transform data

▶ One AWS Glue Data Catalog per AWS region

▶ Identity and Access Management (IAM) policies control access

▶ Can be used for data governance

AWS Glue Database

A set of associated Data Catalog table definitions organized into a logical group.

AWS Glue Tables

The metadata definition that represents your data. The data resides in its original store. This is just a representation of the schema.
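
To make "just a representation of the schema" concrete, here is a rough sketch of the metadata a CSV table definition carries when it is created through boto3's `create_table`. The table name, bucket path, and column names are placeholders invented for illustration; the SerDe and format class names are the usual Hive ones for delimited text, and the header-skip parameter assumes the file has a header row.

```python
def csv_table_input(name, location, columns, partition_keys=()):
    """Build a Glue TableInput dict for CSV files stored under `location`.

    `columns` and `partition_keys` are sequences of (name, type) pairs.
    Nothing is copied: the table only points at the data in S3.
    """
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv", "skip.header.line.count": "1"},
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
    }


def create_table(database, table_input):
    """Register the table in the Data Catalog (needs boto3 and credentials)."""
    import boto3

    boto3.client("glue").create_table(DatabaseName=database, TableInput=table_input)


# Placeholder names; nothing is sent to AWS by building the dict.
table = csv_table_input(
    "customers",
    "s3://my-practice-bucket/customers/",
    [("customer_id", "bigint"), ("country", "string")],
)
```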

Partitions in AWS

Folders where data is stored on S3 (physical entities) are mapped to partitions (logical entities), which appear as columns in the Glue table.
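
The folder-to-column mapping can be illustrated with a small sketch. It assumes Hive-style `name=value` folder names and a made-up object key; when folders are not named that way, crawlers fall back to generic column names such as `partition_0`.

```python
def partition_values(s3_key: str) -> dict:
    """Extract Hive-style `name=value` folder segments from an S3 object key."""
    values = {}
    for segment in s3_key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name:
            values[name] = value
    return values


# Hypothetical key: two folders become two partition columns.
print(partition_values("sales/year=2023/month=01/data.csv"))
# {'year': '2023', 'month': '01'}
```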

AWS Crawlers

A program that connects to a data store (source or target), progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in the AWS Glue Data Catalog.
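
For reference, a crawler over an S3 path can also be defined from code rather than the console. The sketch below uses boto3's `create_crawler` and `start_crawler`; the crawler name, role, database, and bucket path are all placeholders, and the database is assumed to exist already.

```python
def crawler_request(name, role, database, s3_path):
    """Build the keyword arguments for glue.create_crawler."""
    return {
        "Name": name,
        "Role": role,              # IAM role the crawler runs as
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request):
    """Create the crawler and run it once (needs boto3 and credentials)."""
    import boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Placeholder values for illustration only.
request = crawler_request(
    "practice-crawler",
    "glue-practice-role",
    "practice_db",
    "s3://my-practice-bucket/customers/",
)
```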

(1) Creating a table

(2) Creating a table using crawlers

Next, we will query the table with Athena, but first we need to create an S3 folder where the query results will be stored.
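
The reason for the results folder shows up in the API: a query submitted through boto3's `start_query_execution` names an S3 output location in its result configuration. A minimal sketch, with the database, table, and bucket names as placeholders:

```python
def athena_query_request(sql, database, output_location):
    """Build the keyword arguments for athena.start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes the result files under this S3 prefix.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query and return its execution id (needs boto3 and credentials)."""
    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Placeholder names for illustration only.
query = athena_query_request(
    "SELECT * FROM customers LIMIT 10",
    "practice_db",
    "s3://my-practice-bucket/athena-results/",
)
```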

AWS Glue Connections

A Data Catalog object that contains the properties that are required to connect to a particular data store.

This requires configuring a database instance, which means creating an RDS database separately. Skipping this for now because I am worried about the charges.

AWS Glue Jobs

The business logic that is required to perform ETL work. It is composed of a transformation script, data sources, and data targets. Job runs are initiated by triggers that can be scheduled or triggered by events.

There are several options; for now, start with "Visual with a blank canvas".

Then click the mapping icon, press +, and create Target > S3.

With customer id as the partition key, the output was split into 847 partitions. Yikes.
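
A quick way to avoid this surprise is to count distinct values per candidate column before choosing a partition key, since each distinct value becomes its own partition folder. The sketch below runs on a tiny made-up sample; only the customer id column comes from these notes, and `country` and `amount` are invented for illustration.

```python
import csv
import io

# Made-up sample standing in for the practice CSV.
SAMPLE = """customer_id,country,amount
1,KR,10
2,KR,20
3,US,30
4,US,40
"""


def distinct_counts(csv_text, columns):
    """Count distinct values per column; each one would become a partition."""
    seen = {column: set() for column in columns}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column in columns:
            seen[column].add(row[column])
    return {column: len(values) for column, values in seen.items()}


print(distinct_counts(SAMPLE, ["customer_id", "country"]))
# {'customer_id': 4, 'country': 2}
```

A column with few distinct values (here `country`) keeps the number of output folders small, whereas an id-like column produces roughly one partition per row.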

AWS Glue Triggers

Initiates an ETL job. Triggers can be defined based on a scheduled time or an event.
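
As an illustration of the scheduled case, the sketch below builds a request for boto3's `create_trigger`. The trigger and job names are placeholders, and the schedule uses Glue's six-field `cron(...)` syntax, which is evaluated in UTC.

```python
def scheduled_trigger_request(name, job_name, cron_fields):
    """Build the keyword arguments for glue.create_trigger (scheduled type)."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_fields})",
        "Actions": [{"JobName": job_name}],  # job to start when the trigger fires
        "StartOnCreation": True,             # activate the trigger immediately
    }


def create_trigger(request):
    """Register the trigger (needs boto3 and credentials)."""
    import boto3

    boto3.client("glue").create_trigger(**request)


# Placeholder names; runs the job every day at 12:00 UTC.
trigger = scheduled_trigger_request("daily-noon", "practice-job", "0 12 * * ? *")
```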

AWS Glue Dev Endpoints

A development endpoint is an environment that you can use to develop and test your AWS Glue scripts. It is essentially an abstracted cluster. NB: the cost can add up.
