Crawler S3 with LakeFormation and CloudFormation
- Author: Ignacio G. Betegon
1. DataLake Overview
AWS Lake Formation is a service for governing access to data beyond what IAM roles offer. It lets you grant permissions directly on data resources, which allows more fine-grained access control, and it adds an interface that helps with data governance. Lake Formation has several layers of permissions: data lake settings (create database / grantable permissions), admin users, data lake permissions, and data locations.
2. YAML templates and frontend pics
Note that all the YAML excerpts belong in the same file; only the first one contains the template version and the input parameters. They are split here for didactic reasons.
AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  CrawlerAppsRoleARN: # Your crawler role
    Type: String
  AppsMetricsDatabaseName: # Database to load your crawler results
    Type: String
  S3BucketArn: # Source bucket
    Type: String
For some reason, the source location (here, the S3 bucket passed as S3BucketArn) has to be registered as a data lake resource; otherwise an error is raised.
Resources:
  DataLakeCrawlerResource:
    Type: AWS::LakeFormation::Resource
    Properties:
      ResourceArn: !Ref S3BucketArn
      RoleArn: !Ref CrawlerAppsRoleARN
      UseServiceLinkedRole: true
Granting admin permissions to the crawler role and the deployment account role. Note that the database creation permissions must be set manually. (I could not figure out how to do it with CloudFormation, and the documentation states: "It does not support updating the CreateDatabaseDefaultPermissions or CreateTableDefaultPermissions. Those permissions can only be edited in the DataLakeSettings resource via the API.")
  AdminDataLakeSettings:
    Type: AWS::LakeFormation::DataLakeSettings
    Properties:
      Admins:
        - DataLakePrincipalIdentifier: !Ref CrawlerAppsRoleARN
        - DataLakePrincipalIdentifier: !Sub <Deployment Account ARN> # You will need this if deploying automatically
        - DataLakePrincipalIdentifier: !Sub <User Account ARN> # Account used to query with Athena
      TrustedResourceOwners:
        - !Ref CrawlerAppsRoleARN
        - !Sub <Deployment Account ARN> # You will need this if deploying automatically
        - !Sub <User Account ARN> # Account used to query with Athena
Granting data lake permissions on the data. Permissions are needed on both the database and its tables.
  CrawlerLakeFormationPermissions:
    Type: AWS::LakeFormation::Permissions
    Properties:
      DataLakePrincipal:
        DataLakePrincipalIdentifier: !Ref CrawlerAppsRoleARN
      Permissions:
        - "ALL"
      Resource:
        TableResource:
          DatabaseName: !Ref AppsMetricsDatabaseName
          TableWildcard: {}
        DatabaseResource:
          CatalogId: !Ref "AWS::AccountId"
          Name: !Ref AppsMetricsDatabaseName
    DependsOn: AdminDataLakeSettings
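The grant above names the table wildcard and the database inside a single resource. As a sketch of an alternative layout, the two grants can also be declared as separate resources, so that each permission set can be changed independently. The resource names CrawlerDatabasePermissions and CrawlerTablePermissions are hypothetical; the parameters are the ones declared at the top of the template. This is an illustration under those assumptions, not a replacement for the resource above.

```yaml
Resources:
  # Hypothetical name: grant on the database itself
  CrawlerDatabasePermissions:
    Type: AWS::LakeFormation::Permissions
    Properties:
      DataLakePrincipal:
        DataLakePrincipalIdentifier: !Ref CrawlerAppsRoleARN
      Permissions:
        - "ALL"
      Resource:
        DatabaseResource:
          CatalogId: !Ref "AWS::AccountId"
          Name: !Ref AppsMetricsDatabaseName
    DependsOn: AdminDataLakeSettings

  # Hypothetical name: grant on every table in that database
  CrawlerTablePermissions:
    Type: AWS::LakeFormation::Permissions
    Properties:
      DataLakePrincipal:
        DataLakePrincipalIdentifier: !Ref CrawlerAppsRoleARN
      Permissions:
        - "ALL"
      Resource:
        TableResource:
          CatalogId: !Ref "AWS::AccountId"
          DatabaseName: !Ref AppsMetricsDatabaseName
          TableWildcard: {}
    DependsOn: AdminDataLakeSettings
```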
UI pic:
Granting S3 data location permissions.
  S3CrawlerLakeFormationPermissions:
    Type: AWS::LakeFormation::Permissions
    Properties:
      DataLakePrincipal:
        DataLakePrincipalIdentifier: !Ref CrawlerAppsRoleARN
      Permissions:
        - "DATA_LOCATION_ACCESS"
      Resource:
        DataLocationResource:
          S3Resource: !Ref S3BucketArn
    DependsOn: DataLakeCrawlerResource
3. Difficulties
3.1. If using nested stacks, be careful during the rollback of your stacks. If you add data to a table or S3 bucket, or change some permissions manually, the rollback might get stuck, and you will have to recreate the resources manually to be able to roll back the stack.
3.2. The template for database permissions and for S3 location permissions is the same resource type, but they are different things and accept different permissions. This forces you to create two resources of the same type, which is counterintuitive.
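3.3. A String is not an ARN. CloudFormation parameters have no ARN type, so an ARN passed as a String parameter is not checked for format unless you constrain it yourself.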
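One way to soften this is to constrain the String parameters with AllowedPattern, so that a malformed ARN is rejected when the stack is submitted. The sketch below reuses the parameter names from the template above; the regular expressions are assumptions (standard `aws` partition, an IAM role ARN, and a bucket-level S3 ARN without a prefix) and would need adjusting for other partitions or ARN shapes.

```yaml
Parameters:
  CrawlerAppsRoleARN: # Your crawler role
    Type: String
    # Assumed shape: arn:aws:iam::<12-digit account id>:role/<role name or path>
    AllowedPattern: '^arn:aws:iam::[0-9]{12}:role/.+$'
    ConstraintDescription: Must be an IAM role ARN.
  S3BucketArn: # Source bucket
    Type: String
    # Assumed shape: arn:aws:s3:::<bucket name>, with no key prefix
    AllowedPattern: '^arn:aws:s3:::[a-z0-9.-]{3,63}$'
    ConstraintDescription: Must be an S3 bucket ARN.
```

This only checks the format of the string; it does not verify that the role or bucket actually exists.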
3.4. DataLakePrincipalIdentifier only accepts specific ARNs, so it is difficult to add permissions to dynamic roles automatically.
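A partial workaround, assuming the role can be declared in the same template as the grant, is to let CloudFormation resolve the concrete ARN with !GetAtt. The resource names CrawlerRole and CrawlerRoleDatabasePermissions are hypothetical, and the sketch does not help with roles created outside the stack.

```yaml
Resources:
  # Hypothetical role declared in the same template as the grant
  CrawlerRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: glue.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

  CrawlerRoleDatabasePermissions:
    Type: AWS::LakeFormation::Permissions
    Properties:
      DataLakePrincipal:
        # Resolved at deploy time to the concrete ARN of the role above
        DataLakePrincipalIdentifier: !GetAtt CrawlerRole.Arn
      Permissions:
        - "ALL"
      Resource:
        DatabaseResource:
          CatalogId: !Ref "AWS::AccountId"
          Name: !Ref AppsMetricsDatabaseName
```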
4. Who should use Lake Formation?
In short, Lake Formation bootstraps your account and offers an extra layer of governance over your data. This adds more control and accessibility (both permissions and reachability), but it also adds complexity when handling data, since access needs Lake Formation approval. This becomes especially tedious when the resources are deployed automatically and/or through nested stacks, as described in 3.1. That is why I recommend adding Lake Formation only if data handling is totally separated from the operations team. In that case, Lake Formation provides an easier interface for interacting with the data.