Crawler S3 with LakeFormation and CloudFormation

1. DataLake Overview

AWS Lake Formation is a service for governing data beyond what IAM roles allow. You can grant permissions directly on data resources, which gives more fine-grained access control, and it adds an interface that helps with data governance. Lake Formation has several layers of permissions: Data Lake Settings (database creation and grantable permissions), admin users, data lake permissions, and data locations.

2. YAML snippets and UI pictures

Note that all the YAML extracts belong in the same file, and only the first one contains the template version and the input parameters. We split them here for didactic reasons.

AWSTemplateFormatVersion: 2010-09-09

Parameters:
  CrawlerAppsRoleARN:  # Your crawler role
    Type: String

  AppsMetricsDatabaseName:  # Database to load your crawler results
    Type: String

  S3BucketArn:  # Source bucket
    Type: String

For some reason, we need to register the source location (the S3 bucket in the snippet below) as a data lake resource; otherwise the deployment fails with an error.

Resources:

  DataLakeCrawlerResource:
    Type: AWS::LakeFormation::Resource
    Properties:
      ResourceArn: !Ref S3BucketArn
      RoleArn: !Ref CrawlerAppsRoleARN
      UseServiceLinkedRole: true

Granting admin permissions to the crawler role and the deployment account role. Note that the database creation permissions must be set manually. I could not figure out how to do it with CloudFormation; the documentation for this resource states: "It does not support updating the CreateDatabaseDefaultPermissions or CreateTableDefaultPermissions. Those permissions can only be edited in the DataLakeSettings resource via the API."

  AdminDataLakeSettings:
    Type: AWS::LakeFormation::DataLakeSettings
    Properties:
      Admins:
        - DataLakePrincipalIdentifier: !Ref CrawlerAppsRoleARN
        - DataLakePrincipalIdentifier: !Sub <Deployment Account ARN>  # You will need this if deploying automatically
        - DataLakePrincipalIdentifier: !Sub <User Account ARN>  # Account used to query with Athena
      TrustedResourceOwners:
        - !Ref CrawlerAppsRoleARN
        - !Sub <Deployment Account ARN>  # You will need this if deploying automatically
        - !Sub <User Account ARN>  # Account used to query with Athena

Granting data lake permissions on the data. The grant is necessary on both the database and its tables.

  CrawlerLakeFormationPermissions:
    Type: AWS::LakeFormation::Permissions
    Properties:
      DataLakePrincipal:
        DataLakePrincipalIdentifier:
          !Ref CrawlerAppsRoleARN
      Permissions:
        - "ALL"
      Resource:
        TableResource:
          DatabaseName: !Ref AppsMetricsDatabaseName
          TableWildcard: {}
        DatabaseResource:
          CatalogId: !Ref "AWS::AccountId"
          Name: !Ref AppsMetricsDatabaseName
    DependsOn: AdminDataLakeSettings

[UI screenshot]

Granting S3 data location permissions.

  S3CrawlerLakeFormationPermissions:
    Type: AWS::LakeFormation::Permissions
    Properties:
      DataLakePrincipal:
        DataLakePrincipalIdentifier:
          !Ref CrawlerAppsRoleARN
      Permissions:
        - "DATA_LOCATION_ACCESS"
      Resource:
        DataLocationResource:
          S3Resource: !Ref S3BucketArn
    DependsOn: DataLakeCrawlerResource
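
With both grants in place, the crawler itself can be declared in the same template. The following is a minimal sketch and not part of the original template: the resource name AppsMetricsCrawler is hypothetical, and it assumes that the database named by AppsMetricsDatabaseName already exists and that S3BucketArn has the form arn:aws:s3:::bucket-name.

  AppsMetricsCrawler:  # Hypothetical name, chosen for illustration
    Type: AWS::Glue::Crawler
    Properties:
      Role: !Ref CrawlerAppsRoleARN
      DatabaseName: !Ref AppsMetricsDatabaseName
      Targets:
        S3Targets:
          # S3 targets take an s3:// path, so the bucket name (6th field of the ARN) is extracted
          - Path: !Join ["", ["s3://", !Select [5, !Split [":", !Ref S3BucketArn]], "/"]]
    DependsOn:  # Create the crawler only after both Lake Formation grants exist
      - CrawlerLakeFormationPermissions
      - S3CrawlerLakeFormationPermissions

The explicit DependsOn is a design choice: none of the crawler properties reference the permission resources, so CloudFormation would otherwise be free to create the crawler before the grants.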

3. Difficulties

3.1. If you use nested stacks, be careful during stack rollbacks. If you add data to a table or S3 bucket, or change some permissions manually, the rollback might get stuck, and you will have to recreate the resources manually before you can roll back the stack.

3.2. The template for database permissions and for S3 location permissions is the same, but they are different things and accept different permissions. This forces you to create two resources of the same type, which is counterintuitive.

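3.3. A String is not an ARN. CloudFormation has no ARN parameter type, so ARNs are passed as plain String parameters, and a malformed value only fails at deployment time.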

3.4. DataLakePrincipalIdentifier only accepts specific ARNs, so it is difficult to add permissions to dynamically created roles automatically.
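
For 3.3, one partial mitigation is to constrain the String parameters with AllowedPattern, so that a value that does not look like an ARN is rejected before any resource is created. This is a minimal sketch that would replace the parameter declarations at the top of the template. The regular expressions are assumptions for illustration and only check the general shape of the value, not that the role or bucket exists.

Parameters:
  CrawlerAppsRoleARN:  # Your crawler role
    Type: String
    # Assumed pattern: an IAM role ARN in any AWS partition
    AllowedPattern: '^arn:aws[a-z-]*:iam::[0-9]{12}:role/.+$'
    ConstraintDescription: Must be an IAM role ARN.

  S3BucketArn:  # Source bucket
    Type: String
    # Assumed pattern: a bucket ARN without a key suffix
    AllowedPattern: '^arn:aws[a-z-]*:s3:::[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'
    ConstraintDescription: Must be an S3 bucket ARN.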

4. Who should use Lake Formation?

In short, Lake Formation bootstraps your account and offers an extra layer of governance over your data. This adds more control and accessibility (both permissions and reachability). It also adds complexity, since handling data now requires Lake Formation approvals. This becomes especially tedious when the resources are deployed automatically and/or through nested stacks, as described in 3.1. That is why I recommend adding Lake Formation only if data handling is totally separated from the operations team. In that case, Lake Formation provides an easier interface for interacting with the data.