
CloudFormation - NodeGroup failed to stabilize: Internal Failure

A recent change to the behaviour of AWS managed node groups means that some CloudFormation stacks which create EKS NodeGroups may start to fail with the error "Nodegroup the-nodegroup-name failed to stabilize: Internal Failure". Googling the error currently doesn't return much. The problem comes from a change to whether or not public IPs are assigned to nodes.

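Prior to April 22nd, managed node groups always assigned public IPs to nodes, irrespective of the value of MapPublicIpOnLaunch on the associated subnet. Going forward, public IPs will only be assigned if MapPublicIpOnLaunch is true on the associated subnet. Nodes that come up without a public IP in a subnet that relied on one for outbound access presumably cannot reach the cluster to register, so the NodeGroup never stabilizes.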

So if creating the subnet via CloudFormation, we previously had:

Subnet1a:
  Type: AWS::EC2::Subnet
  Properties:
    VpcId: !Ref VPC
    AvailabilityZone:
      Fn::Sub: '${Region}a'
    CidrBlock: 172.16.0.0/18

We now need to add the final line, MapPublicIpOnLaunch: true, as below:

Subnet1a:
  Type: AWS::EC2::Subnet
  Properties:
    VpcId: !Ref VPC
    AvailabilityZone:
      Fn::Sub: '${Region}a'
    CidrBlock: 172.16.0.0/18
    MapPublicIpOnLaunch: true

This is what allows our existing configuration to continue working. More info is in the AWS post.

The trick to debugging turned out to be setting disable_rollback to true (if using Ansible to manage CloudFormation), so that the NodeGroup wasn't deleted on failure. That made it possible to go in and inspect the failed NodeGroup.
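For reference, a minimal sketch of what that might look like as an Ansible task using the cloudformation module. The stack name, region and template path here are placeholders for illustration, not the values from our actual playbook:

```yaml
# Hypothetical task: stack_name, region and template are placeholders.
- name: Create the EKS stack, keeping failed resources around for inspection
  amazon.aws.cloudformation:
    stack_name: my-eks-stack
    state: present
    region: eu-west-1
    template: templates/eks.yaml
    # Leave the failed NodeGroup in place instead of rolling back,
    # so it can be inspected in the console or via the CLI.
    disable_rollback: true
```

With rollback disabled, a failed stack is left behind in a failed state, so it will most likely need to be deleted by hand before the next attempt. It is probably worth setting disable_rollback back to false once the problem has been found.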