Category Archives: Google Cloud

GCP IPSec VPN to on-prem pfSense for Internet egress

Overview

Several different components are involved:

Google Cloud VPN (basic, route-based)
Google Cloud VPC with one or more subnets
pfSense Community Edition (Virtual Machine)
Google Cloud Compute Engine (GCE) (Virtual Machine for testing)
One or more internet providers for the on-prem environment.

Essentially, Google Cloud VPN will establish an IPSec tunnel to the WAN/external interface of pfSense. A static route of 0.0.0.0/0 will be created in the Google Cloud VPC to direct all egress traffic (such as from a GCE instance) to the VPN tunnel. pfSense will have its own VPN configuration to establish the on-prem end of the tunnel, as well as some firewall and NAT rules to properly pass GCP traffic to the Internet.

On-prem

In my lab, I have a pfSense virtual machine (running on vSphere 7) with a virtual NIC connected to my LAN. My LAN uses a cable provider for Internet. I also have a Wireless 5G Internet provider which I have connected directly to the pfSense virtual machine using a USB Ethernet adapter in USB passthrough mode directly to the virtual machine.

The pfSense WAN interface is set to the USB adapter and has a public IP address from the Wireless 5G provider. It also has the LAN interface with a static IP from the 192.168.86.0/23 subnet of my internal network.

I do not use pfSense as my main router, it will only be used as a VPN Gateway for connecting to GCP.

GCP

In GCP, I have a single VPC with a single subnet: 10.252.1.0/24. I have several VPC firewall rules created:

permit Identity-Aware Proxy so that I can use browser-based SSH to connect to GCE instances that have private IPs
permit all traffic from my on-prem subnet 192.168.86.0/23.

There is a test GCE instance on the subnet at 10.252.1.30.

A Classic Cloud VPN was created with a public IP and IKEv2 tunnel with pre-shared key. This is a route-based tunnel (not BGP) and the remote network range was set to 0.0.0.0/0. This tells GCP that all IP addresses (the Internet) can be reached through the VPN tunnel. This automatically creates a route for 0.0.0.0/0 in the VPC with next hop of the VPN tunnel. The VPC’s default route for 0.0.0.0/0 with next hop of “default internet gateway” should be deleted.

Note that GCP Classic Cloud VPN configurations are not recommended. Instead, a Cloud VPN HA configuration with BGP should be used. pfSense is compatible with Cloud VPN HA and BGP. See Site to Site VPN between Google Cloud and pfSense on VMware at home for more details including how to install and configure pfSense from scratch as well as use pfSense with BGP.

pfSense configuration

From the VPN menu, choose IPsec. Create a new Phase1 (P1). Leave the Interface as WAN. In the Remote Gateway field, enter the public IP of the GCP Cloud VPN gateway. Provide the pre-shared key that was used to create the GCP Cloud VPN tunnel.

Set Encryption Algorithm to AES256-GCM and Life Time to 36000.

Save it and then Add a Phase2 (P2). Set Mode to Tunnel IPv4. Set Local Network to Network with address 0.0.0.0/0. This tells GCP that this tunnel accepts traffic destined for 0.0.0.0/0. Set the Remote Network to Network and put in the remote (GCP) subnet. In my case, I set it to 10.252.1.0/24.

Check the box for AES256-GCM and set it to 128 bits. Save it.

From the Status drop-down, choose IPsec. The tunnel should come up shortly. GCP should also show the tunnel as up.

From the Firewall menu, choose Rules and then IPsec. Add a new rule for IPv4 for any source/port to any destination/port, any protocol.

In my LAN I added a static route in my LAN router for 10.252.1.0/24 with next-hop of my pfSense LAN IP. Then I was able to ping the GCE instance from my computer. The GCE instance could also ping IPs in my 192.168.86.0/23 network through pfSense.

At this point, communication should be working between GCP and on-prem networks, but there is one more step to enable Internet egress.

From the Firewall menu, choose NAT and then Outbound. Set Outbound NAT mode to Hybrid (automatic + custom rules). Create a new Mapping. Set Interface to WAN and Source to 10.252.1.0/24. Save and apply the changes.

Now, the GCE instance can reach the Internet through pfSense. To verify, run curl ifconfig.me and verify that the public IP address returned is the one assigned to the pfSense WAN interface.

Recall that I said I had two internet providers: Cable and Wireless 5G. With pfSense, I can easily toggle which provider is used by GCP for Internet access. From the System menu, choose Routing. The Gateways will be listed. By default, only the WAN_DHCP gateway was listed with the public IP for the Wireless 5G provider. I added a new gateway using the LAN interface and my LAN default gateway of 192.168.86.1.

When I change the default gateway to my LAN gateway and apply the changes, my GCE instance now accesses the internet through pfSense’s LAN interface using the cable internet provider instead of the WAN interface Wireless 5G. Technically at this point the Outbound NAT mapping created earlier is not required, but it doesn’t hurt to leave it in place. Toggling the pfSense default gateway back to WAN and applying the change causes the GCE instance to again access the internet using the Wireless 5G connection.

pfSense is a great environment to experiment with site-to-site VPNs, NAT, and routing, and is pretty robust and capable of accomplishing a lot of different lab activities.

Automating Compute Engine Local SSD configuration in Windows

Overview

Similar to other public cloud providers, Google Cloud offers the ability to attach local high-speed disks to virtual machines (VMs). This offering is called Local SSD.

The “Local” in Local SSD means that the disks are physically attached to the hypervisor host where the VM is executing. This provides the VM with high-speed/low-latency data operations over the NVME interface.

On Windows GCE instances (VMs), Local SSD volumes are typically used for temporary or ephemeral data such as the Windows Pagefile, Microsoft SQL Server Temp database, or other high-speed caching needs.

Performance

Certain GCE instance machine sizes come with different quantities of Local SSD disks, which are always 375 GB in size. They can be striped together to create larger volumes with higher performance using logical volume management tools in the Operating System.

Risks

However, again like other public clouds, Local SSD storage is ephemeral and comes with some risks.

Data on volumes created with Local SSD disks can be irretrievably lost and no guarantee is made as to the safety or durability of that data.

Specifically, if the OS is shut down from within the Guest Operating System, the GCE instance will power off and the Local SSD disks and the data they contain will be lost. When the VM is started back up via the Cloud Console, it will have new, empty Local SSD disks which must be initialized again in order to be used.

There are other situations where data on Local SSD disks can be lost, such as if the hypervisor host has an issue and Google Cloud cannot migrate the Local SSD data along with the VM to a new host.

Read more about Local SSD data persistence.

So, generally speaking, Local SSDs offer great performance but the usage of them needs to be understood and the loss of data should not impact the application.

Usage with localssd_init

In Google Cloud, Local SSD disks show up as raw disks that need to be initialized and formatted before they can be used.

To automate the initialization of Local SSDs on Windows GCE instances and create usable volumes, I have developed a simple PowerShell script called localssd_init.ps1 which can be found in my GitHub repository gce-localssd-init-powershell.

The script is configured to check if certain drive letters are present. If the drive letters are missing, it will recreate them using a specific configuration of Local SSD disks.

To use the script, simply copy it to your server and then sign it or configure the ExecutionPolicy.

Next, edit the localssd_init.ps1 script and configure the volumes that will be created using Local SSD disks, starting around line 50:

$LocalSSDConfig = @(
    [LocalSSDVol]@{Name = "SQLTempDB"; DriveLetter = 'E'; LocalSSDQty = 2; NTFSAlloc = '65536' },
    [LocalSSDVol]@{Name = "Pagefile"; DriveLetter = 'P'; LocalSSDQty = 2; NTFSAlloc = '8192'; PostScript = "C:\path\to\pagefile.ps1" }
)

Add, remove, or modify entries in the array. Be sure to have a comma after each line, except for the last line, and separate the key=value pairs with a semicolon.

For each entry,

Update the Name and DriveLetter where the volume will be mounted. The name will be set for the Storage Pool and Volume.
Set the LocalSSDQty to the number of 375 GB Local SSD disks to stripe across and set the NTFSAlloc to the needed NTFS Allocation unit size.

Optionally, set PostScript to the path to a custom script to run after the volume has been created. The custom script could change the system Pagefile configuration to use the new drive (a reboot will be be required) or restart SQL Server so it recreates the Temp database files.

The script uses Windows Storage Spaces to create a new Storage Pool using the necessary quantity of Local SSD volumes using “Simple” resiliency which is the same as Striping and then creates a new Volume using the maximum size of the pool and formats it using NTFS with the required allocation size and mounts it at the required drive letter.

During server creation, the localssd_init.ps1 script can be used to create the Local SSD-backed volumes and then run at every subsequent boot to make sure they exist and recreate them if they are missing.

Simply run the script and the disks will be created. The script will log to the standard output and also create entries in the System Event Log.

To enable the script to run at every boot, use this snippet to configure a new Windows Task Manager job to run at startup. Be sure to update the path to the script:

  Set-ExecutionPolicy RemoteSigned
  $path="C:\path\to\localssd_init.ps1"
  $trigger = new-jobtrigger -atstartup -randomdelay 00:00:10
  register-scheduledjob -trigger $trigger -filepath $path -name localssd_init
  get-scheduledjob

Conclusion

If you are on the fence about using Local SSD on Windows GCE instances because those disks might need to be initialized again after an outage, check out localssd_init.ps1 to make sure they’re always available for use.

Configuring Private Google Access

2 Replies

Private Google Access (PGA) is the method by which resources in Google Cloud such as Google Cloud Compute Engine (GCE), Google Cloud VMware Engine (GCVE) or on-premise, can access Google Cloud APIs without having Internet access or needing an assigned external IP address.

For on-premise clients, this connectivity can take place through private hybrid connections such as Interconnect which may be faster than the subscribed Internet service.

For an overview, check out https://cloud.google.com/vpc/docs/private-google-access

Ultimately, PGA is delivered through a set of public IP addresses which are not advertised on the Internet but are available via the Google Cloud VPC in your project.

199.36.153.4/30 for restricted.googleapis.com is used for services that are supported by VPC Service Controls and blocks access to services that do not support VPC service controls.
199.36.153.8/30 for private.googleapis.com is used for most services (except Workspace web or interactive websites) regardless of their support for VPC Service Controls.

In this post, we’ll cover working with private.googleapis.com which supports most APIs.

To enable clients to use PGA to access common APIs, two things need to be configured:

Routing: Clients need to have a network route to reach these IP addresses.
DNS: Clients need to resolve the public API fully-qualified domain name (FQDN, such as storage.googleapis.com) to one of the 4 IP addresses in the 199.36.153.8/30 range: 199.36.153.8, .9, .10, and .11.

Routing

In order to access the 199.36.153.8/30 range for private.googleapis.com, we need to first determine where our clients are located on the network and how they get to the Internet.

Google Compute Engine

Compute Engine clients on a VPC must meet the requirements to use PGA. Two of these requirements are the VM must not have an external public IP address and it must use a subnet which has PGA enabled. Enabling PGA is done per-subnet, not per-VPC. Clients will access the 199.36.153.8/30 range through the VPC’s standard default internet gateway route.

If the default internet gateway route has been overridden by a custom route that directs Internet-bound traffic (0.0.0.0/0) to a firewall appliance, then an additional custom route must be created for the 199.36.153.8/30 range with a next hop of default-internet-gateway and a priority that is higher than the custom route for 0.0.0.0/0.

This way, traffic for 199.36.153.8/30 will go out the default Internet gateway of the VPC. It doesn’t actually go to the Internet – instead the underlying Google Cloud network will accept that traffic and direct it to the private internal interfaces of Google APIs.

In the screenshot below, the routes are ordered from highest priority (lowest number) to lowest priority (highest number). The private-google-access route has a priority of 90 which is a higher priority than the route for 0.0.0.0/0 with a priority of 100 that leads to another network. Therefore, traffic for the PGA range 199.36.153.8/30 will go out the default Internet gateway of the VPC and not onward to the other network like all other Internet traffic.

Google Cloud VMware Engine

If Google Cloud VMware Engine (GCVE) is configured to use VMware Engine Internet access, then traffic for 199.36.153.8/30 will flow through the GCVE Internet access service.

If Internet access for GCVE workloads has been disabled and Internet-bound traffic is configured to use another network, then traffic for 199.36.153.8/30 will flow through the GCVE VPC peering connection to your peered VPC and route based on the routing configuration of your VPC.

The VPC in your project should be advertising a route for for 0.0.0.0/0 with a priority that overrides the default internet gateway route. Therefore, an additional custom route must be created for the 199.36.153.8/30 range with a next hop of default-internet-gateway and a priority that is higher than the custom route for 0.0.0.0/0.

See the previous screenshot for an example.

See Private Google Access: a closer look for more details about PGA for GCVE.

On-premise

Typically, on-premise clients trying to reach the 199.36.153.8/30 range for private.googleapis.com would take the default route out to the internet. But, since this range is not available on the internet, communication will fail.

Supporting Private Google Access with on-premise clients requires hybrid connectivity between the Google Cloud project VPC and the on-premise environment such as through Interconnect or Cloud VPN.

A route for 199.36.153.8/30 must be advertised by the Cloud Router that is associated with the hybrid connectivity. This way, clients on-premise will have a route to 199.36.153.8/30 that leads back over the hybrid connectivity, instead of out to the internet.

Additionally, if the VPC associated with the hybrid connectivity (Interconnect or Cloud VPN) has its default internet gateway route overridden by a custom route that directs Internet-bound traffic (0.0.0.0/0) to a firewall appliance, then an additional custom route must be created for the 199.36.153.8/30 range with a next hop of default-internet-gateway and a priority that is higher than the custom route for 0.0.0.0/0.

This way, on-premise clients trying to reach the 199.36.153.8/30 range will be directed over the hybrid connectivity and then out the VPC’s default internet gateway to access PGA.

Testing

To test connectivity for any client to the 199.36.153.8/30 range for private.googleapis.com once routing has been configured, use the Test-NetConnection PowerShell cmdlet with port 80 or 443 (only HTTP/S is supported, ICMP Ping is not). The TcpTestSucceeded value should return True if the routing was configured successfully:

Test-NetConnection -computername 199.36.153.8 -port 443


ComputerName     : 199.36.153.8
RemoteAddress    : 199.36.153.8
RemotePort       : 443
InterfaceAlias   : Ethernet
SourceAddress    : 10.210.32.3
TcpTestSucceeded : True

Verification of connectivity should be completed prior to making any DNS changes to avoid an outage for clients trying to reach Google APIs.

DNS

Once routing is in place and tested, changes in DNS must be made. These changes should be done on the DNS server used by clients.

To override the DNS resolution of a Google API fully-qualified domain name (FQDN), we need to create a new zone in DNS along with A- and CNAME-records that resolve the API FQDN to the private IPs. This way, clients will no longer use the publicly-advertised DNS zone and records which resolve to the public IP of the API and will instead use the private IP from the DNS server’s overriding zone.

If a client is configured to check its local hosts file first for DNS resolution prior to using the internal or public DNS, entries in the hosts file can be configured to point a public API FQDN to a PGA IP. This is a safe way to fully test PGA on a single host before affecting all clients by updating the common internal DNS server.

In this example, we override resolution of storage.googleapis.com and point it to one of the private.googleapis.com IP addresses, 199.36.153.8.

% cat /etc/hosts
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1       localhost
255.255.255.255 broadcasthost
::1             localhost
199.36.153.8  storage.googleapis.com

Any requests on this client to storage.googleapis.com will now go to 199.36.153.8 and should be successful as long as routing is in place.

To enable all clients which use a common DNS server to use PGA IPs instead of the public IP, create a DNS zone for googleapis.com in the private DNS server.

Within the new googleapis.com zone, create A-records for private.googleapis.com using the four different IP addresses in the 199.36.153.8/30 range: 199.36.153.8, 199.36.153.9, 199.36.153.10, and 199.36.153.11.

On-premise Active Directory clients

If clients are part of an Active Directory (AD) domain, AD DNS can be used to create the zone and A-records:

Be sure to create A-records for each of the IPs:

Next, create a wildcard CNAME-record for *.googleapis.com which points to the private.googleapis.com A records.

Now, when any client tries to resolve any googleapis.com hostname, it will be given one of the PGA IPs.

Compute Engine with Google Cloud DNS

If clients use Google Cloud DNS, such as Compute Engine instances, this same configuration can be made in Cloud DNS.

Create a new private zone for googleapis.com, and then create a record set for private.googleapis.com and add the four IP addresses, and then also create a record set for the wildcard CNAME *.googleapis.com.

Terraform can be used to create zones and record sets. Following is an example for setting up googleapis.com:

resource "google_dns_managed_zone" "dns-zone-googleapis-com" {
  name        = "googleapis-com"
  project     = google_project.svpc.project_id
  dns_name    = "googleapis.com."
  description = "Private Google Access"

  visibility = "private"

  private_visibility_config {
    networks {
      network_url = google_compute_network.shared_hub.id
    }
  }
}

resource "google_dns_record_set" "private-google-access-a" {
  project      = google_project.svpc.project_id
  name         = "private.${google_dns_managed_zone.dns-zone-googleapis-com.dns_name}"
  managed_zone = google_dns_managed_zone.dns-zone-googleapis-com.name
  type         = "A"
  ttl          = 300

  rrdatas = ["199.36.153.8", "199.36.153.9", "199.36.153.10", "199.36.153.11"]
}

resource "google_dns_record_set" "private-google-access-cname" {
  project      = google_project.svpc.project_id
  name         = "*.${google_dns_managed_zone.dns-zone-googleapis-com.dns_name}"
  managed_zone = google_dns_managed_zone.dns-zone-googleapis-com.name
  type         = "CNAME"
  ttl          = 300
  rrdatas      = [google_dns_record_set.private-google-access-a.name]
}

Google Cloud VMware Engine

Google Cloud VMware Engine clients will need to use a DNS server such as Active Directory or Google Cloud DNS (through a Cloud DNS inbound server policy) to resolve overriding domain names.

Additional API domains

Additional zones may need to be configured in the DNS server other than googleapis.com depending on what the client is trying to access. This could include gcr.io and others. See Domain Options for private.googleapis.com.

For example, to override the domain gcr.io, create A records for the domain itself and then a wildcard CNAME for *.gcr.io which points to the domain A records.

When using any Google Cloud service, be sure to review which API FQDNs are in use and determine if additional overriding zones need to be created in the DNS server.

For further implementation details about Private Google Access, see https://cloud.google.com/vpc/docs/configure-private-google-access

Working with Google Cloud Managed Instance Groups

2 Replies

Google Cloud Managed Instance Groups (MIGs) are groups of identical virtual machine instances that serve the same purpose.

Instances are created based on an Instance Template which defines the configuration that all instances will use including image, instance size, network, etc.

MIGs that host services are fronted by a load balancer which distributes client requests across the instances in the group.

MIG instances can also run batch processing applications which do not serve client requests and do not require a load balancer.

MIGs can be configured for autoscaling to increase the number of VM instances in the group based on CPU load or demand.

They can also auto-heal by replacing failed instances. Health checks are used to make sure each instance is responding correctly.

MIGs should be Regional and use VM instances in at least two different zones of a region. Regional MIGs can have up to 2000 instances.

Terraform

Two different modules authored by Google can be used to create an Instance Template and MIG:

Instance template: terraform-google-modules/vm/google//modules/instance_template
Multi-version MIG: terraform-google-modules/vm/google//modules/mig_with_percent

To optionally create an Internal HTTP load balancer, use: GoogleCloudPlatform/lb-internal/google

The following examples below create a service account, two instance templates, a MIG, and an Internal HTTP load balancer.

Pre-requisites

A custom image should be created with nginx installed and running at boot.
A VPC with a proxy-only subnet is required.
The instance template requires a service account.
- The API iam.googleapis.com must be enabled on the project

# Enable IAM API
resource "google_project_service" "project" {
 project = "my-gcp-project-1234"
 service = "iam.googleapis.com"
 disable_on_destroy = false
}

# Service Account required for the Instance Template module
resource "google_service_account" "sa" {
 project = "my-gcp-project-1234"
 account_id = "sa-mig-test"
 display_name = "Service Account MIG test"
 depends_on = [ google_project_service.project ]
}

Update project, account_id, and display_name with appropriate values.

Instance Templates

The instance template defines the instance configuration. This includes which network to join, any labels to apply to the instance, the size of the instance, network tags, disks, custom image, etc.

The MIG deployment requires an instance template.

The instance template requires that a source image have already been created.

In this terraform code example, two instance templates are created:

“A” template – initial version to use in the MIG
“B” template – future upgrade version to use with an optional canary update method

During the initial deployment, each instance template can point to the same custom image for the source_image value. In the future, each instance template should point to a different custom image.

# Instance Template "A"
# Module src: https://github.com/terraform-google-modules/terraform-google-vm/blob/master/modules/instance_template
# Registry: https://registry.terraform.io/modules/terraform-google-modules/vm/google/latest/submodules/instance_template
# Creates google_compute_instance_template
module "instance_template_A" {
 source = "terraform-google-modules/vm/google//modules/instance_template"
 region = "us-central1"
 project_id = "my-gcp-project-1234"
 subnetwork = "us-central-01"
 
 service_account = {
  email = google_service_account.sa.email
  scopes = ["cloud-platform"]
 }

 name_prefix = "nginx-a"
 tags = ["nginx"]
 labels = { mig = "nginx" }
 machine_type = "f1-micro"
 startup_script = "sed -i 's/nginx/'$HOSTNAME'/g' /var/www/html/index.nginx-debian.html"

 source_image_project = "my-gcp-project-1234"
 source_image = "image-nginx"
 disk_size_gb = 10
 disk_type = "pd-balanced"
 preemptible = true
}

# Instance Template "B"
module "instance_template_B" {
 source = "terraform-google-modules/vm/google//modules/instance_template"
 region = "us-central1"
 project_id = "my-gcp-project-1234"
 subnetwork = "us-central-01"
 
 service_account = {
  email = google_service_account.sa.email
  scopes = ["cloud-platform"]
 }

 name_prefix = "nginx-b"
 tags = ["nginx"]
 labels = { mig = "nginx" }
 machine_type = "f1-micro"
 startup_script = "sed -i 's/nginx/'$HOSTNAME'/g' /var/www/html/index.nginx-debian.html"

 source_image_project = "my-gcp-project-1234"
 source_image = "image-nginx"
 disk_size_gb = 10
 disk_type = "pd-balanced"
 preemptible = true
}

Update the following with appropriate values:

Module name
region
project_id
subnetwork – the VPC subnet to use for instances deployed via the template
name_prefix– prefix the name of instance template, it will have a version attached to the name.
- Be sure to include any specific versioning to indicate what is in the custom image.
- Lowercase only.
tags – any required tags
labels – network labels to apply to instances deployed via the template
machine_type – machine size to use
startup_script – startup script to run on each boot (not just deployment)
source_image_project – project where the image resides
source_image – image name
disk_size_gb – size of the boot disk
disk_type – type of boot disk
preemptible – if set to true, instances can be pre-empted as needed by Google Cloud.
- Preemptible instances can run for up to 24 hours before being stopped.
- The MIG will recreate replacements when preemptible capacity is available again.
- This is a cost-saving measure and should be used where possible.
- See Instance groups | Compute Engine Documentation | Google Cloud

More instance template module options are available:

See the module source for variables.tf and main.tf
See the resource definition for google_compute_instance_template

Changes to the instance template will result in a new version of the template. The MIG will be modified to use the new version. All MIG instances will be recreated. See the update_policy section of the MIG module definition (below) to control the update behavior.

Managed Instance Group

The MIG creates the set of instances using the same custom image and image template. Instances are customized as usual during first boot.

A custom startup script can run every time the instance starts and configure the VM further. See Overview | Compute Engine Documentation | Google Cloud

In this Regional MIG terraform example, the initial set of instances are deployed using the “A” template set as the instance_template_initial_version.

The same “A” template is also set for the instance_template_next_version with a value of 0 for the next_version_percent.

In a future canary update, set the instance_template_next_version to the “B” template with an appropriate value for next_version_percent.

# Regional Managed Instance Group with support for canary updates 
# Module src: https://github.com/terraform-google-modules/terraform-google-vm/tree/master/modules/mig_with_percent 
# Registry: https://registry.terraform.io/modules/terraform-google-modules/vm/google/latest/submodules/mig_with_percent 
# Creates google_compute_health_check.http (optional), google_compute_health_check.https (optional), google_compute_health_check.tcp (optional), google_compute_region_autoscaler.autoscaler (optional), google_compute_region_instance_group_manager.mig 

module "mig_nginx" { 
 source = "terraform-google-modules/vm/google//modules/mig_with_percent" 
 project_id = "my-gcp-project-1234" 
 hostname = "mig-nginx" 
 region = "us-central1" 
 target_size = 4
 
 instance_template_initial_version = module.instance_template_A.self_link 

 instance_template_next_version = module.instance_template_A.self_link 
 next_version_percent = 0 
 
 //distribution_policy_zones = ["us-central1-a", "us-central1-f"]
 
 update_policy = [{ # See https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_region_instance_group_manager#nested_update_policy 
  type = "PROACTIVE" 
  instance_redistribution_type = "PROACTIVE" 
  minimal_action = "REPLACE" 
  max_surge_percent = null 
  max_unavailable_percent = null 
  max_surge_fixed = 4 
  max_unavailable_fixed = null 
  min_ready_sec = 50 
  replacement_method = "SUBSTITUTE" 
  }] 
 
 named_ports = [{ 
  name = "web" 
  port = 80 
 }] 
 
 health_check = { 
  type = "http" 
  initial_delay_sec = 30 
  check_interval_sec = 30 
  healthy_threshold = 1 
  timeout_sec = 10 
  unhealthy_threshold = 5 
  response = "" 
  proxy_header = "NONE" 
  port = 80 
  request = "" 
  request_path = "/" 
  host = "" 
 } 
 
 autoscaling_enabled = "false" 
 /* 
 max_replicas = var.max_replicas 
 min_replicas = var.min_replicas 
 cooldown_period = var.cooldown_period 
 autoscaling_cpu = var.autoscaling_cpu 
 autoscaling_metric = var.autoscaling_metric 
 autoscaling_lb = var.autoscaling_lb 
 autoscaling_scale_in_control = var.autoscaling_scale_in_control 
 */ 
}

Update the following with appropriate values:

Module name
project_id
hostname – the prefix for provisioned VM names/hostnames. Will have a random set of 4 characters appended to the end.
region
target_size – number of instances to create in the MIG. Does not need to equal the number of zones in distribution_policy_zones.
instance_template_initial_version – template to use for initial deployment
instance_template_next_version – template to use for future canary update
next_version_percent – percentage of instances in the group (of target_size) that should use the canary update
distribution_policy_zones – zone names in the region where VMs should be provisioned.
- Optional. If not specified, the Google-authored terraform module will automatically select each zone in the region.
  - Example: us-central1 region has 4 zones so each zone will be populated in this field. This directly impacts the update_policy and its max_surge_fixed value.
- This value cannot be changed later. The module will ignore any changes.
  - The MIG will need to be destroyed and recreated to update the zones to use.
- More than two zones can be specified.
- The target_size does not need to match the number of zones specified.
- See About regional MIGs | Compute Engine Documentation | Google Cloud .
update_policy – specifies how instances should be recreated when a new version of the instance template is available.
- type set to
  - PROACTIVE will update all instances in a rolling fashion.
    - Leave max_unavailable_fixed as null which results in a value of 0, meaning no live instances can be unavailable.
    - Recommended
  - OPPORTUNISTIC means “only when you manually initiate the update on selected instances or when new instances are created. New instances can be created when you or another service, such as an autoscaler, resizes the MIG. Compute Engine does not actively initiate requests to apply opportunistic updates on existing instances.”
    - Not recommended
- max_surge_fixed indicates the number of additional instances that are temporarily added to the group during an update.
  - These new instances will use the updated template.
  - Should be greater than or equal to the number of zones in distribution_policy_zones. If there are no zones specified in distribution_policy_zones, as mentioned previously, the Google-authored MIG module will automatically select all the zones in the region.
- replacement_method can be set to either of the following values:
  - RECREATE instance name is preserved by deleting the old instance and then creating a new one with the same name.
  - SUBSTITUTE will create new instances with new names.
    - Results in a faster upgrade of the MIG – instances are available sooner than using RECREATE.
    - Recommended.
- See Terraform Registry and Automatically apply VM configuration updates in a MIG | Compute Engine Documentation | Google Cloud
named_ports – set the port name and port number as appropriate
health_check – set the check type, port, and request_path as appropriate
Autoscaling can also be configured. See Autoscaling groups of instances | Compute Engine Documentation | Google Cloud

More MIG module options are available:

See the module source of variables.tf and main.tf
See the resource definition for google_compute_region_instance_group_manager

Changes to the MIG may result in a VMs needing to update. See the update_policy section of the MIG module definition (above) to configure the behavior when updating the MIG members.

Load Balancer

An Internal Load Balancer can make a MIG highly available to internal clients.

module "ilb_nginx" {
 source = "GoogleCloudPlatform/lb-internal/google"
 version = "~4.0"
 project = "my-gcp-project-1234"
 network = module.vpc_central.network_name
 subnetwork = module.vpc_central.subnets["us-central1/central-01-subnet-ilb"].name
 region = "us-central1"
 name = "ilb-nginx"
 ports = ["80"]
 source_tags = ["nginx"]
 target_tags = ["nginx"]

 backends = [{
  group = module.mig_nginx.instance_group
  description = ""
  failover = false
 }]

 health_check = {
  type = "http"
  check_interval_sec = 30
  healthy_threshold = 1
  timeout_sec = 10
  unhealthy_threshold = 5
  response = ""
  proxy_header = "NONE"
  port = 80
  request = ""
  request_path = "/"
  host = ""
  enable_log = false
  port_name = "web"
 }
}

Update the following with appropriate values:

Module name
project
network and subnetwork – the VPC and proxy-only subnet to use
region
name
ports – the port to listen on
source_tags and target_tags – network tags to use, should be present on the MIG members via the instance template.
backends – points to the MIG
health_check – should generally match the MIG healthcheck.

More options are available, : see the module source for variables.tf and main.tf

Be sure to consider any necessary firewall rules, especially if using network tags.

The Google-authored MIG module has create_before_destroy set to true, so a new MIG can replace an existing one as a backend behind the load balancer via a very minimal outage (less than 10 seconds). The new MIG will be created and added as a backend, and then the old MIG will be destroyed.

Day 2 operations

Changing size of MIG

If needed, adjust the target_size value of the MIG module to increase or decrease the number of instances. Adjustments take place right away.

If increasing the number of instances and a new template is in place and the update_policy is OPPORTUNISTIC, the new instances will be deployed using the new template.

Changing the zones to use for a MIG

Cannot be changed after creation. MIG must be destroyed and recreated.

Deleting MIG members

Deleting a MIG member automatically reduces the target number of instances for the MIG. The deleted member is not replaced.

Restarting MIG members

Do not manually restart a MIG instance from within the VM itself. This will cause its healthcheck to fail and the MIG will delete/recreate the VM using the template.

Use the RESTART/REPLACE button in the Cloud Console and choose the Restart option. This affects all instances in the group, but can be limited to only acting against a maximum number at a time (“Maximum unavailable instances”).

The “Replace” option within RESTART/REPLACE will delete and recreate instances using the current template.

Updating MIG instances to a new version

When a new version of the custom image is released, such as when it has been updated with new software, the MIG can be updated in a controlled fashion until all members are running the updated version, without any outage.

The MIG module update_policy setting is very important for this process to ensure there is no outage:

max_surge_fixed is the number of additional instances created in the MIG and verified healthy before the old ones are removed.
- Should be set to greater than or equal to the number of zones in distribution_policy_zones
max_unavailable_fixed should be set to null which equals 0: no live instances will be unavailable during the update.

The MIG module has options for two different instance templates in order to support performing a canary update where only a percentage of instances are upgraded to the new version:

instance_template_initial_version – template to use for initial deployment
instance_template_next_version – template to use for future canary update
next_version_percent – percentage of instances in the group (of target_size) that should use the canary update

Initially, both options may point to the same template and 0% is allocated to the “next” version.

If a load balancer is used, newly created instances that are verified healthy will automatically be selected to respond to client requests.

Canary update

To move a percentage of instances to the “next” version via a “canary” update:

Set the instance_template_next_version to point to an instance template which uses an updated custom image
Set the next_version_percent to an appropriate percentage of instances in the group that should use the “next” template.
Make sure update_policy has type set to PROACTIVE – this will cause the change to take effect right away.

When applied via terraform, all instances will be recreated (adhering to the update_policy) but a percentage of instances will be created using the “next” template.

After the canary update has been validated and all instances should be upgraded, see the steps below for a Regular update.

Regular update

To update all instances at once (adhering to the update_policy):

Set both the instance_template_initial_version and instance_template_next_version to point to an instance template which uses an updated custom image
Set the next_version_percent to 0.
Make sure update_policy has type set to PROACTIVE – this will cause the change to take effect right away.

When applied via terraform, all instances will be recreated (adhering to the update_policy).

Site to Site VPN between Google Cloud and pfSense on VMware at home

Home Network

When we built the house in 2015, I set up a 3-pack of original Google Wi-Fi (not the “Nest” version) to use as my router and access points throughout the house. Google Wi-Fi is great – it’s very easy to get started. Once deployed, it can generally be thought of as “set it and forget it”. However, it doesn’t provide all the bells and whistles that some of the more advanced home routers offer, but this can be a blessing in disguise because there is less to fiddle with and potentially mess up. Most importantly, it delivers a reliable experience for the family.

My home lab is a simple Intel NUC with a dual-core Intel Core i3-6100U 2.3 GHz CPU and 32 GB RAM. It runs a standalone instance of VMware ESXi 7. I run a few VMs when I need to, but nothing “production”.

Site-to-Site VPN with Google Cloud

Since switching to a full-time focus on cloud engineering and architecture, one of the things I’ve always wanted to try is to set up a Site-to-Site IPsec VPN tunnel with BGP between my home and a virtual private cloud (VPC) network to better understand the customer experience for VPN configuration and network management.

As I mentioned earlier, Google Wi-Fi is rather basic and doesn’t offer any VPN capability, but it can do port forwarding, and when combined with a virtual appliance, that’s all we really need.

pfSense overview

Since Google Wi-Fi does not have any VPN capabilities, I intend to use a pfSense virtual appliance in ESXi to act as a router for virtual machine clients on an internal ESXi host-only network. The host-only network will have no physical uplinks so the only way out to the Internet or the private cloud network is through the pfSense router.

pfSense will provide DHCP, DNS, NAT, and routing/default gateway services only to the clients on the internal host-only network.

Because of the way it is designed, no other router can sit between Google Wi-Fi and the internet without some loss of functionality, including mesh networking, and we do not want to disturb the other users of the network (family), so we will create an isolated network with pfSense on the ESXi host.

pfSense VM will have two virtual NICs:

NIC1 is connected to the “VM Network” and has Internet access through the home network.
NIC2 is connected to the internal isolated “host-only” network which does not have any connectivity to the Internet.

Network diagram showing connectivity between the Internet and pfSense running as a virtual machine straddling two networks in the ESXi host on the Intel NUC.

pfSense installation

Netgate has a comprehensive guide on how to install the pfSense virtual appliance on VMware ESXi.

Following are some installation tips that I found to be helpful:

Upload the pfSense ISO to an ESXi datastore – don’t forget to unzip it first.
When creating a new VM for pfSense on ESXi 7, select Guest OS family “Other” and Guest OS version “FreeBSD Pre-11 versions (64-bit)”
VM Hardware:
- Set CPU to 2
- Set Memory to 1 GB
- Set Hard Disk to 8 GB
  - Make sure the SCSI adapter is LSI Logic Parallel
- Set Network Adapter 1 to the home/internet network, mark it as Connect
- Add a second Network Adapter for the host-only network, leave it as E1000, mark it as Connect
- CD/DVD Drive 1 set to Datastore ISO file and browse for the pfSense ISO, mark is as Connect

Boot the VM off the ISO, accept the defaults and let it reboot.

On first boot, the WAN interface will have a DHCP IP from the home network (Google Wi-Fi assigns in the 192.168.86.0/24 range) and the internal-facing LAN interface will have a static IP of 192.168.1.1. If this is incorrect, use the “Assign interfaces” menu item in the console to set which NIC corresponds to WAN and LAN appropriately. Use the ESXi configuration page to find the MAC address of each NIC and which network it is connected to in order to configure them appropriately.

Port-forward IPSec ports to pfSense

After pfSense is installed, we need to port-forward the external Internet-facing IPSec ports on the Google Wi-Fi router to the pfSense VM.

Google has recently relocated management of Google Wi-Fi to the Google Home app. Look for the Wi-Fi area, click the “gear” icon in upper right, select “Advanced networking”, and then “Port management”.

Use the “+” button to add a new rule. Scroll through the IPv4 tab to find the new “pfSense” entry and select it. Verify the MAC address shown is the same as the pfSense VM’s WAN NIC connected to the home network (“VM Network”). Add an entry for UDP 500. Repeat for UDP 4500.

Note: It is not possible to configure port forwarding unless the internal target is online. The Google Home app will only show a list of active targets that are connected to the network. If the pfSense host is not present, verify the VM is powered on and connected to the home network.

Port forwarding rules for inbound UDP/500 and UDP/4500 forwarding to the pfSense NIC1 on the home network

By default, pfSense only allows management access through its LAN interface, so the next step is to deploy a Jump VM with a web browser on the host-only network. Use the VM console to access the Jump VM desktop and launch the browser since it will not be reachable on the home network (in case you wanted to RDP). Verify it has a 192.168.1.0/24 IP. It should also be able to reach the internet but this is not required.

pfSense initial configuration

On the Jump VM, browse to https://192.168.1.1, accept the certificate warning, and log in as admin with password pfsense. Step through the wizard.

Some tips:

Set the Hostname and Domain to something different than the rest of the network.
Configure WAN interface: Uncheck “Block RFC1918 Private Networks”
Set a secure password for admin
Select Interfaces | WAN
- Uncheck “Block bogon networks” if selected
- Click Save and then Apply

Google Cloud VPN configuration

Use the Google Cloud Console for the following steps:

Networking | VPC Networks
- Create a new VPC network or use an existing one. Should have Dynamic routing mode set to Global.
Networking | Hybrid Connectivity | VPN
- Create a new VPN Connection
  - Classic VPN
  - Select VPC network created earlier
  - Create a new external IP address or use an available one
  - Tunnels – set Remote peer IP address to the home external internet IPv4 address (from home, visit https://whatismyipaddress.com/ and note the IPv4 address)
  - Generate and save the pre-shared key – it is needed for pfSense.
  - Select Dynamic (BGP) routing option and create a new Cloud Router. Set Google ASN to 65000. Create a new BGP session, set Peer ASN (pfSense) to 65001. Enter Cloud Router BGP IP of 169.254.0.1, and BGP peer IP (pfSense) of 169.254.0.2
  - Note the external public IP address of the Cloud VPN.

pfSense IPsec configuration

Use the Jump VM web browser for these steps in the pfSense web interface:

System | Advanced | Firewall & NAT tab: Allow APIPA traffic
VPN | IPSsec, Add P1
- Set Remote Gateway to the Google Cloud VPN external public IP recorded previously.
- Set “My identifier” to be “IP address” and enter the external public IPv4 address of the home network recorded earlier.
- Enter the Pre-Shared Key generated for the Google Cloud VPN tunnel
  - It may not be possible to paste the key in to the VM console – visit https://pastebin.com and create a new “Burn after reading” paste with the key and then access the paste from the Jump VM to retrieve the key.
- Set the Phase 1 Encryption Algorithm to AES256-GCM
- Set Life Time to 36000
Save and apply changes
Show P2 entries, Add P2
- Mode: Routed (VTI)
- Local network: Address, BGP IP 169.254.0.2
- Remote network: Address, BGP IP 169.254.0.1
- Protocol: ESP
- Encryption Algorithms
  - AES, 128 bits
  - AES128-GCM, 128 bits
  - AES192-GCM, Auto
  - AES256-GCM, Auto
- Hash Algorithms: SHA256
- PFS key group: 14 (2048 bit)
Save, Apply changes
Click on Firewall | Rules, select IPsec from along the top, Add a new rule
- Set Protocol to Any
Save rule, Apply changes

pfSense BGP configuration

Go to System | Package Manager, click on Available Packages, search for “frr”. Install “frr”. This will connect out to the Internet to retrieve the packages. Wait for it to complete successfully.

Go to Services | FRR Global/Zebra

Global Settings
- Enable FRR
- Enter a master password.
- Set Syslog Logging to enabled and set Package Logging Level to Extended
Click on Access Lists along the top
- Add a new Access List
  - Name: GCP
  - Access List Entries: set Sequence to 0, set Action to Permit, check box for Source Any
  - Click Save
Click on Prefix Lists along the top
- Add a new Prefix List
  - Name: IPv4-any
  - Prefix List Entries: set Sequence to 0, set Action to Permit, check box for Any
  - Click Save
Click on BGP along the top
- Enable BGP Routing
- Set Local AS to 65001 (GCP Cloud Router was set to 65000)
- Set Router ID to 169.254.0.2 (GCP Cloud Router was set to 169.254.0.1)
- Set Hold Time to 30
- At the bottom, set Networks to Distribute to 192.168.1.0/24
- Click Save
Click Neighbors along the top, add a new Neighbor
- Name/Address: 169.254.0.1
- Remote AS: 65000
- Prefix List Filter: IPv4-any, for both Inbound & Outbound
- Path Advertise: All Paths to Neighbor
- Save

Checking status

In pfSense, click on Status | FRR

In the Zebra Routes area, you should see “B>*” entries for subnets in the GCP VPC “via 169.254.0.1” (BGP IP of GCP Cloud Router)

In the BGP Routes area, should see Networks listed for GCP VPC subnets, with Next Hop of 169.254.0.1 (BGP IP of GCP Cloud Router) and Path of 65000 (GCP Cloud Router ASN)

BGP Neighbors should list 169.254.0.1 as a neighbor with remote AS 65000, local AS 65001 and a number of “accepted prefixes” which are the VPC subnets.

Visit the Cloud VPN area in Google Cloud Console, the VPN Tunnel should show Established, and the BGP session should also show BGO established.

Visit the VPC and click on its Routes. There should be one listed for the on-premise pfSense LAN, 192.168.1.0/24 via next hop 169.254.0.2.

Validating connectivity

At this point, VMs in GCP should be able to communicate with VMs in the on-premise pfSense LAN 192.168.1.0/24 network.

Create a GCE instance with no public IP and attach it to the VPC subnet. Make sure firewall rules apply to the instance permit ingress traffic from 192.168.1.0/24 network and permit the appropriate ports and protocols:

icmp
TCP 22 for SSH
TCP 3389 for RDP

Wrapping up

If things are not connecting, double-check everything, but also be sure to check the logs in pfSense and in GCP Cloud Logging. The most frequent issue I encountered was a mismatch of proposals by not selecting the right ciphers for the tunnel, or not setting my identifier properly. Also consider how firewall rules will impact communication.

Finally, the settings outlined here are obviously not meant for production use. I don’t claim to understand BGP any more than what it took to get pfSense working with Cloud VPN, so some of the settings I recommend could be enhanced and tightened from a security perspective. As always, your mileage may vary.