Sunday, July 21, 2024

GKE Cluster networking issues and troubleshooting

Typical GKE networking issues and their solutions

Google Kubernetes Engine (GKE) provides a robust and scalable platform for managing containerised applications. Nevertheless, as in any distributed system, networking complexity can create difficulties and cause connectivity problems. This blog post explores typical GKE networking issues and offers step-by-step troubleshooting methods to resolve them.

The following are some typical GKE connectivity problems that Google Cloud users encounter:

Problems with GKE Cluster control plane connectivity

Pods or nodes in a GKE cluster are unable to reach the control plane endpoint, possibly because of network problems.

GKE internal communications

  • Within the same VPC, pods cannot reach other pods or services: In a GKE cluster, every pod is assigned its own IP address. A disruption in connectivity between pods within the cluster can affect how the application functions.
  • Pods cannot be reached by nodes, or vice versa: A GKE cluster can contain many nodes to spread application workloads for scalability and reliability, and a single node can host multiple pods. Network problems may prevent nodes from communicating with the pods they host.

Issues with external communication

  • Pods are unable to access internet services: Issues with internet connectivity may make it impossible for pods to use databases, external APIs, or other resources.
  • Pods cannot be reached by outside services: Services exposed through GKE load balancers may be unreachable from outside the cluster.

Communication outside the cluster VPC

  • Resources in other VPCs are inaccessible to pods: When pods need to communicate with services in a different VPC (either within the same project or through VPC peering), connectivity problems can occur.
  • Pods are unable to access on-premises resources: When GKE clusters must interact with systems in your organisation’s data centre, issues may arise (for example, when connecting over VPN or Hybrid Connectivity).

Steps for troubleshooting

If you experience a connectivity problem in your Google Kubernetes Engine (GKE) environment, there are specific actions you can take to resolve it. The steps below walk through the suggested troubleshooting tree.

Step 1: Check for connectivity

Connectivity Tests is a diagnostic tool that lets you verify connectivity between network endpoints. It analyses your configuration and, in some cases, performs live data plane analysis between the endpoints. It helps confirm whether the network path is correct and whether any firewall rules or routes are blocking connectivity.
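
As a rough illustration of Step 1, the sketch below assembles a gcloud command that creates a connectivity test between two IP addresses. The test name, IP addresses, and port are made-up placeholders, and depending on where the source lives you may also need to identify its network or project, so treat this as a starting point rather than a complete invocation.

```python
import shlex


def connectivity_test_command(name, source_ip, destination_ip, port, protocol="TCP"):
    """Build a gcloud command line that creates a connectivity test.

    All argument values are supplied by the caller; nothing here talks to
    Google Cloud. The function only returns a shell-safe command string.
    """
    args = [
        "gcloud", "network-management", "connectivity-tests", "create", name,
        f"--source-ip-address={source_ip}",
        f"--destination-ip-address={destination_ip}",
        f"--destination-port={port}",
        f"--protocol={protocol}",
    ]
    return shlex.join(args)


if __name__ == "__main__":
    # Placeholder values: a pod IP as source and an external endpoint as destination.
    print(connectivity_test_command("pod-to-api", "10.8.0.15", "203.0.113.10", 443))
```

Printing the command instead of running it keeps the sketch safe to try; you can review the output and paste it into a shell once the values match your environment.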

Step 2: Identify the problem

  • Use a Compute Engine (GCE) VM in the same subnet as your GKE cluster, and check whether that VM can connect to the external endpoint.
  • If you can connect from the VM, the problem probably lies in your GKE configuration. If not, focus on VPC networking.
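
One simple way to run the same check from the VM, and later from a node or a pod, is a plain TCP probe. The following is a minimal sketch that assumes the endpoint speaks TCP; the function name is illustrative.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Calling can_connect with the endpoint’s host and port from each vantage point shows where connectivity stops: if it succeeds on the VM but fails inside a pod, the GKE configuration is the more likely culprit.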

Step 3: Examine and correct your GKE setup

Test the connection from a GKE node. If it works from the node but not from a pod, look into the following areas:

  • IP masquerading: Verify that the ip-masq-agent is enabled and running, and that its ConfigMap matches your network configuration. Traffic to the destinations listed under “nonMasqueradeCIDRs” in the ConfigMap YAML is sent with the pod IP address as its source rather than the node IP address, so the destination endpoint must allow traffic from the pod IP range. If only the ip-masq-agent DaemonSet is running and no ip-masq-agent ConfigMap exists, traffic to all default non-masquerade destinations is sent with the pod IP address as its source. For Autopilot clusters, this is configured through Egress NAT policies.
  • Network policies: Check the ingress and egress rules for anything that might block the traffic. If you’re using Dataplane V2, turn on logging.
  • iptables: Compare the rules on working and non-working nodes. You can run “sudo iptables-save” on each node to capture its rules.
  • Service mesh: If you are using Cloud Service Mesh or Istio in your environment, try a test pod in the namespace with istio-proxy injection disabled to see whether the problem persists. If connectivity works with sidecar injection turned off, the service mesh configuration is probably the problem.
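
To make the nonMasqueradeCIDRs behaviour concrete, here is a small sketch that predicts which source address a destination will see. The CIDR list in the example is hypothetical and is not the GKE default list, and the sketch ignores any other ip-masq-agent settings.

```python
import ipaddress


def egress_source(destination_ip, non_masquerade_cidrs):
    """Return "pod" if traffic keeps the pod IP as source, otherwise "node".

    Destinations inside a non-masquerade CIDR are not rewritten, so the
    remote side sees the pod IP; everything else is masqueraded to the node IP.
    """
    destination = ipaddress.ip_address(destination_ip)
    for cidr in non_masquerade_cidrs:
        if destination in ipaddress.ip_network(cidr):
            return "pod"
    return "node"


if __name__ == "__main__":
    # Hypothetical ConfigMap contents, for illustration only.
    cidrs = ["10.0.0.0/8", "192.168.0.0/16"]
    print(egress_source("10.20.30.40", cidrs))   # inside 10.0.0.0/8
    print(egress_source("203.0.113.10", cidrs))  # outside every listed range
```

If the result is "pod", the firewall or allowlist at the destination has to permit the pod IP range; if it is "node", it has to permit the node range.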

Note: Some of these procedures, such as checking iptables rules or testing connections from a GKE node, apply only to Standard clusters and will not work on GKE Autopilot clusters.
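
For the iptables comparison on Standard clusters, a short script can highlight the differences between two saved rule dumps. This is a simplified sketch: it assumes you have already captured the “sudo iptables-save” output from a working and a non-working node as text, and it compares lines flatly without tracking which table each rule belongs to.

```python
import re


def normalize(iptables_save_output):
    """Return rule lines with comments and chain counters removed."""
    lines = []
    for line in iptables_save_output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and timestamp comments
        # Chain headers such as ":INPUT ACCEPT [10:200]" carry packet and
        # byte counters that differ on every node; strip them.
        line = re.sub(r"\s*\[\d+:\d+\]$", "", line)
        lines.append(line)
    return lines


def compare_rules(working_output, broken_output):
    """Return (rules only on the working node, rules only on the broken node)."""
    working = normalize(working_output)
    broken = normalize(broken_output)
    missing = [rule for rule in working if rule not in broken]
    extra = [rule for rule in broken if rule not in working]
    return missing, extra
```

Rules reported as missing on the broken node, or present only there, are the first candidates to inspect.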

Step 4: Identify problems unique to a node

If only a particular node has connectivity problems:

  • Compare configurations: Make sure the affected node matches the working nodes.
  • Check resource utilisation: Look for problems with CPU, memory, or the connection tracking (conntrack) table.
  • Collect a sosreport from the faulty node. This can help with root cause analysis (RCA).
  • If the problem is limited to specific GKE nodes, you can apply a logging filter scoped to those nodes and narrow the search to the relevant time window to find recurring errors. Log entries such as connection timeout, OOM kill (oom_watcher), Kubelet is unhealthy, and NetworkPluginNotReady can help with troubleshooting. You can find more queries of this kind under GKE node-level queries.
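
Once node logs for the relevant time window have been exported, a quick tally of the signals named above can show which error dominates. The sketch below works on plain text lines; the sample log lines are invented for illustration and do not reproduce real GKE log output.

```python
from collections import Counter

# Signals mentioned in the troubleshooting step above.
SIGNALS = [
    "connection timeout",
    "oom_watcher",
    "Kubelet is unhealthy",
    "NetworkPluginNotReady",
]


def count_signals(log_lines, signals=SIGNALS):
    """Count how many log lines contain each signal (case-insensitive)."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for signal in signals:
            if signal.lower() in lowered:
                counts[signal] += 1
    return counts


if __name__ == "__main__":
    # Invented sample lines, only to show the shape of the result.
    sample = [
        "node-a kubelet: NetworkPluginNotReady: network plugin is not ready",
        "node-a kernel: oom_watcher detected an OOM kill",
        "node-a app: dial tcp: connection timeout",
    ]
    for signal, count in count_signals(sample).most_common():
        print(signal, count)
```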

Step 5: Address external communication

If you’re having external connectivity issues with a private GKE cluster, make sure Cloud NAT is enabled for both the pod and node CIDRs.
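
The check can be sketched against a NAT configuration exported as JSON. The field names and enum values below follow the Compute Engine router NAT resource as I understand it, so verify them against your own output before relying on the result; the subnet URL and the range name "pods" are hypothetical.

```python
def nat_coverage(nat, subnet_url, pod_range_name):
    """Report whether a NAT config covers node (primary) and pod (secondary) ranges.

    nat is a dict shaped like one entry of a Cloud Router's NAT list
    (assumed field names; check them against your real configuration).
    """
    mode = nat.get("sourceSubnetworkIpRangesToNat")
    if mode == "ALL_SUBNETWORKS_ALL_IP_RANGES":
        return {"nodes": True, "pods": True}
    if mode == "ALL_SUBNETWORKS_ALL_PRIMARY_IP_RANGES":
        # Primary ranges only: node IPs are covered, pod ranges are not.
        return {"nodes": True, "pods": False}

    nodes = pods = False
    for subnet in nat.get("subnetworks", []):
        if subnet.get("name") != subnet_url:
            continue
        ranges = subnet.get("sourceIpRangesToNat", [])
        if "ALL_IP_RANGES" in ranges:
            nodes = pods = True
        if "PRIMARY_IP_RANGE" in ranges:
            nodes = True
        if ("LIST_OF_SECONDARY_IP_RANGES" in ranges
                and pod_range_name in subnet.get("secondaryIpRangeNames", [])):
            pods = True
    return {"nodes": nodes, "pods": pods}
```

A result with "pods" set to False matches the common failure where nodes reach the internet but pods do not.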

Step 6: Resolve connectivity problems with the control plane

  • Connectivity from nodes to the GKE cluster control plane (GKE master endpoint) varies depending on the type of GKE cluster (private, public, or PSC-based).
  • Most steps for verifying control plane connectivity are the same as those described above for common connectivity issues, including running connectivity tests to the cluster’s private or public control plane endpoint.
  • In addition, confirm that the source is allowed in the control plane authorised networks and, if the source is located outside the GKE cluster’s region, that global access to the cluster’s control plane is enabled.
  • If traffic from outside GKE must reach the control plane only on its private endpoint, make sure the cluster is created with --enable-private-endpoint. This flag indicates that the cluster is managed using the private IP address of the control plane API endpoint. Note that, regardless of the public endpoint setting, pods and nodes within the same cluster always connect to the GKE master through its private endpoint.
  • When the control plane of a GKE cluster A with its public endpoint enabled is accessed from another private GKE cluster B (such as Cloud Composer), pods of cluster B always attempt to connect to the public endpoint of cluster A. You must therefore ensure that private cluster B has Cloud NAT enabled for outbound access and that the Cloud NAT IP ranges are allowed in the control plane authorised networks on cluster A.
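
The authorised networks and global access checks above can be expressed as a small pre-flight function. This is a sketch under simplified assumptions: the caller supplies the source IP (for cluster B, its Cloud NAT IP), both regions, the authorised CIDR list, and the global access setting; all values in the example are placeholders.

```python
import ipaddress


def control_plane_problems(source_ip, source_region, cluster_region,
                           authorized_cidrs, global_access_enabled):
    """Return a list of likely reasons the control plane rejects this source."""
    problems = []
    source = ipaddress.ip_address(source_ip)
    if not any(source in ipaddress.ip_network(cidr) for cidr in authorized_cidrs):
        problems.append("source IP is not in the control plane authorised networks")
    if source_region != cluster_region and not global_access_enabled:
        problems.append("source is outside the cluster region and global access is disabled")
    return problems


if __name__ == "__main__":
    # Placeholder values: a NAT IP that is missing from the authorised list.
    for problem in control_plane_problems(
            "198.51.100.7", "europe-west1", "us-central1",
            ["203.0.113.0/24"], global_access_enabled=False):
        print(problem)
```

An empty list does not prove that connectivity works; it only rules out these two configuration causes.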

In summary

The preceding steps cover typical connectivity problems and offer a basic framework for troubleshooting. If the issue is complex or intermittent, a deeper investigation is necessary. For a thorough root cause analysis, this means collecting packet captures on the affected node (Standard clusters only) or pod (both Autopilot and Standard clusters) while the problem is occurring. Contact Cloud Support if you need any additional help with these problems.

Thota nithya
Thota Nithya has been writing Cloud Computing articles for govindhtech from APR 2023. She was a science graduate. She was an enthusiast of cloud computing.
