HiveMQ DNS Discovery Kubernetes

rp85 · February 7, 2020, 6:31am

Hi HiveMQ Team,

I have deployed HiveMQ on Kubernetes (OpenStack platform). When I first start the application it works fine, and the logs show nodes being added to the cluster, extensions starting, etc. But when I check again after 20 or 24 hours, the HiveMQ logs show the error below:

2020-02-07 07:27:47,003 ERROR - Failed to resolve DNS record for address 'hivemq-discovery-1.kube-prod-hivemq.svc.cluster.local.'.
java.util.concurrent.ExecutionException: java.net.UnknownHostException: failed to resolve 'hivemq-discovery-1.kube-prod-hivemq.svc.cluster.local.' after 2 queries
at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:54)
at com.hivemq.extensions.callbacks.DnsClusterDiscovery.loadOtherNodes(DnsClusterDiscovery.java:110)
at com.hivemq.extensions.callbacks.DnsClusterDiscovery.loadClusterNodeAddresses(DnsClusterDiscovery.java:88)
at com.hivemq.extensions.callbacks.DnsClusterDiscovery.reload(DnsClusterDiscovery.java:78)
at cX.e$a.run(Unknown Source)
at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(Unknown Source)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.net.UnknownHostException: failed to resolve 'hivemq-discovery-1.kube-prod-hivemq.svc.cluster.local.' after 2 queries
at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:848)
at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:809)
at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:332)
at io.netty.resolver.dns.DnsResolveContext.access$600(DnsResolveContext.java:62)
at io.netty.resolver.dns.DnsResolveContext$3.operationComplete(DnsResolveContext.java:381)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
at io.netty.resolver.dns.DnsQueryContext.setFailure(DnsQueryContext.java:220)
at io.netty.resolver.dns.DnsQueryContext.access$300(DnsQueryContext.java:43)
at io.netty.resolver.dns.DnsQueryContext$4.run(DnsQueryContext.java:170)
at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:127)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:474)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
… 1 common frames omitted
Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [/10.233.0.10:53] query timed out after 5000 milliseconds (no stack trace available)

I am pulling the latest HiveMQ DNS image from Docker Hub in my Dockerfile.

Dockerfile

ARG BASEIMAGE=hivemq/hivemq4:dns-latest

FROM ${BASEIMAGE}

##Use default DNS resolution timeout as default discovery interval
ENV HIVEMQ_DNS_DISCOVERY_INTERVAL 31
ENV HIVEMQ_DNS_DISCOVERY_TIMEOUT 30

##The default cluster transport bind port to use (UDP port)
ENV HIVEMQ_CLUSTER_PORT 7800
ENV HIVEMQ_CONTROL_CENTER_USER admin
ENV HIVEMQ_CONTROL_CENTER_PASSWORD a68fc32fc49fc4d04c63724a1f6d0c90442209c46dba6975774cde5e5149caf8

COPY config-dns.xml /opt/hivemq/conf/config.xml
COPY logback.xml /opt/hivemq/conf/logback.xml
COPY run.sh /opt/hivemq/bin/run.sh
COPY docker-entrypoint.sh /opt/docker-entrypoint.sh

##HiveMQ extensions for Redis, Kafka, IDAM, LDAP etc…

ADD extension/extension.tar.gz /opt/hivemq/extensions/

RUN chown -R hivemq:hivemq /opt/hivemq \
    && chmod +x /opt/docker-entrypoint.sh && chmod +x /opt/hivemq/bin/run.sh

##HiveMQ-licence and hivemq4-kafka-extension-license

COPY licence/hivemq.lic /opt/hivemq/license/hivemq-license.lic

COPY licence/hivemq4-kafka-extension-license.elic /opt/hivemq/license/hivemq4-kafka-extension-license.elic

RUN chown -R hivemq:hivemq /opt/hivemq*

##Set TimeZone on Container.
ENV TZ=Asia/Kolkata
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

RUN ln -sf /dev/stdout /opt/hivemq/log/hivemq.log

ENTRYPOINT ["/opt/docker-entrypoint.sh"]
CMD ["/opt/hivemq/bin/run.sh"]

How can I resolve this DNS discovery issue? Please help.

simon_b · February 7, 2020, 8:34am

Hey,

It seems like there’s an issue with the discovery DNS service. Did you create it using our recommended manifest? If not, can you post the YAML you used?
Does the HiveMQ cluster behave as expected until you see the error message? E.g., does the cluster size in the log equal the replica count in your deployment manifest?

Also, how did you deploy Kubernetes? Did you use OpenStack’s Magnum service, or did you provision the K8s cluster yourself?
Which DNS service are you using, CoreDNS or kube-dns? (You can check this by listing the pods in the kube-system namespace.)

In summary, it sounds like some kind of DNS configuration problem but it’s hard to tell at this point.

Best regards,
Simon from the HiveMQ team

rp85 · February 7, 2020, 12:41pm

@simon_b

Please see my answers to your questions below:

##I am using the HiveMQ recommended manifest.
##We are using Kubernetes as a service, deployed by the DevOps team in our organization.
##I am using kubectl to access the K8s cluster.
##The K8s cluster uses CoreDNS.
##The replica count is 3 in a single HiveMQ cluster.

##hivemq-replicacontroller.yaml

apiVersion: v1
kind: ReplicationController
metadata:
  name: hivemq-replica
  namespace: kube-hivemq
spec:
  replicas: 3
  selector:
    app: hivemq-cluster
  template:
    metadata:
      name: hivemq-cluster
      labels:
        app: hivemq-cluster
        cluster: hivemq
    spec:
      hostAliases:
        - ip: "10.X.X.X"
          hostnames:
            - "sit-kafka-cl01a"
        - ip: "10.X.X.X"
          hostnames:
            - "sit-kafka-cl01b"
      containers:
        - name: hivemq-pods
          #image: hivemq/hivemq4:dns-latest
          image: registry.xyz.com:4567/k8sproj968/devops/hivemq-dns
          resources:
            limits:
              memory: "4Gi"
              cpu: "1000m"
            requests:
              memory: "4Gi"
              cpu: "1000m"
          ports:
            - containerPort: 8080
              protocol: TCP
              name: web-ui
            - containerPort: 8000
              protocol: TCP
              name: websocket
            - containerPort: 1883
              protocol: TCP
              name: mqtt
          env:
            - name: HIVEMQ_DNS_DISCOVERY_ADDRESS
              value: "hivemq-discovery.kube-hivemq.svc.cluster.local."
            - name: HIVEMQ_DNS_DISCOVERY_TIMEOUT
              value: "30"
            - name: HIVEMQ_DNS_DISCOVERY_INTERVAL
              value: "31"
          readinessProbe:
            tcpSocket:
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 60
            failureThreshold: 60
          livenessProbe:
            tcpSocket:
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 60
            failureThreshold: 60
      imagePullSecrets:
        - name: gitlab-secret
---
kind: Service
apiVersion: v1
metadata:
  name: hivemq-discovery
  namespace: kube-hivemq
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
spec:
  selector:
    app: hivemq-cluster
  ports:
    - name: mqtt
      protocol: TCP
      port: 1883
      targetPort: 1883
    - name: websocket
      protocol: TCP
      port: 8000
      targetPort: 8000
  clusterIP: None

##hivemq-mqttclient.yaml
kind: Service
apiVersion: v1
metadata:
  name: hivemq-mqtt
  namespace: kube-hivemq
  annotations:
    service.spec.externalTrafficPolicy: Local
spec:
  type: NodePort
  selector:
    cluster: hivemq
  ports:
    - nodePort: 32000
      protocol: TCP
      port: 1883
      targetPort: 1883
      name: mqtt
    - nodePort: 30000
      protocol: TCP
      port: 8000
      targetPort: 8000
      name: websocket

###I have shared both files and the required details; please help further.

simon_b · February 7, 2020, 2:12pm

The configuration seems correct; can you run through the steps here to verify that the DNS record for the headless service returns the node addresses? https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/

--> Run nslookup hivemq-discovery.kube-hivemq.svc.cluster.local. in the debug container (dig is also a good tool for this).
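
The same check can also be scripted from inside a pod. This is a minimal sketch in Python, assuming that the standard library call `socket.getaddrinfo` is an acceptable stand-in for nslookup (it goes through the pod's own resolver configuration). The function name `resolve_all` and its injectable `resolver` parameter are illustrative and not part of HiveMQ or Kubernetes:

```python
import socket


def resolve_all(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses a hostname resolves to.

    For a headless service, each backing pod should appear as one address.
    `resolver` is injectable so the function can be exercised without DNS.
    """
    try:
        results = resolver(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Resolution failed (unknown name, temporary failure, ...).
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in results})
```

Calling `resolve_all("hivemq-discovery.kube-hivemq.svc.cluster.local.")` from a pod in the cluster should return one address per HiveMQ pod; an empty list means the lookup failed at that moment.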

Does the error appear consistently in this case (i.e. every 30 seconds), or is it intermittent, e.g. only every few hours or at random intervals?
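
One way to answer this from the logs is to look at the gaps between consecutive error lines. This is a minimal sketch, assuming the timestamp format shown in the log above (`2020-02-07 07:27:47,003`) and the discovery interval of 31 seconds set in the manifest. The function names and the tolerance value are illustrative:

```python
from datetime import datetime

ERROR_MARKER = "Failed to resolve DNS record"


def error_times(log_lines):
    """Extract the timestamps of DNS resolution error lines."""
    times = []
    for line in log_lines:
        if ERROR_MARKER in line:
            # The first 23 characters hold 'YYYY-MM-DD HH:MM:SS,mmm'.
            times.append(datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f"))
    return times


def is_consistent(times, interval_s=31, tolerance_s=10):
    """True if every gap between errors is close to the discovery interval."""
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return bool(gaps) and all(abs(g - interval_s) <= tolerance_s for g in gaps)
```

If `is_consistent` returns False for a log with several errors, the failures are spread out rather than occurring on every discovery run.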

I have observed that on some K8s clusters the DNS service will intermittently return NXDOMAIN or other errors, or simply time out (as in your case), which is usually not a problem once the cluster is established.
You might also want to look into how CoreDNS is behaving in this case; maybe it’s in a failure state and restarting, or something along those lines. You could also check whether it is always the same CoreDNS node that is failing (check the IP in the DnsNameResolverTimeoutException: [/10.233.0.10:53] query timed out statement).
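
To see whether the timeouts always name the same DNS server, the addresses in the timeout lines can be tallied. This is a minimal sketch, assuming log lines in the format of the `DnsNameResolverTimeoutException` message above; `timeouts_by_server` is an illustrative name:

```python
import re
from collections import Counter

# Matches e.g. 'DnsNameResolverTimeoutException: [/10.233.0.10:53] query timed out'
TIMEOUT_RE = re.compile(
    r"DnsNameResolverTimeoutException: \[/(?P<ip>[0-9.]+):(?P<port>\d+)\] query timed out"
)


def timeouts_by_server(log_lines):
    """Count DNS query timeouts per DNS server IP found in the log lines."""
    counts = Counter()
    for line in log_lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts
```

`timeouts_by_server(lines).most_common()` lists the servers with the most timeouts first. Note that if the logged address is the ClusterIP of the cluster DNS service rather than a pod IP, the tally cannot distinguish individual CoreDNS pods; their own logs would be needed for that.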

If the error log only happens intermittently and scaling up/down otherwise behaves normally, it should be safe to ignore the log statement.

rp85 · February 7, 2020, 4:04pm

@simon_b

I have run the nslookup command and the name resolves.

kubectl exec -ti dnsutils -- nslookup hivemq-discovery.kube-hivemq.svc.cluster.local.
Server: 10.233.0.10
Address: 10.233.0.10#53

Name: hivemq-discovery.kube-hivemq.svc.cluster.local
Address: 192.168.X.X
Name: hivemq-discovery.kube-hivemq.svc.cluster.local
Address: 192.168.X.X
Name: hivemq-discovery.kube-hivemq.svc.cluster.local
Address: 192.168.X.X

Yes, we get this error intermittently in the logs. I have mostly observed the DNS discovery resolution failure when the HiveMQ pods are idle, i.e. when no data is coming from devices.

simon_b · February 10, 2020, 1:41pm

Alright, in that case I would say it is safe to ignore the error; the DNS record looks correct.

You might want to dig into your CoreDNS logs or try debugging your OpenStack infrastructure to figure out why the resolution fails intermittently, but it will not lead to problems as long as the pods can query the address on startup.

rp85 · February 18, 2020, 5:43am

@simon_b

Thanks, Simon, for your great support.

I will check the CoreDNS logs and debug further.
