Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: k8s Tidb 实践-运维篇 (TiDB on Kubernetes in Practice: Operations)
Following the previous article on deploying Kubernetes, this article runs through basic O&M tests against the TiDB cluster running on Kubernetes.
01 TiDB Component Scaling
Scale down PD to 2 replicas:
kubectl patch -n dba tc dba --type merge --patch '{"spec":{"pd":{"replicas":2}}}'
Scale up PD to 3 replicas:
kubectl patch -n dba tc dba --type merge --patch '{"spec":{"pd":{"replicas":3}}}'
kubectl get po -n dba -o wide
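If you want to double-check the result from the workload side, the PD StatefulSet should report the new replica count (this assumes TiDB Operator's default <cluster>-pd StatefulSet naming):
kubectl get statefulset dba-pd -n dba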
02 Maintain k8s Node
Maintain db53 node
kubectl cordon db53
Scale PD back up and you will find that no new Pod is scheduled on db53; instead db55 ends up running 2 PD Pods. This is only a simulation. Running two identical cluster roles on the same node is not recommended in production.
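You can confirm that the node is now unschedulable from the node list; cordoned nodes show SchedulingDisabled in the STATUS column:
kubectl get nodes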
If the node to be maintained hosts a TiKV Pod, first migrate the Region leaders off that TiKV store.
Add an annotation with key tidb.pingcap.com/evict-leader to the TiKV Pod:
kubectl -n dba annotate pod dba-tikv-2 tidb.pingcap.com/evict-leader="none"
Execute the following command to check if all Region Leaders have been migrated:
kubectl -n dba get tc dba -ojson | jq ".status.tikv.stores | .[] | select ( .podName == \"dba-tikv-2\" ) | .leaderCount"
0
Rebuild the TiKV Pod
Check TiKV Pod store-id:
kubectl get -n dba tc dba -ojson | jq ".status.tikv.stores | .[] | select ( .podName == \"dba-tikv-2\" ) | .id"
"1"
In any PD Pod, use the pd-ctl command to offline the TiKV Pod:
kubectl exec -n dba dba-pd-0 -- /pd-ctl store delete 1
Note
Before offlining the TiKV Pod, ensure that the remaining TiKV Pods in the cluster are not fewer than the TiKV data replicas configured in PD (configuration item: max-replicas, default value 3). If this condition is not met, you need to scale up TiKV first.
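If you are unsure what max-replicas is currently set to, you can read the replication configuration through pd-ctl before offlining anything:
kubectl exec -n dba dba-pd-0 -- /pd-ctl config show replication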
If the delete command is rejected with this error, first scale out TiKV by one node:
kubectl patch -n dba tc dba --type merge --patch '{"spec":{"tikv":{"replicas":4}}}'
The newly added TiKV Pod remains in the Pending state.
Check the error:
kubectl describe pod dba-tikv-3 -n dba
The events show that one node is cordoned for maintenance and the remaining nodes do not have enough CPU. For this simulation, we can lower the resources requested by the TiKV Pods so that the new Pod can be scheduled.
Uncordon the maintenance node:
kubectl uncordon db53.clouddb.bjzdt.qihoo.net
Modify the configuration file:
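For example, the TiKV resource requests in david/tidb-cluster.yaml can be lowered along these lines (a sketch only; the exact values are illustrative and should be chosen to fit your nodes):
spec:
  tikv:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "4"
      memory: 8Gi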
Apply the configuration:
kubectl replace -f david/tidb-cluster.yaml
You can see that the TiKV pod has restarted.
Then cordon the node again as above and scale TiKV up to 4 replicas; this time the new Pod is scheduled and the scale-out succeeds.
Execute the offline TiKV Pod again:
kubectl exec -n dba dba-pd-0 -- /pd-ctl store delete 1
Success!
Unbind the TiKV Pod from the current storage.
Query the PersistentVolumeClaim used by the Pod:
kubectl get pvc -n dba
Delete the PersistentVolumeClaim:
The NAME column in the output above is the PVC name.
kubectl delete -n dba pvc tikv-dba-tikv-2 --wait=false
Delete the TiKV Pod and wait for the newly created TiKV Pod to join the cluster.
kubectl delete -n dba pod dba-tikv-2
pod "dba-tikv-2" deleted
Wait for the newly created TiKV Pod status to become Up.
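The same jq query used earlier can be reused here to follow the new store coming up; selecting a few fields makes the transition easier to read:
kubectl get -n dba tc dba -ojson | jq ".status.tikv.stores | .[] | select ( .podName == \"dba-tikv-2\" ) | {id: .id, state: .state, leaderCount: .leaderCount}"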
From the output, you can see that the new TiKV Pod has a new store-id, and Region leaders are automatically scheduled onto it.
Remove the unnecessary evict-leader-scheduler
kubectl exec -n dba dba-pd-0 -- /pd-ctl scheduler remove evict-leader-scheduler-1
Success!
Check the TiKV Pods again and you will see that none of them runs on db53 any more. The db53 node maintenance is complete.
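One way to perform this check is to list the TiKV Pods together with the nodes they run on:
kubectl get po -n dba -o wide | grep tikv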
03 Deploy TiDB Monitor
Edit the configuration file tidb_monitor.yaml:
apiVersion: pingcap.com/v1alpha1
kind: TidbMonitor
metadata:
  name: dba
  namespace: dba
spec:
  clusters:
    - name: dba
  prometheus:
    baseImage: prom/prometheus
    version: v2.18.1
    #limits:
    #  cpu: 8000m
    #  memory: 8Gi
    #requests:
    #  cpu: 4000m
    #  memory: 4Gi
    imagePullPolicy: IfNotPresent
    logLevel: info
    reserveDays: 12
    service:
      type: NodePort
      portName: http-prometheus
  grafana:
    baseImage: grafana/grafana
    version: 6.0.1
    imagePullPolicy: IfNotPresent
    logLevel: info
    #limits:
    #  cpu: 8000m
    #  memory: 8Gi
    #requests:
    #  cpu: 4000m
    #  memory: 4Gi
    username: admin
    password: admin
    envs:
      # Configure Grafana using environment variables except GF_PATHS_DATA, GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD
      # Ref https://grafana.com/docs/installation/configuration/#using-environment-variables
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_NAME: "Main Org."
      GF_AUTH_ANONYMOUS_ORG_ROLE: "Viewer"
      # if grafana is running behind a reverse proxy with subpath http://foo.bar/grafana
      # GF_SERVER_DOMAIN: foo.bar
      # GF_SERVER_ROOT_URL: "%(protocol)s://%(domain)s/grafana/"
    service:
      type: NodePort
      portName: http-grafana
  initializer:
    baseImage: pingcap/tidb-monitor-initializer
    version: v6.1.0
    imagePullPolicy: Always
    #limits:
    #  cpu: 50m
    #  memory: 64Mi
    #requests:
    #  cpu: 50m
    #  memory: 64Mi
  reloader:
    baseImage: pingcap/tidb-monitor-reloader
    version: v1.0.1
    imagePullPolicy: IfNotPresent
    service:
      type: NodePort
      portName: tcp-reloader
    #limits:
    #  cpu: 50m
    #  memory: 64Mi
    #requests:
    #  cpu: 50m
    #  memory: 64Mi
  imagePullPolicy: IfNotPresent
  persistent: true
  storageClassName: shared-ssd-storage
  storage: 10Gi
  nodeSelector: {}
  annotations: {}
  tolerations: []
  kubePrometheusURL: http://prometheus-k8s.monitoring.svc:9090
  alertmanagerURL: ""
Apply the configuration file:
kubectl apply -f tidb_monitor.yaml
You can confirm the PVC status with the following command:
kubectl get pvc -l app.kubernetes.io/instance=dba,app.kubernetes.io/component=monitor -n dba
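You can also check that the monitor Pod itself comes up; TidbMonitor creates a single Pod, typically named <name>-monitor-0, with the same labels used for the PVC above:
kubectl get pods -l app.kubernetes.io/instance=dba,app.kubernetes.io/component=monitor -n dba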
Access the Grafana monitoring dashboard
For direct access to the monitoring data, you can use kubectl port-forward to reach Grafana and Prometheus:
kubectl port-forward --address 10.228.66.152 -n dba svc/dba-grafana 3000:3000 &>/tmp/portforward-grafana.log &
Grafana monitoring is then reachable at http://10.228.66.152:3000.
kubectl port-forward --address 10.228.66.152 -n dba svc/dba-prometheus 9090:9090 &>/tmp/portforward-prometheus.log &
Prometheus monitoring data is then reachable at http://10.228.66.152:9090.
04 Alertmanager Alert Configuration
If there is an available alertmanager service in the existing infrastructure, you can configure alerts as follows:
Modify the tidb_monitor.yaml configuration file to point Prometheus at the existing alertmanager.
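For example, pointing Prometheus at an existing alertmanager only requires setting the alertmanagerURL field in the TidbMonitor spec (the address below reuses the host from this article; substitute your own alertmanager service):
alertmanagerURL: "http://10.228.66.152:9093"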
Reapply the configuration
kubectl replace -f tidb_monitor.yaml
If you want to deploy a separate alertmanager, refer to the official prometheus/alertmanager repository (https://github.com/prometheus/alertmanager) to deploy the alertmanager component.
docker run --name alertmanager -d -p 10.228.66.152:9093:9093 quay.io/prometheus/alertmanager
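You can verify that alertmanager is reachable through its health endpoint:
curl http://10.228.66.152:9093/-/healthy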
Generally, customized alerts are needed, for example sending to email or an internal messaging system. The following steps log into the docker container and modify the configuration file.
docker ps | grep alert   # find the alertmanager container ID
docker exec -it 25ed6524c91a sh   # log into the container
Customize the alert routing to send to email or a webhook as needed:
cat > /etc/alertmanager/alertmanager.yml << EOF
global:
  smtp_smarthost: 'localhost:25'
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
route:
  receiver: "webhook-sms"
  group_by: ['env','instance','alertname','type','group','job']
  group_wait: 30s
  group_interval: 3m
  repeat_interval: 3h
receivers:
  - name: 'webhook-sms'
    webhook_configs:
      - url: 'http://api.xxxxx/public/alertmanagerSendXSM'
EOF
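Before restarting, you can validate the new configuration from inside the container with amtool, which ships in the alertmanager image:
amtool check-config /etc/alertmanager/alertmanager.yml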
Stop the container:
docker stop cdfe1e7beef4
Start the container:
docker start cdfe1e7beef4
Check the service:
List the containers of a specific Pod, then open a shell in one of them:
kubectl get pods dba-tidb-1 -n dba -o jsonpath={.spec.containers[*].name}
kubectl exec -it dba-tidb-0 -c tidb -n dba -- /bin/sh
The -c flag specifies which container in the Pod to enter.
05 Modify Node Configuration Separately
Example for TiKV:
After the TiKV Pod enters diagnostic (debug) mode, you can manually modify the TiKV configuration file and then start the TiKV process with the modified configuration.
The specific steps are as follows:
Get the TiKV startup command from the TiKV logs, which will be used in subsequent steps.
kubectl logs pod/dba-tikv-0 -n dba -c tikv | head -2 | tail -1
The output will be similar to the following, which is the TiKV startup command.
/tikv
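For reference, a minimal sketch of how a TiKV Pod is typically put into diagnostic mode with TiDB Operator (the runmode=debug annotation plus terminating the TiKV process makes the Pod restart into a sleeping debug container; confirm the procedure against your operator version before relying on it):
kubectl annotate pod dba-tikv-0 -n dba runmode=debug
kubectl exec dba-tikv-0 -n dba -c tikv -- kill -s TERM 1
# after the Pod enters debug mode, open a shell in the tikv container,
# edit the configuration file, and start TiKV manually with the startup command captured above
kubectl exec -it dba-tikv-0 -n dba -c tikv -- sh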