Apache Flink job is not scheduled on multiple TaskManagers in Kubernetes (replicas)
How to configure in Kubernetes a static hostname for multiple replicas of a Flink TaskManager Deployment, and fetch it in a Prometheus ConfigMap?
I have a Flink JobManager with a single TaskManager running on Kubernetes. For this I use a Service and a Deployment for the TaskManager with replicas: 1.
apiVersion: v1
kind: Service
metadata:
  name: flink-taskmanager
spec:
  type: ClusterIP
  ports:
  - name: prometheus
    port: 9250
  selector:
    app: flink
    component: taskmanager
The Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-taskmanager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink
      component: taskmanager
  template:
    metadata:
      labels:
        app: flink
        component: taskmanager
    spec:
      hostname: flink-taskmanager
      volumes:
      - name: flink-config-volume
        configMap:
          name: flink-config
          items:
          - key: flink-conf.yaml
            path: flink-conf.yaml
          - key: log4j-console.properties
            path: log4j-console.properties
      - name: tpch-dbgen-data
        persistentVolumeClaim:
          claimName: tpch-dbgen-data-pvc
      - name: tpch-dbgen-datarate
        persistentVolumeClaim:
          claimName: tpch-dbgen-datarate-pvc
      containers:
      - name: taskmanager
        image: felipeogutierrez/explore-flink:1.11.1-scala_2.12
        # imagePullPolicy: Always
        env:
        args: ["taskmanager"]
        ports:
        - containerPort: 6122
          name: rpc
        - containerPort: 6125
          name: query-state
        - containerPort: 9250
        livenessProbe:
          tcpSocket:
            port: 6122
          initialDelaySeconds: 30
          periodSeconds: 60
        volumeMounts:
        - name: flink-config-volume
          mountPath: /opt/flink/conf/
        - name: tpch-dbgen-data
          mountPath: /opt/tpch-dbgen/data
          subPath: data
        - mountPath: /tmp
          name: tpch-dbgen-datarate
          subPath: tmp
      securityContext:
        runAsUser: 9999  # refers to user _flink_ from official flink image, change if necessary
Then I export metrics from the Flink TaskManager to Prometheus. I use a Service, a ConfigMap, and a Deployment to set up Prometheus on Kubernetes and make it scrape data from the Flink TaskManager.
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
spec:
  type: ClusterIP
  ports:
  - name: promui
    protocol: TCP
    port: 9090
    targetPort: 9090
  selector:
    app: flink
    component: prometheus
The ConfigMap is where I set the Flink TaskManager host, using targets: ['flink-jobmanager:9250', 'flink-jobmanager:9251', 'flink-taskmanager:9250'], which matches the name of Flink's Kubernetes Service object (flink-taskmanager):
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  labels:
    app: flink
data:
  prometheus.yml: |+
    global:
      scrape_interval: 15s
    scrape_configs:
    - job_name: 'prometheus'
      scrape_interval: 5s
      static_configs:
      - targets: ['localhost:9090']
    - job_name: 'flink'
      scrape_interval: 5s
      static_configs:
      - targets: ['flink-jobmanager:9250', 'flink-jobmanager:9251', 'flink-taskmanager:9250']
      metrics_path: /
The Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink
      component: prometheus
  template:
    metadata:
      labels:
        app: flink
        component: prometheus
    spec:
      hostname: prometheus
      volumes:
      - name: prometheus-config-volume
        configMap:
          name: prometheus-config
          items:
          - key: prometheus.yml
            path: prometheus.yml
      containers:
      - name: prometheus
        image: prom/prometheus
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: prometheus-config-volume
          mountPath: /etc/prometheus/prometheus.yml
          subPath: prometheus.yml
This works well, and I can query the Flink TaskManager metrics in the Prometheus web UI. However, as soon as I change replicas: 1 to replicas: 3, I can no longer query data from the TaskManagers. I suspect this is because the configuration targets: ['flink-jobmanager:9250', 'flink-jobmanager:9251', 'flink-taskmanager:9250'] is no longer valid when there are multiple replicas of the Flink TaskManager. But since Kubernetes manages the creation of the new TaskManager replicas, I don't know how to configure this in Prometheus. I imagine it should be dynamic, or use a wildcard or some regex that fetches all the TaskManagers for me. Does anyone know how to configure this?
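(For reference, besides the static-target approach in the accepted answer below, Prometheus can also discover pods dynamically via kubernetes_sd_configs, which avoids hard-coding replica hostnames entirely. A minimal sketch, assuming the pod labels app: flink / component: taskmanager from the manifests above and suitable RBAC for Prometheus:

```yaml
# prometheus.yml fragment — dynamic pod discovery instead of static_configs
scrape_configs:
- job_name: 'flink-taskmanagers'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # keep only pods labeled app=flink, component=taskmanager
  - source_labels: [__meta_kubernetes_pod_label_app, __meta_kubernetes_pod_label_component]
    regex: flink;taskmanager
    action: keep
  # scrape the metrics port rather than whatever port was discovered
  - source_labels: [__meta_kubernetes_pod_ip]
    replacement: ${1}:9250
    target_label: __address__
```

This requires Prometheus to run in-cluster with a ServiceAccount allowed to list pods.)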
I got it working based on this answer https://stackoverflow.com/a/55139221/2096986 and the documentation. The first thing is that I had to use a StatefulSet instead of a Deployment. With this, the Pods get stable network identities. What was not obvious is that I also had to set the Service to use clusterIP: None (a headless Service) instead of type: ClusterIP. So here is my Service:
apiVersion: v1
kind: Service
metadata:
  name: flink-taskmanager
  labels:
    app: flink-taskmanager
spec:
  clusterIP: None # type: ClusterIP
  ports:
  - name: prometheus
    port: 9250
  selector:
    app: flink-taskmanager
And here is my StatefulSet:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: flink-taskmanager
spec:
  replicas: 3
  serviceName: flink-taskmanager
  selector:
    matchLabels:
      app: flink-taskmanager # has to match .spec.template.metadata.labels
  template:
    metadata:
      labels:
        app: flink-taskmanager # has to match .spec.selector.matchLabels
    spec:
      hostname: flink-taskmanager
      volumes:
      - name: flink-config-volume
        configMap:
          name: flink-config
          items:
          - key: flink-conf.yaml
            path: flink-conf.yaml
          - key: log4j-console.properties
            path: log4j-console.properties
      - name: tpch-dbgen-data
        persistentVolumeClaim:
          claimName: tpch-dbgen-data-pvc
      - name: tpch-dbgen-datarate
        persistentVolumeClaim:
          claimName: tpch-dbgen-datarate-pvc
      containers:
      - name: taskmanager
        image: felipeogutierrez/explore-flink:1.11.1-scala_2.12
        # imagePullPolicy: Always
        env:
        args: ["taskmanager"]
        ports:
        - containerPort: 6122
          name: rpc
        - containerPort: 6125
          name: query-state
        - containerPort: 9250
        livenessProbe:
          tcpSocket:
            port: 6122
          initialDelaySeconds: 30
          periodSeconds: 60
        volumeMounts:
        - name: flink-config-volume
          mountPath: /opt/flink/conf/
        - name: tpch-dbgen-data
          mountPath: /opt/tpch-dbgen/data
          subPath: data
        - mountPath: /tmp
          name: tpch-dbgen-datarate
          subPath: tmp
      securityContext:
        runAsUser: 9999  # refers to user _flink_ from official flink image, change if necessary
In the Prometheus configuration file prometheus.yml, I then mapped the hosts using the pattern StatefulSetName-{0..N-1}.ServiceName.default.svc.cluster.local:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  labels:
    app: flink
data:
  prometheus.yml: |+
    global:
      scrape_interval: 15s
    scrape_configs:
    - job_name: 'prometheus'
      scrape_interval: 5s
      static_configs:
      - targets: ['localhost:9090']
    - job_name: 'flink'
      scrape_interval: 5s
      static_configs:
      - targets: ['flink-jobmanager:9250', 'flink-jobmanager:9251', 'flink-taskmanager-0.flink-taskmanager.default.svc.cluster.local:9250', 'flink-taskmanager-1.flink-taskmanager.default.svc.cluster.local:9250', 'flink-taskmanager-2.flink-taskmanager.default.svc.cluster.local:9250']
      metrics_path: /
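The target list above follows the StatefulSet naming pattern mechanically, so if you change the replica count it can be regenerated rather than hand-edited. A small sketch (the function name and defaults are mine, not from the original post):

```python
def statefulset_targets(statefulset, service, replicas,
                        namespace="default", port=9250):
    """Build Prometheus static targets for a StatefulSet's pods.

    Each pod of a StatefulSet backed by a headless Service gets a
    stable DNS name: <statefulset>-<i>.<service>.<ns>.svc.cluster.local
    """
    return [
        f"{statefulset}-{i}.{service}.{namespace}.svc.cluster.local:{port}"
        for i in range(replicas)
    ]

# Reproduce the three TaskManager targets used in the ConfigMap above:
print(statefulset_targets("flink-taskmanager", "flink-taskmanager", 3))
```

Scaling the StatefulSet then only requires regenerating this list (or switching to kubernetes_sd_configs for fully dynamic discovery).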