Investigating and Fixing a Cloudera Manager (CM) Outage


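Our company's Cloudera Manager (CM) went down, and restarting the service did not bring it back up.
The error was as follows:
[Screenshot: CM outage investigation and fix, image 1]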

Troubleshooting

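I first tried restarting the service again, which produced the error shown in the screenshot above. Clicking the full log file link pointed me to /var/log/cloudera-scm-eventserver/mgmt-cmf-mgmt-EVENTSERVER-ddp1.log.out.
That log contains the following error: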

2022-03-17 08:14:35,438 ERROR com.cloudera.cmon.pipeline.PipelineStage: tagger-writer stage encountered error
com.cloudera.cmon.pipeline.ItemRejectedException: java.io.FileNotFoundException: /var/lib/cloudera-scm-eventserver/v3/_b03j.fdt (No space left on device)
        at com.cloudera.cmf.eventcatcher.server.EventIngester$TaggerWriterReceiver.receiveItem(EventIngester.java:71)
        at com.cloudera.cmf.eventcatcher.server.EventIngester$TaggerWriterReceiver.receiveItem(EventIngester.java:50)
        at com.cloudera.cmon.pipeline.PipelineStage.driver(PipelineStage.java:273)
        at com.cloudera.cmon.pipeline.PipelineStage$2.run(PipelineStage.java:149)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: /var/lib/cloudera-scm-eventserver/v3/_b03j.fdt (No space left on device)
        at java.io.RandomAccessFile.open0(Native Method)
        at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
        at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:441)
        at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:306)
        at org.apache.lucene.index.FieldsWriter.<init>(FieldsWriter.java:83)
        at org.apache.lucene.index.StoredFieldsWriter.initFieldsWriter(StoredF
2022-03-17 08:15:36,333 INFO com.cloudera.cmf.eventcatcher.server.EventCatcherService: Starting EventCatcherService. JVM Args: [-XX:+UseConcMarkSweepGC, -XX:+UseParNewGC, -Dmgmt.log.file=mgmt-cmf-mgmt-EVENTSERVER-ddp1.log.out, -Djava.awt.headless=true, -Djava.net.preferIPv4Stack=true, -Xms1073741824, -Xmx1073741824, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/tmp/mgmt_mgmt-EVENTSERVER-3c8d72dc1ca356c08341427a2af41630_pid12858.hprof, -XX:OnOutOfMemoryError=/opt/cloudera/cm-agent/service/common/killparent.sh], Args: [], Version: 6.2.0 (#968826 built by jenkins on 20190314-1704 git: 16bbe6211555460a860cf22d811680b35755ea81)

The cause is clear: the disk had run out of space ("No space left on device"), so I freed up disk space.
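Before and after cleaning up, it helps to confirm which filesystem is actually full. The following is a minimal sketch, not the exact commands I ran: the function names parse_use_pct and check_fs and the 90% threshold are my own assumptions for illustration, while the paths come from the logs above.

```shell
#!/bin/sh
# Extract the usage percentage (bare integer) from `df -P` output on stdin.
# `df -P` prints one header line followed by one line per filesystem.
parse_use_pct() {
    awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Print OK/FULL for the filesystem that holds the given path.
# The 90% default threshold is an arbitrary choice for this sketch.
check_fs() {
    path=$1
    limit=${2:-90}
    pct=$(df -P "$path" | parse_use_pct)
    if [ "$pct" -ge "$limit" ]; then
        echo "FULL $path ${pct}%"
    else
        echo "OK $path ${pct}%"
    fi
}

# Example (paths taken from the logs above):
#   check_fs /var/lib/cloudera-scm-eventserver
#   check_fs /var/log
#   df -i /var/lib/cloudera-scm-eventserver   # inode usage
```

The commented df -i line is there because running out of inodes produces the same "No space left on device" error even when free blocks remain.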
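However, even after freeing space, the service still failed to start on restart.
I found a number of suggested fixes online; one of them was: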

mv /var/lib/cloudera-scm-eventserver /var/lib/cloudera-scm-eventserver-old

I tested it, and this approach did not work in my case; the service still would not start.

Checking the processes on the server showed that there was no cloudera-scm-agent process, only a cloudera-scm-server process.
Checking the agent's status:

[root@ddp1 ~]# systemctl status cloudera-scm-agent
● cloudera-scm-agent.service - Cloudera Manager Agent Service
   Loaded: loaded (/usr/lib/systemd/system/cloudera-scm-agent.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Thu 2022-03-17 08:16:02 CST; 2h 40min ago
  Process: 20187 ExecStart=/opt/cloudera/cm-agent/bin/cm agent (code=exited, status=0/SUCCESS)
 Main PID: 20187 (code=exited, status=0/SUCCESS)

Mar 17 08:16:01 ddp1 cm[20187]: self.stream.flush()
Mar 17 08:16:01 ddp1 cm[20187]: IOError: [Errno 28] No space left on device
Mar 17 08:16:01 ddp1 cm[20187]: Logged from file cgroups.py, line 482
Mar 17 08:16:01 ddp1 cm[20187]: Traceback (most recent call last):
Mar 17 08:16:01 ddp1 cm[20187]: File "/usr/lib64/python2.7/logging/__init__.py", line 875, in emit
Mar 17 08:16:01 ddp1 cm[20187]: self.flush()
Mar 17 08:16:01 ddp1 cm[20187]: File "/usr/lib64/python2.7/logging/__init__.py", line 835, in flush
Mar 17 08:16:01 ddp1 cm[20187]: self.stream.flush()
Mar 17 08:16:01 ddp1 cm[20187]: IOError: [Errno 28] No space left on device
Mar 17 08:16:01 ddp1 cm[20187]: Logged from file main.py, line 109

Viewing the agent's log with journalctl -u cloudera-scm-agent:

Mar 17 08:16:01 ddp1 cm[20187]: IOError: [Errno 28] No space left on device
Mar 17 08:16:01 ddp1 cm[20187]: Logged from file agent.py, line 636
Mar 17 08:16:01 ddp1 cm[20187]: Traceback (most recent call last):
Mar 17 08:16:01 ddp1 cm[20187]: File "/usr/lib64/python2.7/logging/__init__.py", line 875, in emit
Mar 17 08:16:01 ddp1 cm[20187]: self.flush()
Mar 17 08:16:01 ddp1 cm[20187]: File "/usr/lib64/python2.7/logging/__init__.py", line 835, in flush
Mar 17 08:16:01 ddp1 cm[20187]: self.stream.flush()
Mar 17 08:16:01 ddp1 cm[20187]: IOError: [Errno 28] No space left on device
Mar 17 08:16:01 ddp1 cm[20187]: Logged from file cgroups.py, line 482
Mar 17 08:16:01 ddp1 cm[20187]: Traceback (most recent call last):
Mar 17 08:16:01 ddp1 cm[20187]: File "/usr/lib64/python2.7/logging/__init__.py", line 875, in emit
Mar 17 08:16:01 ddp1 cm[20187]: self.flush()
Mar 17 08:16:01 ddp1 cm[20187]: File "/usr/lib64/python2.7/logging/__init__.py", line 835, in flush
Mar 17 08:16:01 ddp1 cm[20187]: self.stream.flush()
Mar 17 08:16:01 ddp1 cm[20187]: IOError: [Errno 28] No space left on device
Mar 17 08:16:01 ddp1 cm[20187]: Logged from file cgroups.py, line 482
Mar 17 08:16:01 ddp1 cm[20187]: Traceback (most recent call last):
Mar 17 08:16:01 ddp1 cm[20187]: File "/usr/lib64/python2.7/logging/__init__.py", line 875, in emit
Mar 17 08:16:01 ddp1 cm[20187]: self.flush()
Mar 17 08:16:01 ddp1 cm[20187]: File "/usr/lib64/python2.7/logging/__init__.py", line 835, in flush
Mar 17 08:16:01 ddp1 cm[20187]: self.stream.flush()
Mar 17 08:16:01 ddp1 cm[20187]: IOError: [Errno 28] No space left on device
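The journal output is long and repetitive, so it can be easier to summarize it. Here is a rough sketch that counts the "No space left on device" entries and prints the time of the last one; the helper names count_enospc and last_enospc_time are hypothetical, and the sketch assumes journalctl's default short output format shown above.

```shell
#!/bin/sh
# Count journal lines on stdin that report ENOSPC (errno 28).
count_enospc() {
    grep -c -F '[Errno 28] No space left on device'
}

# Print the timestamp (first three fields) of the last ENOSPC line on stdin.
last_enospc_time() {
    grep -F '[Errno 28] No space left on device' | tail -n 1 | awk '{ print $1, $2, $3 }'
}

# Example:
#   journalctl -u cloudera-scm-agent --no-pager | count_enospc
#   journalctl -u cloudera-scm-agent --no-pager | last_enospc_time
```

If the last timestamp is older than the most recent restart attempts, the agent has not been started again since the disk filled up.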

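Note: I had restarted CM many times by this point, so in principle the agent should have been started many times as well. Yet the timestamps were still stuck at 8 a.m., which means the later restart attempts never got as far as restarting the agent.
Presumably the server was still checking the disk space on ddp1, but the agent process on ddp1 was gone, so the server never received any feedback that the disk on ddp1 was healthy again.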
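The stale timestamp can also be read directly from systemd instead of scanning the log. This is a small sketch under my own assumptions: unit_prop and describe_unit are hypothetical helper names, and they only parse the KEY=VALUE lines that systemctl show prints.

```shell
#!/bin/sh
# Read `systemctl show` output (KEY=VALUE lines) on stdin and print one key's value.
unit_prop() {
    sed -n "s/^$1=//p"
}

# Summarize whether the unit is running and, if not, since when it has been down.
describe_unit() {
    info=$(cat)
    state=$(printf '%s\n' "$info" | unit_prop ActiveState)
    since=$(printf '%s\n' "$info" | unit_prop InactiveEnterTimestamp)
    if [ "$state" = "active" ]; then
        echo "running"
    else
        echo "not running since: $since"
    fi
}

# Example:
#   systemctl show cloudera-scm-agent -p ActiveState -p InactiveEnterTimestamp | describe_unit
```

An InactiveEnterTimestamp that predates the CM restarts would support the conclusion that those restarts never touched the agent.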

Solution:
Manually start the cloudera-scm-agent process on ddp1:

systemctl start cloudera-scm-agent

Then restart the Cloudera Manager Service from the CM web UI. That resolved the problem.
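For future incidents, the manual step can be wrapped so that the agent is only started when it is not already running, and the result is checked right away. This is only a sketch: ensure_active is a hypothetical helper name, it must be run as root on the affected host, and it does not replace restarting the service from the CM web UI.

```shell
#!/bin/sh
# Start a systemd unit only if it is not already active, then verify the result.
ensure_active() {
    unit=$1
    if systemctl is-active --quiet "$unit"; then
        echo "$unit already running"
    elif systemctl start "$unit" && systemctl is-active --quiet "$unit"; then
        echo "$unit started"
    else
        echo "$unit failed to start" >&2
        return 1
    fi
}

# Example:
#   ensure_active cloudera-scm-agent
```

A unit can report active immediately after starting and still exit shortly afterwards, so it is worth checking systemctl status cloudera-scm-agent again after a minute or so.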