问题描述

当prover异常,或者某些原因导致各个服务不稳定时,prover_jobs 卡在in_gpu_proof 状态,跳过的证明无法自恢复进行补全。

测试数据

主节点

zksync_local=# select l1_batch_number, status, created_at from prover_jobs order by created_at desc;
 l1_batch_number |    status    |         created_at         
-----------------+--------------+----------------------------
            1147 | successful   | 2023-09-06 05:58:18.199706
            1146 | successful   | 2023-09-06 02:57:31.196466
            1145 | successful   | 2023-09-06 02:19:07.253418
            1144 | successful   | 2023-09-06 01:46:06.18188
            1143 | in_gpu_proof | 2023-09-05 11:27:41.230814
            1142 | in_gpu_proof | 2023-09-05 11:27:37.413038
            1141 | in_gpu_proof | 2023-09-05 11:27:31.75103
            1140 | in_gpu_proof | 2023-09-05 11:27:28.767585
            1139 | in_gpu_proof | 2023-09-05 11:27:23.971785

扩展节点

zksync_local=# select l1_batch_number, status, created_at from prover_jobs order by created_at desc;
 l1_batch_number |    status    |         created_at         
-----------------+--------------+----------------------------
            1147 | in_gpu_proof | 2023-09-07 10:59:47.841166
            1146 | in_gpu_proof | 2023-09-07 10:59:46.586542
            1145 | in_gpu_proof | 2023-09-07 10:59:45.218178
            1144 | in_gpu_proof | 2023-09-07 10:59:43.862459
            1143 | in_gpu_proof | 2023-09-07 10:59:42.610455
            1142 | in_gpu_proof | 2023-09-07 10:59:41.344974
            1141 | in_gpu_proof | 2023-09-07 10:59:39.956309
            1140 | in_gpu_proof | 2023-09-07 10:59:38.589833
            1139 | in_gpu_proof | 2023-09-07 10:59:37.219

跟进代码

先确认下当前zkSync是否有自恢复的逻辑

  1. 查看数据操作core/lib/dal/src/prover_dal.rs,找到自恢复数据库操作requeue_stuck_jobs
  2. 找到具体执行服务模块housekeeper,并以默认加载
    #[arg(
         long,
         default_value = "api,tree,eth,data_fetcher,state_keeper,witness_generator,housekeeper"
     )]
     components: ComponentsToRun,
    2023-09-08T12:43:43.561336Z  INFO zksync_core: Starting the components: [HttpApi, WsApi, ExplorerApi, Tree, EthWatcher, EthTxAggregator, EthTxManager, DataFetcher, StateKeeper, WitnessGenerator(None, BasicCircuits), WitnessGenerator(None, LeafAggregation), WitnessGenerator(None, NodeAggregation), WitnessGenerator(None, Scheduler), Housekeeper]
  3. housekeeper初始化
    2023-09-08T12:43:47.244008Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: GcsBlobCleaner with frequency: 60000 ms
    2023-09-08T12:43:47.244008Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: WitnessGeneratorStatsReporter with frequency: 10000 ms
    2023-09-08T12:43:47.244016Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: GpuProverQueueMonitor with frequency: 10000 ms
    2023-09-08T12:43:47.244037Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: ProverStatsReporter with frequency: 5000 ms
    2023-09-08T12:43:47.244048Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: WaitingToQueuedWitnessJobMover with frequency: 30000 ms
    2023-09-08T12:43:47.244050Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: ProverJobRetryManager with frequency: 300000 ms
    2023-09-08T12:43:47.244030Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: L1BatchMetricsReporter with frequency: 10000 ms
    2023-09-08T12:43:47.244171Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: FriProverJobRetryManager with frequency: 30000 ms
    2023-09-08T12:43:47.244281Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: WaitingToQueuedFriWitnessJobMover with frequency: 40000 ms
    2023-09-08T12:43:47.244293Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: SchedulerCircuitQueuer with frequency: 40000 ms
    2023-09-08T12:43:47.244294Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: WitnessGeneratorStatsReporter with frequency: 10000 ms
    2023-09-08T12:43:47.244300Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: FriWitnessGeneratorJobRetryManager with frequency: 30000 ms
    2023-09-08T12:43:47.244296Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: FriProverStatsReporter with frequency: 30000 ms
    此时zksync-server日志循环等待新证明
    2023-09-08T12:43:48.092403Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: submit_pending_proofs finish
    2023-09-08T12:43:48.092417Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: manage proof start try_fetch_proof_to_send
    2023-09-08T12:43:48.092433Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: manage proof: process_resend receiver for start signal
    2023-09-08T12:43:48.445424Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: try_fetch_proof_to_send: check dest block:681 commit
    2023-09-08T12:44:08.927334Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: sequence_batch_info.block_number == 0 at batch:681
  4. 判断housekeeper检查服务是否正常被执行
    async fn run_routine_task(&mut self) {
     let stuck_jobs = self
         .prover_connection_pool
         .access_storage()
         .await
         .prover_dal()
         .requeue_stuck_jobs(self.processing_timeout, self.max_attempts)
         .await;
     let job_len = stuck_jobs.len();
     for stuck_job in stuck_jobs {
         vlog::info!("re-queuing prover job {:?}", stuck_job);
     }
     // vlog::info!("server.prover.requeued_jobs{}", job_len as u64);
     // vlog::info!("server.prover.self.processing_timeout{}", self.processing_timeout.as_secs() as u64);
     // vlog::info!("server.prover.self.max_attempts{}", self.max_attempts as u64);
     metrics::counter!("server.prover.requeued_jobs", job_len as u64);
    }
    self.processing_timeout : 2700s (45分钟)
    self.max_attempts: 1

    经过测试,run_routine_task被正常定期执行,但是没有执行任务

    2023-09-08T13:13:17.988849Z  INFO zksync_core::house_keeper::prover_job_retry_manager: server.prover.requeued_jobs0
    2023-09-08T13:13:17.988864Z  INFO zksync_core::house_keeper::prover_job_retry_manager: server.prover.self.processing_timeout2700
    2023-09-08T13:13:17.988873Z  INFO zksync_core::house_keeper::prover_job_retry_manager: server.prover.self.max_attempts1
  5. 查看run_routine_task具体数据库操作
    UPDATE prover_jobs
    SET status = 'queued', attempts = attempts + 1, updated_at = now(), processing_started_at = now()
    WHERE (status = 'in_progress' AND  processing_started_at <= now() - $1::interval AND attempts < $2)
    OR (status = 'in_gpu_proof' AND  processing_started_at <= now() - $1::interval AND attempts < $2)
    OR (status = 'failed' AND attempts < $2)
    RETURNING id, status, attempts

    对照681数据

总结分析

对于zkSync本身存在自恢复逻辑housekeeper->run->run_routine_task->requeue_stuck_jobs
然后根据self.processing_timeout, self.max_attempts两个配置参数,进行数据库查询

self.processing_timeout : 2700s (45分钟)
self.max_attempts: 1
UPDATE prover_jobs
SET status = 'queued', attempts = attempts + 1, updated_at = now(), processing_started_at = now()
WHERE (status = 'in_progress' AND  processing_started_at <= now() - $1::interval AND attempts < $2)
OR (status = 'in_gpu_proof' AND  processing_started_at <= now() - $1::interval AND attempts < $2)
OR (status = 'failed' AND attempts < $2)
RETURNING id, status, attempts

初始任务状态为 queued,当任务成功下发到prover时,状态变更为in_gpu_proof,如果当前任务没有在设置时间内processing_timeout完成,则将当前状态改回 queued,重新下发prover(可能是其他prover)
注意:当状态变更为in_gpu_proof时,attempts 会加1
问题:由于默认max_attempts等于1,所以无法进行run_routine_task
解决:增大max_attempts,5
同时需要注意processing_timeout,应大于支持最低显卡类型,单显卡,运行的时间,否则可能导致算力浪费。

PROVER_NON_GPU_MAX_ATTEMPTS=5
PROVER_NON_GPU_GENERATION_TIMEOUT_IN_SECS=3600

方案验证

修改环境变量后

2023-09-08T14:00:44.990770Z  INFO zksync_core::house_keeper::prover_job_retry_manager: re-queuing prover job StuckProverJobs { id: 1149, status: "queued", attempts: 2 }
  1. 相应in_gpu_proof任务状态更新为queued
  2. 链接的prover开始proof生成