You are viewing: posts by Surou

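Git: making a changed submodule URL in .gitmodules take effect

  1. Edit the url property of the relevant module in the .gitmodules file;
  2. Run git submodule sync to copy the new URL into .git/config;
  3. Then initialize the submodule: git submodule init
  4. Finally, update the submodule: git submodule update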
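
The same sequence can be scripted. The sketch below is illustrative only: the function names and their parameters are hypothetical, and step 1 is done with `git config -f .gitmodules` rather than by editing the file by hand. It assumes the submodule's name and checkout path are known to the caller.

```python
import subprocess


def submodule_url_commands(name, path, new_url):
    """Build the git commands that repoint one submodule at new_url.

    name is the submodule's name in .gitmodules and path is its checkout
    path; both are hypothetical parameters supplied by the caller.
    """
    return [
        # Step 1: rewrite the url property in .gitmodules.
        ["git", "config", "-f", ".gitmodules", f"submodule.{name}.url", new_url],
        # Step 2: copy the new URL into .git/config.
        ["git", "submodule", "sync", "--", path],
        # Step 3: initialize the submodule.
        ["git", "submodule", "init", "--", path],
        # Step 4: update the submodule from the new URL.
        ["git", "submodule", "update", "--", path],
    ]


def apply_submodule_url(name, path, new_url, repo_dir="."):
    """Run the commands in order inside repo_dir, stopping at the first failure."""
    for cmd in submodule_url_commands(name, path, new_url):
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

Keeping the command list separate from the runner makes it easy to print the commands for review before anything touches the repository.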

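zkSync: finding the L1 transaction hashes for CommitBlocks, PublishProofBlocksOnchain and ExecuteBlocks from an L2-to-L1 cross-chain transaction hash

Cross-chain transaction that was initiated: 0x0bbbcf153f17ec0e1f12d698bbc64f9d242bc1cd1312dd5f34febf0e6cb6601a

  1. Look up the tx_hash in the l2_to_l1_logs table to get its miniblock_number: 15841
  2. Use that miniblock_number to query the miniblocks table by number
    select * from miniblocks where number=15841;
  3. This gives the l1_batch_number that contains the cross-chain transaction: 7670
  4. Use the l1_batch_number to query l1_batches by number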
    select number,is_finished,eth_commit_tx_id,eth_prove_tx_id,eth_execute_tx_id from l1_batches where number=7670;

    Result:

    number | is_finished | eth_commit_tx_id | eth_prove_tx_id | eth_execute_tx_id 
    --------+-------------+------------------+-----------------+-------------------
    7670 | t           |            41073 |           41148 |             41151
  5. Query eth_txs by eth_commit_tx_id to get the transaction record for CommitBlocks
    select nonce,contract_address,tx_type,has_failed,sent_at_block,tx_status,confirmed_eth_tx_history_id from eth_txs where id=41073 ORDER BY updated_at DESC limit 1;
    nonce |              contract_address              |   tx_type    | has_failed | sent_at_block | tx_status | confirmed_eth_tx_history_id 
     -------+--------------------------------------------+--------------+------------+---------------+-----------+-----------------------------
      41091 | 0x5e3e5f6ef0e21f0cf5b4c3acd3cf29740b1cbbd8 | CommitBlocks | f          |               | Done      |                       43194
  6. Use confirmed_eth_tx_history_id to get the transaction hash for CommitBlocks
    select eth_tx_id,tx_hash,confirmed_at from eth_txs_history where id=43194  ORDER BY updated_at DESC limit 10;
    eth_tx_id |                              tx_hash                               |        confirmed_at        
     -----------+--------------------------------------------------------------------+----------------------------
          41073 | 0x0b01e199877faef52b95477119f53bf546a2915bc903132331f41542e58da53d | 2023-09-18 03:50:19.059576
  7. In the same way, look up eth_prove_tx_id and eth_execute_tx_id to get their transaction hashes
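
The lookup chain above can be sketched as a single function. This is a minimal illustration, not zkSync tooling: the function and helper names are hypothetical, hashes are treated as plain text, and an in-memory SQLite database holding only the rows quoted above stands in for the real PostgreSQL database. With a PostgreSQL driver the placeholder would typically be `%s`.

```python
import sqlite3


def l1_tx_hashes(conn, l2_tx_hash, placeholder="?"):
    """Follow the lookup chain and return {stage: L1 tx hash or None}.

    Returns None when the transaction, miniblock or batch row is missing.
    """
    cur = conn.cursor()

    def fetch_one(sql, value):
        # Queries are written with "?"; swap in the driver's own placeholder.
        cur.execute(sql.replace("?", placeholder), (value,))
        return cur.fetchone()

    row = fetch_one("SELECT miniblock_number FROM l2_to_l1_logs WHERE tx_hash = ?", l2_tx_hash)
    if row is None:
        return None
    row = fetch_one("SELECT l1_batch_number FROM miniblocks WHERE number = ?", row[0])
    if row is None or row[0] is None:
        return None
    batch = fetch_one(
        "SELECT eth_commit_tx_id, eth_prove_tx_id, eth_execute_tx_id "
        "FROM l1_batches WHERE number = ?", row[0])
    if batch is None:
        return None

    hashes = {}
    for stage, eth_tx_id in zip(("commit", "prove", "execute"), batch):
        hashes[stage] = None
        if eth_tx_id is None:
            continue  # no L1 transaction recorded for this stage
        tx = fetch_one("SELECT confirmed_eth_tx_history_id FROM eth_txs WHERE id = ?", eth_tx_id)
        if tx is None or tx[0] is None:
            continue  # no confirmed history row recorded
        hist = fetch_one("SELECT tx_hash FROM eth_txs_history WHERE id = ?", tx[0])
        if hist is not None:
            hashes[stage] = hist[0]
    return hashes


def demo_connection():
    """In-memory stand-in that holds only the rows quoted in the walkthrough."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE l2_to_l1_logs (tx_hash TEXT, miniblock_number INTEGER);
        CREATE TABLE miniblocks (number INTEGER, l1_batch_number INTEGER);
        CREATE TABLE l1_batches (number INTEGER, eth_commit_tx_id INTEGER,
                                 eth_prove_tx_id INTEGER, eth_execute_tx_id INTEGER);
        CREATE TABLE eth_txs (id INTEGER, confirmed_eth_tx_history_id INTEGER);
        CREATE TABLE eth_txs_history (id INTEGER, tx_hash TEXT);
        INSERT INTO l2_to_l1_logs VALUES
            ('0x0bbbcf153f17ec0e1f12d698bbc64f9d242bc1cd1312dd5f34febf0e6cb6601a', 15841);
        INSERT INTO miniblocks VALUES (15841, 7670);
        INSERT INTO l1_batches VALUES (7670, 41073, 41148, 41151);
        INSERT INTO eth_txs VALUES (41073, 43194);
        INSERT INTO eth_txs_history VALUES
            (43194, '0x0b01e199877faef52b95477119f53bf546a2915bc903132331f41542e58da53d');
    """)
    return conn
```

Because the demo data contains only the CommitBlocks rows shown above, the prove and execute entries come back as None there; against a full database they would resolve the same way as the commit entry.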

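zkSync Era outage investigation

Background

According to external reports:

News from September 12: according to the zkSync Era block explorer, zkSync Era mainnet appears to be down. The latest batch submitted to Ethereum is #208455, at 14:14; the block height has stalled at #13641404, and block production has been paused for 37 minutes.

Why we followed up

Some of our existing projects are built on zkSync Era, so we need to confirm what caused the problem, whether an official fix has been released, and whether the version we currently run is affected in the same way.

Confirming the problem

Start from the explorer data: check whether the problem described in the report actually exists, and work out where it sits (block explorer or chain node).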

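Check batch heights and times

Batch height | Block time       | Link                                    | Position
-------------+------------------+-----------------------------------------+-----------
208455       | 2023-09-12 14:14 | https://explorer.zksync.io/batch/208455 |
208456       | 2023-09-12 14:14 | https://explorer.zksync.io/batch/208456 |
208457       | 2023-09-12 14:15 | https://explorer.zksync.io/batch/208457 | next (+1)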

The time difference between 208455 and 208456 is as expected.

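Check block heights and times

Block height | Batch height | Committed time   | Link                                      | Position
-------------+--------------+------------------+-------------------------------------------+-----------
13641404     | 208456       | 2023-09-12 14:14 | https://explorer.zksync.io/block/13641404 |
13641405     | 208456       | 2023-09-12 14:14 | https://explorer.zksync.io/block/13641405 |
13641406     | 208456       | 2023-09-12 14:14 | https://explorer.zksync.io/block/13641406 | next (+1)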

Check batch commit times

Batch height | Commit time                  | Commit tx hash | Position
-------------+------------------------------+----------------+-----------
208455       | Sep-12-2023 06:15:59 AM +UTC | https://etherscan.io/tx/0x369446bc9d99087aa1160d426b7af372dce91bb7d372724b7c529f2e3ff30ecd |
208456       | Sep-12-2023 06:16:35 AM +UTC | https://etherscan.io/tx/0x902b3b0eee2e82ef048e8de8ec0417d7875c0930b5b0b893de48f8b5b59f8944 |
208457       | Sep-12-2023 06:17:59 AM +UTC | https://etherscan.io/tx/0x7ce6d03ead9117a0a7268042c6e19637c702df2382070f04442df75602461661 | next (+1)

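Analysis result

Comparing the node's batch and block production times and the batch commit times against the times around #208455 given in the report, the chain data shows no outage.
Most likely the block explorer service, or some of the RPC nodes that feed it data, fell behind on block synchronization at the time.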
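
The commit-time comparison can be reproduced with a few lines of arithmetic. The sketch below uses the three Etherscan commit times listed above, rewritten as ISO timestamps in UTC; the names are illustrative.

```python
from datetime import datetime

# Commit times of batches 208455, 208456 and 208457 (UTC), from the commit table.
COMMIT_TIMES = {
    208455: "2023-09-12T06:15:59",
    208456: "2023-09-12T06:16:35",
    208457: "2023-09-12T06:17:59",
}


def commit_gaps(times):
    """Return {batch: seconds elapsed since the previous batch's commit}."""
    ordered = sorted(times.items())
    gaps = {}
    for (_, prev), (batch, cur) in zip(ordered, ordered[1:]):
        delta = datetime.fromisoformat(cur) - datetime.fromisoformat(prev)
        gaps[batch] = int(delta.total_seconds())
    return gaps


print(commit_gaps(COMMIT_TIMES))  # {208456: 36, 208457: 84}
```

Both gaps are well under two minutes, nowhere near the 37-minute pause described in the report.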

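How to avoid it

Run multiple primary/standby and disaster-recovery instances of the explorer and of the RPC nodes that serve queries, with real-time height checks and automatic failover between them.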
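
A height check with automatic switching might look like the sketch below. It is a simplified illustration: the function name, the `max_lag` threshold and the endpoint names are assumptions, and the heights are passed in rather than fetched (in practice they could come from each node's `eth_blockNumber` JSON-RPC call).

```python
def pick_endpoint(heights, active, max_lag=5):
    """Choose which endpoint should serve queries.

    heights maps endpoint name -> latest block height it reports, or None if
    its health check failed. The active endpoint is kept unless it is
    unreachable or more than max_lag blocks behind the best height seen.
    """
    alive = {name: h for name, h in heights.items() if h is not None}
    if not alive:
        return None  # nothing healthy to switch to
    best = max(alive.values())
    if active in alive and best - alive[active] <= max_lag:
        return active
    # Fail over to the endpoint reporting the highest block.
    return max(alive, key=alive.get)
```

Keeping the active endpoint while it is within `max_lag` avoids flapping between nodes that differ by only a block or two.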

GPU driver and CUDA version reference

Driver Version | CUDA Version | docker image
---------------+--------------+-------------------------------------------------------
510            | 11.6         | docker pull nvidia/cuda:11.6.2-runtime-ubuntu20.04
525            | 12.0         | docker pull nvidia/cuda:12.0.0-runtime-ubuntu20.04
530            | 12.1         | docker pull nvidia/cuda:12.1.0-runtime-ubuntu20.04
535            | 12.2         | docker pull nvidia/cuda:12.2.0-runtime-ubuntu20.04

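The host server needs the matching GPU driver installed (it usually already is), and the CUDA image you use must correspond to that driver.

Prover Docker images provided externally all use Ubuntu 20.04.

  • base: starting from CUDA 9.0, contains the bare minimum (libcudart) needed to deploy a pre-built CUDA application. Use this image if you want to choose manually which CUDA packages to install.
  • runtime: extends the base image by adding all the shared libraries from the CUDA toolkit. Use this image if you have a pre-built application that uses multiple CUDA libraries.
  • devel: extends the runtime image by adding the compiler toolchain, debugging tools, headers and static libraries. Use this image to compile a CUDA application from source.

Source: https://hub.docker.com/r/nvidia/cuda/tags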
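
Choosing the image from the installed driver version can be automated. The sketch below encodes only the four rows of the table above; the function name is hypothetical, and it assumes that a driver newer than a listed row can still run that row's image, since NVIDIA drivers are backward compatible with older CUDA runtimes.

```python
# Driver major version -> CUDA runtime image, copied from the table above.
IMAGES = {
    510: "nvidia/cuda:11.6.2-runtime-ubuntu20.04",
    525: "nvidia/cuda:12.0.0-runtime-ubuntu20.04",
    530: "nvidia/cuda:12.1.0-runtime-ubuntu20.04",
    535: "nvidia/cuda:12.2.0-runtime-ubuntu20.04",
}


def image_for_driver(driver_version):
    """Return the newest listed image that the given driver can run.

    driver_version may be a full version string such as "535.104.05" or just
    the major number. Assumption: a driver supports every listed image whose
    driver row is less than or equal to its own major version.
    """
    major = int(str(driver_version).split(".")[0])
    usable = [listed for listed in IMAGES if listed <= major]
    if not usable:
        raise ValueError(f"driver {driver_version} is older than every listed row")
    return IMAGES[max(usable)]
```

For example, `image_for_driver("535.104.05")` selects the 12.2.0 image, while a 527-series driver falls back to the 12.0.0 image.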

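Analysis and fix for proofs not being generated in order

Problem description

When a prover fails, or the services become unstable for some other reason, prover_jobs rows get stuck in the in_gpu_proof status, and the proofs that were skipped are never recovered and backfilled automatically.

Test data

Main node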

zksync_local=# select l1_batch_number, status, created_at from prover_jobs order by created_at desc;
 l1_batch_number |    status    |         created_at         
-----------------+--------------+----------------------------
            1147 | successful   | 2023-09-06 05:58:18.199706
            1146 | successful   | 2023-09-06 02:57:31.196466
            1145 | successful   | 2023-09-06 02:19:07.253418
            1144 | successful   | 2023-09-06 01:46:06.18188
            1143 | in_gpu_proof | 2023-09-05 11:27:41.230814
            1142 | in_gpu_proof | 2023-09-05 11:27:37.413038
            1141 | in_gpu_proof | 2023-09-05 11:27:31.75103
            1140 | in_gpu_proof | 2023-09-05 11:27:28.767585
            1139 | in_gpu_proof | 2023-09-05 11:27:23.971785

Extension node

zksync_local=# select l1_batch_number, status, created_at from prover_jobs order by created_at desc;
 l1_batch_number |    status    |         created_at         
-----------------+--------------+----------------------------
            1147 | in_gpu_proof | 2023-09-07 10:59:47.841166
            1146 | in_gpu_proof | 2023-09-07 10:59:46.586542
            1145 | in_gpu_proof | 2023-09-07 10:59:45.218178
            1144 | in_gpu_proof | 2023-09-07 10:59:43.862459
            1143 | in_gpu_proof | 2023-09-07 10:59:42.610455
            1142 | in_gpu_proof | 2023-09-07 10:59:41.344974
            1141 | in_gpu_proof | 2023-09-07 10:59:39.956309
            1140 | in_gpu_proof | 2023-09-07 10:59:38.589833
            1139 | in_gpu_proof | 2023-09-07 10:59:37.219

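Tracing the code

First, check whether zkSync currently has any self-recovery logic.

  1. Look at the data-access code in core/lib/dal/src/prover_dal.rs and find the self-recovery database operation, requeue_stuck_jobs
  2. Find the service module that actually runs it, housekeeper, which is loaded by default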
    #[arg(
         long,
         default_value = "api,tree,eth,data_fetcher,state_keeper,witness_generator,housekeeper"
     )]
     components: ComponentsToRun,
    2023-09-08T12:43:43.561336Z  INFO zksync_core: Starting the components: [HttpApi, WsApi, ExplorerApi, Tree, EthWatcher, EthTxAggregator, EthTxManager, DataFetcher, StateKeeper, WitnessGenerator(None, BasicCircuits), WitnessGenerator(None, LeafAggregation), WitnessGenerator(None, NodeAggregation), WitnessGenerator(None, Scheduler), Housekeeper]
  3. housekeeper initialization
    2023-09-08T12:43:47.244008Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: GcsBlobCleaner with frequency: 60000 ms
    2023-09-08T12:43:47.244008Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: WitnessGeneratorStatsReporter with frequency: 10000 ms
    2023-09-08T12:43:47.244016Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: GpuProverQueueMonitor with frequency: 10000 ms
    2023-09-08T12:43:47.244037Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: ProverStatsReporter with frequency: 5000 ms
    2023-09-08T12:43:47.244048Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: WaitingToQueuedWitnessJobMover with frequency: 30000 ms
    2023-09-08T12:43:47.244050Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: ProverJobRetryManager with frequency: 300000 ms
    2023-09-08T12:43:47.244030Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: L1BatchMetricsReporter with frequency: 10000 ms
    2023-09-08T12:43:47.244171Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: FriProverJobRetryManager with frequency: 30000 ms
    2023-09-08T12:43:47.244281Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: WaitingToQueuedFriWitnessJobMover with frequency: 40000 ms
    2023-09-08T12:43:47.244293Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: SchedulerCircuitQueuer with frequency: 40000 ms
    2023-09-08T12:43:47.244294Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: WitnessGeneratorStatsReporter with frequency: 10000 ms
    2023-09-08T12:43:47.244300Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: FriWitnessGeneratorJobRetryManager with frequency: 30000 ms
    2023-09-08T12:43:47.244296Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: FriProverStatsReporter with frequency: 30000 ms
    At this point the zksync-server log loops, waiting for new proofs
    2023-09-08T12:43:48.092403Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: submit_pending_proofs finish
    2023-09-08T12:43:48.092417Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: manage proof start try_fetch_proof_to_send
    2023-09-08T12:43:48.092433Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: manage proof: process_resend receiver for start signal
    2023-09-08T12:43:48.445424Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: try_fetch_proof_to_send: check dest block:681 commit
    2023-09-08T12:44:08.927334Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: sequence_batch_info.block_number == 0 at batch:681
  4. Check whether the housekeeper check is actually being executed
    async fn run_routine_task(&mut self) {
        // Requeue jobs that have been stuck longer than processing_timeout
        // and have used fewer than max_attempts attempts.
        let stuck_jobs = self
            .prover_connection_pool
            .access_storage()
            .await
            .prover_dal()
            .requeue_stuck_jobs(self.processing_timeout, self.max_attempts)
            .await;
        let job_len = stuck_jobs.len();
        for stuck_job in stuck_jobs {
            vlog::info!("re-queuing prover job {:?}", stuck_job);
        }
        // Debug logging, enabled temporarily for the test run:
        // vlog::info!("server.prover.requeued_jobs{}", job_len as u64);
        // vlog::info!("server.prover.self.processing_timeout{}", self.processing_timeout.as_secs() as u64);
        // vlog::info!("server.prover.self.max_attempts{}", self.max_attempts as u64);
        metrics::counter!("server.prover.requeued_jobs", job_len as u64);
    }
    self.processing_timeout: 2700s (45 minutes)
    self.max_attempts: 1

    Testing shows that run_routine_task runs periodically as expected, but it does not requeue any jobs:

    2023-09-08T13:13:17.988849Z  INFO zksync_core::house_keeper::prover_job_retry_manager: server.prover.requeued_jobs0
    2023-09-08T13:13:17.988864Z  INFO zksync_core::house_keeper::prover_job_retry_manager: server.prover.self.processing_timeout2700
    2023-09-08T13:13:17.988873Z  INFO zksync_core::house_keeper::prover_job_retry_manager: server.prover.self.max_attempts1
  5. Look at the database operation that run_routine_task performs
    UPDATE prover_jobs
    SET status = 'queued', attempts = attempts + 1, updated_at = now(), processing_started_at = now()
    WHERE (status = 'in_progress' AND  processing_started_at <= now() - $1::interval AND attempts < $2)
    OR (status = 'in_gpu_proof' AND  processing_started_at <= now() - $1::interval AND attempts < $2)
    OR (status = 'failed' AND attempts < $2)
    RETURNING id, status, attempts

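    Compare this against the data for batch 681.

Summary and analysis

zkSync itself does have self-recovery logic: housekeeper->run->run_routine_task->requeue_stuck_jobs.
It then runs the following database query, driven by the two configuration parameters self.processing_timeout and self.max_attempts.

self.processing_timeout: 2700s (45 minutes)
self.max_attempts: 1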
UPDATE prover_jobs
SET status = 'queued', attempts = attempts + 1, updated_at = now(), processing_started_at = now()
WHERE (status = 'in_progress' AND  processing_started_at <= now() - $1::interval AND attempts < $2)
OR (status = 'in_gpu_proof' AND  processing_started_at <= now() - $1::interval AND attempts < $2)
OR (status = 'failed' AND attempts < $2)
RETURNING id, status, attempts

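A job starts in the queued status. Once it has been dispatched to a prover successfully, its status changes to in_gpu_proof. If the job does not finish within processing_timeout, its status is set back to queued and it is dispatched to a prover again (possibly a different one).
Note: attempts is incremented by 1 when the status changes to in_gpu_proof.
Problem: max_attempts defaults to 1, so a stuck job already has attempts = 1, the condition attempts < max_attempts never holds, and run_routine_task requeues nothing.
Fix: increase max_attempts, for example to 5.
Also pay attention to processing_timeout: it should be longer than the time the lowest supported GPU model needs to finish a job on a single card, otherwise compute may be wasted.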

PROVER_NON_GPU_MAX_ATTEMPTS=5
PROVER_NON_GPU_GENERATION_TIMEOUT_IN_SECS=3600
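
The effect of max_attempts can be checked without a database. The sketch below re-expresses the WHERE clause of the requeue query as a plain predicate; the function name and the sample timestamps are illustrative and are not taken from the zkSync codebase.

```python
from datetime import datetime, timedelta


def is_requeueable(status, attempts, processing_started_at, now,
                   processing_timeout, max_attempts):
    """Mirror the WHERE clause of the requeue UPDATE for a single job row."""
    if attempts >= max_attempts:
        return False  # every branch of the query requires attempts < max_attempts
    if status == "failed":
        return True
    if status in ("in_progress", "in_gpu_proof"):
        # Stuck only if it started at least processing_timeout ago.
        return processing_started_at <= now - processing_timeout
    return False


# A job that entered in_gpu_proof days ago and therefore already has attempts = 1.
started = datetime(2023, 9, 5, 11, 27, 41)
now = datetime(2023, 9, 8, 13, 13, 17)

# Default settings: never requeued, however long it has been stuck.
print(is_requeueable("in_gpu_proof", 1, started, now, timedelta(seconds=2700), 1))
# With max_attempts raised to 5 and a 3600 s timeout: requeued.
print(is_requeueable("in_gpu_proof", 1, started, now, timedelta(seconds=3600), 5))
```

The first call prints False and the second prints True, which matches the behaviour observed before and after the configuration change.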

Verifying the fix

After changing the environment variables:

2023-09-08T14:00:44.990770Z  INFO zksync_core::house_keeper::prover_job_retry_manager: re-queuing prover job StuckProverJobs { id: 1149, status: "queued", attempts: 2 }
  1. The corresponding in_gpu_proof jobs are updated to queued
  2. The connected provers start generating proofs