- Edit the url property of the corresponding module in the .gitmodules file;
- Run git submodule sync to write the new URL into .git/config;
- Then initialize the submodule: git submodule init
- Finally, update the submodule: git submodule update
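The first step can be scripted when many checkouts need the same change. The following is a minimal sketch, not a replacement for editing the file by hand: the helper name `set_submodule_url`, the sample submodule name, and both example URLs are hypothetical, and it assumes the usual .gitmodules layout of one `[submodule "name"]` header followed by indented keys.

```python
import re

def set_submodule_url(gitmodules_text: str, name: str, new_url: str) -> str:
    """Return .gitmodules content with the url of submodule `name` replaced."""
    out, in_target = [], False
    for line in gitmodules_text.splitlines():
        header = re.match(r'\s*\[submodule "(.+)"\]\s*$', line)
        if header:
            # Track whether we are inside the section we want to change.
            in_target = header.group(1) == name
        elif in_target and re.match(r"\s*url\s*=", line):
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}url = {new_url}"
        out.append(line)
    return "\n".join(out) + "\n"

# The remaining steps from the list, in the order they should be run.
FOLLOW_UP_COMMANDS = [
    ["git", "submodule", "sync"],
    ["git", "submodule", "init"],
    ["git", "submodule", "update"],
]

# Hypothetical sample content and URLs, for illustration only.
SAMPLE = (
    '[submodule "libs/foo"]\n'
    "\tpath = libs/foo\n"
    "\turl = https://old.example.com/foo.git\n"
)
UPDATED = set_submodule_url(SAMPLE, "libs/foo", "https://new.example.com/foo.git")
print(UPDATED)
```

Each entry of `FOLLOW_UP_COMMANDS` could be passed to `subprocess.run(cmd, check=True)` from the repository root once the file has been rewritten.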
Git: making a changed URL in the .gitmodules file take effect
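zkSync: finding the L1 transaction hashes of CommitBlocks, PublishProofBlocksOnchain, and ExecuteBlocks from an L2-to-L1 cross-chain transaction hash
Cross-chain transaction that started the lookup: 0x0bbbcf153f17ec0e1f12d698bbc64f9d242bc1cd1312dd5f34febf0e6cb6601a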
- Look up tx_hash in the l2_to_l1_logs table to get the miniblock_number it belongs to, which is 15841.
- Use that miniblock_number as number to query the miniblocks table:
    select * from miniblocks where number=15841;
- This gives the l1_batch_number that contains the cross-chain transaction, which is 7670.
- Use that l1_batch_number as number to query the l1_batches table:
    select number,is_finished,eth_commit_tx_id,eth_prove_tx_id,eth_execute_tx_id from l1_batches where number=7670;
Result:
number | is_finished | eth_commit_tx_id | eth_prove_tx_id | eth_execute_tx_id |
---|---|---|---|---|
7670 | t | 41073 | 41148 | 41151 |
- Query eth_txs by eth_commit_tx_id to get the transaction record for CommitBlocks:
    select nonce,contract_address,tx_type,has_failed,sent_at_block,tx_status,confirmed_eth_tx_history_id from eth_txs where id=41073 ORDER BY updated_at DESC limit 1;
nonce | contract_address | tx_type | has_failed | sent_at_block | tx_status | confirmed_eth_tx_history_id |
---|---|---|---|---|---|---|
41091 | 0x5e3e5f6ef0e21f0cf5b4c3acd3cf29740b1cbbd8 | CommitBlocks | f | | Done | 43194 |
- Use confirmed_eth_tx_history_id to get the transaction hash of CommitBlocks:
    select eth_tx_id,tx_hash,confirmed_at from eth_txs_history where id=43194 ORDER BY updated_at DESC limit 10;
eth_tx_id | tx_hash | confirmed_at |
---|---|---|
41073 | 0x0b01e199877faef52b95477119f53bf546a2915bc903132331f41542e58da53d | 2023-09-18 03:50:19.059576 |
- Query eth_prove_tx_id and eth_execute_tx_id in the same way to get their transaction hashes.
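The manual steps above amount to one chain of joins. The sketch below replays that chain against an in-memory SQLite database so it can run anywhere. It is illustrative only: the real tables live in PostgreSQL, the schema here contains just the columns named in the steps, hashes are stored as plain text (an assumption), and the helper name `l1_tx_hash` is hypothetical. Only the CommitBlocks row is loaded, because that is the only hash the walkthrough shows.

```python
import sqlite3

# Reduced schema: only the columns used in the walkthrough (an assumption).
SCHEMA = """
CREATE TABLE l2_to_l1_logs (tx_hash TEXT, miniblock_number INTEGER);
CREATE TABLE miniblocks (number INTEGER, l1_batch_number INTEGER);
CREATE TABLE l1_batches (number INTEGER, eth_commit_tx_id INTEGER,
                         eth_prove_tx_id INTEGER, eth_execute_tx_id INTEGER);
CREATE TABLE eth_txs (id INTEGER, tx_type TEXT, confirmed_eth_tx_history_id INTEGER);
CREATE TABLE eth_txs_history (id INTEGER, eth_tx_id INTEGER, tx_hash TEXT);
"""

STAGE_COLUMNS = {
    "commit": "eth_commit_tx_id",
    "prove": "eth_prove_tx_id",
    "execute": "eth_execute_tx_id",
}

def l1_tx_hash(conn, l2_tx_hash, stage):
    """Follow log -> miniblock -> batch -> eth_txs -> eth_txs_history."""
    column = STAGE_COLUMNS[stage]  # taken from a fixed whitelist, safe to interpolate
    row = conn.execute(
        f"""
        SELECT h.tx_hash
        FROM l2_to_l1_logs AS log
        JOIN miniblocks AS m ON m.number = log.miniblock_number
        JOIN l1_batches AS b ON b.number = m.l1_batch_number
        JOIN eth_txs AS t ON t.id = b.{column}
        JOIN eth_txs_history AS h ON h.id = t.confirmed_eth_tx_history_id
        WHERE log.tx_hash = ?
        """,
        (l2_tx_hash,),
    ).fetchone()
    return row[0] if row else None

# Values copied from the walkthrough above.
L2_TX = "0x0bbbcf153f17ec0e1f12d698bbc64f9d242bc1cd1312dd5f34febf0e6cb6601a"
COMMIT_TX = "0x0b01e199877faef52b95477119f53bf546a2915bc903132331f41542e58da53d"

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO l2_to_l1_logs VALUES (?, 15841)", (L2_TX,))
conn.execute("INSERT INTO miniblocks VALUES (15841, 7670)")
conn.execute("INSERT INTO l1_batches VALUES (7670, 41073, 41148, 41151)")
conn.execute("INSERT INTO eth_txs VALUES (41073, 'CommitBlocks', 43194)")
conn.execute("INSERT INTO eth_txs_history VALUES (43194, 41073, ?)", (COMMIT_TX,))

COMMIT_HASH = l1_tx_hash(conn, L2_TX, "commit")
PROVE_HASH = l1_tx_hash(conn, L2_TX, "prove")  # None: that row is not loaded here
print(COMMIT_HASH)
```

Against the real database the same join would need whatever hash encoding the tx_hash columns actually use; that detail is not covered by the notes above.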
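zkSync Era outage investigation
Background
External reports said:
September 12: according to the zkSync Era block explorer, zkSync Era mainnet appears to be down. The latest batch zkSync Era submitted to Ethereum is #208455, at 14:14. The block height has stopped at #13641404, and no blocks have been produced for 37 minutes.
Why we followed up
Some of our current projects are built on zkSync Era, so we need to establish what caused the problem, whether there is an official fix in a newer release, and whether the version we run today has the same problem.
Confirming the issue
Start from the explorer data: check whether the problem described in the report actually exists, and work out where it sits (block explorer or chain nodes).
Check batch heights and times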
Batch height | Block time | Link | Position |
---|---|---|---|
208455 | 2023-09-12 14:14 | https://explorer.zksync.io/batch/208455 | before |
208456 | 2023-09-12 14:14 | https://explorer.zksync.io/batch/208456 | after |
208457 | 2023-09-12 14:15 | https://explorer.zksync.io/batch/208457 | after+1 |
The time gap between 208455 and 208456 is as expected.
Check block heights and times
Block height | Batch height | Committed time | Link | Position |
---|---|---|---|---|
13641404 | 208456 | 2023-09-12 14:14 | https://explorer.zksync.io/block/13641404 | before |
13641405 | 208456 | 2023-09-12 14:14 | https://explorer.zksync.io/block/13641405 | after |
13641406 | 208456 | 2023-09-12 14:14 | https://explorer.zksync.io/block/13641406 | after+1 |
Check batch commit times
Batch height | Commit time | Commit tx hash | Position |
---|---|---|---|
208455 | Sep-12-2023 06:15:59 AM +UTC | https://etherscan.io/tx/0x369446bc9d99087aa1160d426b7af372dce91bb7d372724b7c529f2e3ff30ecd | before |
208456 | Sep-12-2023 06:16:35 AM +UTC | https://etherscan.io/tx/0x902b3b0eee2e82ef048e8de8ec0417d7875c0930b5b0b893de48f8b5b59f8944 | after |
208457 | Sep-12-2023 06:17:59 AM +UTC | https://etherscan.io/tx/0x7ce6d03ead9117a0a7268042c6e19637c702df2382070f04442df75602461661 | after+1 |
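Findings
Comparing the batch and block generation times on the node, together with the batch commit times, against the time differences around the #208455 mentioned in the report, the chain data shows no outage.
Most likely the block explorer service, or some of the RPC nodes that feed it data, fell behind on block synchronization at that moment.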
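The commit-time comparison can be checked with a few lines of arithmetic. This is a small sketch using only the three Etherscan commit times listed in the commit table; the function name `commit_gaps` is illustrative.

```python
from datetime import datetime

# Commit times copied from the batch commit table (UTC).
COMMITS = {
    208455: "Sep-12-2023 06:15:59 AM",
    208456: "Sep-12-2023 06:16:35 AM",
    208457: "Sep-12-2023 06:17:59 AM",
}

def commit_gaps(commits):
    """Return {(previous_batch, batch): seconds between their L1 commits}."""
    parsed = sorted(
        (batch, datetime.strptime(text, "%b-%d-%Y %I:%M:%S %p"))
        for batch, text in commits.items()
    )
    return {
        (prev[0], cur[0]): int((cur[1] - prev[1]).total_seconds())
        for prev, cur in zip(parsed, parsed[1:])
    }

GAPS = commit_gaps(COMMITS)
print(GAPS)  # {(208455, 208456): 36, (208456, 208457): 84}
```

Both gaps are well under two minutes, nowhere near the 37-minute pause in the report, which is what points away from the chain itself.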
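How to avoid it
Run the explorer and the RPC nodes that serve queries as multiple primary/standby and disaster-recovery instances, with real-time height checks and automatic failover between them.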
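One way to picture the height check and automatic switch is the selection rule below. It is a sketch under stated assumptions, not a description of any existing tool: the function name `pick_endpoint`, the endpoint names, the `max_lag` threshold, and the standby heights are all hypothetical, and collecting the heights (for example by polling each node) is left out.

```python
def pick_endpoint(heights, primary, max_lag=5):
    """Choose the endpoint to serve from.

    heights maps endpoint name -> latest block height, or None if unreachable.
    The primary is kept unless it is down or more than max_lag blocks behind
    the best reachable endpoint.
    """
    live = {name: h for name, h in heights.items() if h is not None}
    if not live:
        raise RuntimeError("no RPC endpoint is reachable")
    best = max(live.values())
    if primary in live and best - live[primary] <= max_lag:
        return primary
    return max(live, key=live.get)

# Hypothetical snapshot: the primary is stuck at the height from the report,
# one standby has moved on, another is unreachable.
STALLED = {"primary": 13641404, "standby-a": 13641460, "standby-b": None}
HEALTHY = {"primary": 13641458, "standby-a": 13641460, "standby-b": None}

print(pick_endpoint(STALLED, "primary"))
print(pick_endpoint(HEALTHY, "primary"))
```

Preferring the primary while it is within `max_lag` avoids flapping between endpoints that differ by a block or two.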
GPU driver and CUDA version reference
Driver Version | CUDA Version | docker image |
---|---|---|
510 | 11.6 | docker pull nvidia/cuda:11.6.2-runtime-ubuntu20.04 |
525 | 12.0 | docker pull nvidia/cuda:12.0.0-runtime-ubuntu20.04 |
530 | 12.1 | docker pull nvidia/cuda:12.1.0-runtime-ubuntu20.04 |
535 | 12.2 | docker pull nvidia/cuda:12.2.0-runtime-ubuntu20.04 |
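The host server must have the matching GPU driver installed (it usually already does), and the CUDA image used must correspond to that driver.
Prover Docker images provided externally all use Ubuntu 20.04.
- base: available since CUDA 9.0, contains the bare minimum (libcudart) needed to deploy a pre-built CUDA application. Use this image if you want to choose manually which CUDA packages to install.
- runtime: extends the base image by adding all the shared libraries from the CUDA toolkit. Use this image if you have a pre-built application that uses multiple CUDA libraries.
- devel: extends the runtime image by adding the compiler toolchain, debugging tools, headers, and static libraries. Use this image to compile a CUDA application from source.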
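The driver-to-image table can be kept as a small lookup so deployment scripts pick the image instead of a person. A minimal sketch: the names `RUNTIME_IMAGES` and `image_for_driver` are illustrative, the tags are exactly the runtime tags from the table, and the fallback for drivers that sit between two table rows relies on newer drivers being able to run images built for older CUDA releases.

```python
# Driver major version -> runtime image, copied from the table above.
RUNTIME_IMAGES = {
    510: "nvidia/cuda:11.6.2-runtime-ubuntu20.04",
    525: "nvidia/cuda:12.0.0-runtime-ubuntu20.04",
    530: "nvidia/cuda:12.1.0-runtime-ubuntu20.04",
    535: "nvidia/cuda:12.2.0-runtime-ubuntu20.04",
}

def image_for_driver(driver_major: int) -> str:
    """Return the newest listed image whose driver requirement is met."""
    candidates = [d for d in RUNTIME_IMAGES if d <= driver_major]
    if not candidates:
        raise ValueError(f"driver {driver_major} is older than any listed entry")
    return RUNTIME_IMAGES[max(candidates)]

print(image_for_driver(525))
print(image_for_driver(520))  # falls back to the 510 row
```

The driver major version on a host is the first number of the "Driver Version" field that `nvidia-smi` prints.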
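Proofs not generated in order: analysis and fix
Problem description
When a prover misbehaves, or the services become unstable for some other reason, prover_jobs get stuck in the in_gpu_proof status, and the proofs that were skipped are never picked up again and filled in automatically.
Test data
Main node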
zksync_local=# select l1_batch_number, status, created_at from prover_jobs order by created_at desc;
l1_batch_number | status | created_at
-----------------+--------------+----------------------------
1147 | successful | 2023-09-06 05:58:18.199706
1146 | successful | 2023-09-06 02:57:31.196466
1145 | successful | 2023-09-06 02:19:07.253418
1144 | successful | 2023-09-06 01:46:06.18188
1143 | in_gpu_proof | 2023-09-05 11:27:41.230814
1142 | in_gpu_proof | 2023-09-05 11:27:37.413038
1141 | in_gpu_proof | 2023-09-05 11:27:31.75103
1140 | in_gpu_proof | 2023-09-05 11:27:28.767585
1139 | in_gpu_proof | 2023-09-05 11:27:23.971785
Extension node
zksync_local=# select l1_batch_number, status, created_at from prover_jobs order by created_at desc;
l1_batch_number | status | created_at
-----------------+--------------+----------------------------
1147 | in_gpu_proof | 2023-09-07 10:59:47.841166
1146 | in_gpu_proof | 2023-09-07 10:59:46.586542
1145 | in_gpu_proof | 2023-09-07 10:59:45.218178
1144 | in_gpu_proof | 2023-09-07 10:59:43.862459
1143 | in_gpu_proof | 2023-09-07 10:59:42.610455
1142 | in_gpu_proof | 2023-09-07 10:59:41.344974
1141 | in_gpu_proof | 2023-09-07 10:59:39.956309
1140 | in_gpu_proof | 2023-09-07 10:59:38.589833
1139 | in_gpu_proof | 2023-09-07 10:59:37.219
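The main-node listing shows the symptom directly: batches 1144 to 1147 are proven while older ones are still stuck. A short sketch of how such gaps could be spotted automatically follows. The function name `skipped_batches` is hypothetical, and it assumes one row per batch as in the listing, which may not hold for the full prover_jobs table.

```python
# (l1_batch_number, status) pairs copied from the main node listing.
MAIN_NODE_ROWS = [
    (1147, "successful"), (1146, "successful"), (1145, "successful"),
    (1144, "successful"), (1143, "in_gpu_proof"), (1142, "in_gpu_proof"),
    (1141, "in_gpu_proof"), (1140, "in_gpu_proof"), (1139, "in_gpu_proof"),
]

def skipped_batches(rows):
    """Return unfinished batches that are older than the newest successful one."""
    done = [batch for batch, status in rows if status == "successful"]
    if not done:
        return []  # nothing has been proven yet, so nothing was skipped
    newest_done = max(done)
    return sorted(
        batch for batch, status in rows
        if status != "successful" and batch < newest_done
    )

SKIPPED = skipped_batches(MAIN_NODE_ROWS)
print(SKIPPED)  # [1139, 1140, 1141, 1142, 1143]
```

Run against the extension node listing, where every row is still in_gpu_proof, the same function returns an empty list, since no batch there has been proven ahead of another.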
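Code walkthrough
First check whether zkSync currently has any self-recovery logic.
- In the data access code core/lib/dal/src/prover_dal.rs, find the self-recovery database operation requeue_stuck_jobs
- Find the service module that actually runs it, housekeeper, which is loaded by default
    #[arg(
        long,
        default_value = "api,tree,eth,data_fetcher,state_keeper,witness_generator,housekeeper"
    )]
    components: ComponentsToRun,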
2023-09-08T12:43:43.561336Z INFO zksync_core: Starting the components: [HttpApi, WsApi, ExplorerApi, Tree, EthWatcher, EthTxAggregator, EthTxManager, DataFetcher, StateKeeper, WitnessGenerator(None, BasicCircuits), WitnessGenerator(None, LeafAggregation), WitnessGenerator(None, NodeAggregation), WitnessGenerator(None, Scheduler), Housekeeper]
- housekeeper initialization
2023-09-08T12:43:47.244008Z INFO zksync_core::house_keeper::periodic_job: Starting periodic job: GcsBlobCleaner with frequency: 60000 ms
2023-09-08T12:43:47.244008Z INFO zksync_core::house_keeper::periodic_job: Starting periodic job: WitnessGeneratorStatsReporter with frequency: 10000 ms
2023-09-08T12:43:47.244016Z INFO zksync_core::house_keeper::periodic_job: Starting periodic job: GpuProverQueueMonitor with frequency: 10000 ms
2023-09-08T12:43:47.244037Z INFO zksync_core::house_keeper::periodic_job: Starting periodic job: ProverStatsReporter with frequency: 5000 ms
2023-09-08T12:43:47.244048Z INFO zksync_core::house_keeper::periodic_job: Starting periodic job: WaitingToQueuedWitnessJobMover with frequency: 30000 ms
2023-09-08T12:43:47.244050Z INFO zksync_core::house_keeper::periodic_job: Starting periodic job: ProverJobRetryManager with frequency: 300000 ms
2023-09-08T12:43:47.244030Z INFO zksync_core::house_keeper::periodic_job: Starting periodic job: L1BatchMetricsReporter with frequency: 10000 ms
2023-09-08T12:43:47.244171Z INFO zksync_core::house_keeper::periodic_job: Starting periodic job: FriProverJobRetryManager with frequency: 30000 ms
2023-09-08T12:43:47.244281Z INFO zksync_core::house_keeper::periodic_job: Starting periodic job: WaitingToQueuedFriWitnessJobMover with frequency: 40000 ms
2023-09-08T12:43:47.244293Z INFO zksync_core::house_keeper::periodic_job: Starting periodic job: SchedulerCircuitQueuer with frequency: 40000 ms
2023-09-08T12:43:47.244294Z INFO zksync_core::house_keeper::periodic_job: Starting periodic job: WitnessGeneratorStatsReporter with frequency: 10000 ms
2023-09-08T12:43:47.244300Z INFO zksync_core::house_keeper::periodic_job: Starting periodic job: FriWitnessGeneratorJobRetryManager with frequency: 30000 ms
2023-09-08T12:43:47.244296Z INFO zksync_core::house_keeper::periodic_job: Starting periodic job: FriProverStatsReporter with frequency: 30000 ms
At this point the zksync-server log loops while waiting for new proofs:
2023-09-08T12:43:48.092403Z INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: submit_pending_proofs finish
2023-09-08T12:43:48.092417Z INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: manage proof start try_fetch_proof_to_send
2023-09-08T12:43:48.092433Z INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: manage proof: process_resend receiver for start signal
2023-09-08T12:43:48.445424Z INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: try_fetch_proof_to_send: check dest block:681 commit
2023-09-08T12:44:08.927334Z INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: sequence_batch_info.block_number == 0 at batch:681
- Check whether the housekeeper check actually runs
    async fn run_routine_task(&mut self) {
        // Ask the DAL to move stuck jobs back to the queue.
        let stuck_jobs = self
            .prover_connection_pool
            .access_storage()
            .await
            .prover_dal()
            .requeue_stuck_jobs(self.processing_timeout, self.max_attempts)
            .await;
        let job_len = stuck_jobs.len();
        for stuck_job in stuck_jobs {
            vlog::info!("re-queuing prover job {:?}", stuck_job);
        }
        // Extra logging used while debugging:
        // vlog::info!("server.prover.requeued_jobs{}", job_len as u64);
        // vlog::info!("server.prover.self.processing_timeout{}", self.processing_timeout.as_secs() as u64);
        // vlog::info!("server.prover.self.max_attempts{}", self.max_attempts as u64);
        metrics::counter!("server.prover.requeued_jobs", job_len as u64);
    }
self.processing_timeout: 2700s (45 minutes)
self.max_attempts: 1
Testing shows that run_routine_task is executed periodically as expected, but it does not requeue any job:
2023-09-08T13:13:17.988849Z INFO zksync_core::house_keeper::prover_job_retry_manager: server.prover.requeued_jobs0
2023-09-08T13:13:17.988864Z INFO zksync_core::house_keeper::prover_job_retry_manager: server.prover.self.processing_timeout2700
2023-09-08T13:13:17.988873Z INFO zksync_core::house_keeper::prover_job_retry_manager: server.prover.self.max_attempts1
- Look at the database operation that run_routine_task performs
    UPDATE prover_jobs
    SET status = 'queued', attempts = attempts + 1, updated_at = now(), processing_started_at = now()
    WHERE (status = 'in_progress' AND processing_started_at <= now() - $1::interval AND attempts < $2)
    OR (status = 'in_gpu_proof' AND processing_started_at <= now() - $1::interval AND attempts < $2)
    OR (status = 'failed' AND attempts < $2)
    RETURNING id, status, attempts
Compare against the data for batch 681
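Summary and analysis
zkSync itself does have self-recovery logic: housekeeper->run->run_routine_task->requeue_stuck_jobs
It then runs the database query using the two configuration parameters self.processing_timeout and self.max_attempts
self.processing_timeout: 2700s (45 minutes)
self.max_attempts: 1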
UPDATE prover_jobs
SET status = 'queued', attempts = attempts + 1, updated_at = now(), processing_started_at = now()
WHERE (status = 'in_progress' AND processing_started_at <= now() - $1::interval AND attempts < $2)
OR (status = 'in_gpu_proof' AND processing_started_at <= now() - $1::interval AND attempts < $2)
OR (status = 'failed' AND attempts < $2)
RETURNING id, status, attempts
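A job starts in the queued status. Once it has been dispatched to a prover, its status changes to in_gpu_proof. If the job does not finish within processing_timeout, its status is set back to queued and it is dispatched to a prover again (possibly a different one).
Note: attempts is incremented by 1 when the status changes to in_gpu_proof.
Problem: max_attempts defaults to 1, so a job that has already been dispatched once has attempts = 1, the condition attempts < max_attempts is never true, and run_routine_task requeues nothing.
Fix: increase max_attempts, for example to 5.
Also pay attention to processing_timeout: it should be longer than the time a single card of the lowest supported GPU type needs to finish a proof, otherwise compute may be wasted on jobs that are requeued while still running.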
PROVER_NON_GPU_MAX_ATTEMPTS=5
PROVER_NON_GPU_GENERATION_TIMEOUT_IN_SECS=3600
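To see why the default blocks recovery, the WHERE clause of the UPDATE can be restated as a plain predicate and evaluated for one stuck job. This is a sketch, not the server's code: `should_requeue` is a hypothetical name, and the start time is an assumption that reuses the created_at value of batch 1139 from the main-node listing, since processing_started_at is not shown in the data above.

```python
from datetime import datetime, timedelta

def should_requeue(status, attempts, processing_started_at, now, timeout, max_attempts):
    """Mirror the WHERE clause of the requeue UPDATE for a single job."""
    if attempts >= max_attempts:
        return False  # every branch of the WHERE clause requires attempts < $2
    if status == "failed":
        return True
    if status in ("in_progress", "in_gpu_proof"):
        return processing_started_at <= now - timeout
    return False

# Assumed start time (created_at of batch 1139) and the time of a housekeeper run.
STARTED = datetime(2023, 9, 5, 11, 27, 23)
NOW = datetime(2023, 9, 8, 13, 13, 17)

# attempts is already 1 because the job was dispatched once.
WITH_DEFAULT = should_requeue(
    "in_gpu_proof", 1, STARTED, NOW, timedelta(seconds=2700), max_attempts=1
)
WITH_FIX = should_requeue(
    "in_gpu_proof", 1, STARTED, NOW, timedelta(seconds=3600), max_attempts=5
)
print(WITH_DEFAULT, WITH_FIX)  # False True
```

The timeout is exceeded by days in both cases, so the only thing that changes the outcome is the attempts limit.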
Verifying the fix
After changing the environment variables:
2023-09-08T14:00:44.990770Z INFO zksync_core::house_keeper::prover_job_retry_manager: re-queuing prover job StuckProverJobs { id: 1149, status: "queued", attempts: 2 }
- The corresponding in_gpu_proof jobs are updated to queued
- The connected provers start generating proofs