GPU driver and CUDA version mapping

Driver Version   CUDA Version   Docker image
510              11.6           docker pull nvidia/cuda:11.6.2-runtime-ubuntu20.04
525              12.0           docker pull nvidia/cuda:12.0.0-runtime-ubuntu20.04
530              12.1           docker pull nvidia/cuda:12.1.0-runtime-ubuntu20.04
535              12.2           docker pull nvidia/cuda:12.2.0-runtime-ubuntu20.04

The server must have the matching GPU driver installed (it usually already does); the CUDA image you use must correspond to that driver version.

The externally shipped prover Docker images all use Ubuntu 20.04.

  • base: starting with CUDA 9.0, contains the bare minimum (libcudart) needed to deploy a pre-built CUDA application. Use this image if you want to manually select which CUDA packages to install.
  • runtime: extends the base image by adding all the shared libraries from the CUDA toolkit. Use this image if you have a pre-built application that uses several CUDA libraries.
  • devel: extends the runtime image by adding the compiler toolchain, debugging tools, headers, and static libraries. Use this image to compile CUDA applications from source.

Data source: https://hub.docker.com/r/nvidia/cuda/tags

Proofs not generated in order: analysis and fix

Problem description

When a prover crashes, or the services become unstable for some other reason, prover_jobs get stuck in the in_gpu_proof state, and the skipped proofs cannot self-recover and be completed.

Test data

Main node

zksync_local=# select l1_batch_number, status, created_at from prover_jobs order by created_at desc;
 l1_batch_number |    status    |         created_at         
-----------------+--------------+----------------------------
            1147 | successful   | 2023-09-06 05:58:18.199706
            1146 | successful   | 2023-09-06 02:57:31.196466
            1145 | successful   | 2023-09-06 02:19:07.253418
            1144 | successful   | 2023-09-06 01:46:06.18188
            1143 | in_gpu_proof | 2023-09-05 11:27:41.230814
            1142 | in_gpu_proof | 2023-09-05 11:27:37.413038
            1141 | in_gpu_proof | 2023-09-05 11:27:31.75103
            1140 | in_gpu_proof | 2023-09-05 11:27:28.767585
            1139 | in_gpu_proof | 2023-09-05 11:27:23.971785

External node

zksync_local=# select l1_batch_number, status, created_at from prover_jobs order by created_at desc;
 l1_batch_number |    status    |         created_at         
-----------------+--------------+----------------------------
            1147 | in_gpu_proof | 2023-09-07 10:59:47.841166
            1146 | in_gpu_proof | 2023-09-07 10:59:46.586542
            1145 | in_gpu_proof | 2023-09-07 10:59:45.218178
            1144 | in_gpu_proof | 2023-09-07 10:59:43.862459
            1143 | in_gpu_proof | 2023-09-07 10:59:42.610455
            1142 | in_gpu_proof | 2023-09-07 10:59:41.344974
            1141 | in_gpu_proof | 2023-09-07 10:59:39.956309
            1140 | in_gpu_proof | 2023-09-07 10:59:38.589833
            1139 | in_gpu_proof | 2023-09-07 10:59:37.219

Following the code

First, confirm whether the current zkSync already has self-recovery logic:

  1. In the data-access layer core/lib/dal/src/prover_dal.rs, find the self-recovery database operation requeue_stuck_jobs
  2. Find the service module that runs it, housekeeper, which is loaded by default:
    #[arg(
         long,
         default_value = "api,tree,eth,data_fetcher,state_keeper,witness_generator,housekeeper"
     )]
     components: ComponentsToRun,
    2023-09-08T12:43:43.561336Z  INFO zksync_core: Starting the components: [HttpApi, WsApi, ExplorerApi, Tree, EthWatcher, EthTxAggregator, EthTxManager, DataFetcher, StateKeeper, WitnessGenerator(None, BasicCircuits), WitnessGenerator(None, LeafAggregation), WitnessGenerator(None, NodeAggregation), WitnessGenerator(None, Scheduler), Housekeeper]
  3. housekeeper initialization:
    2023-09-08T12:43:47.244008Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: GcsBlobCleaner with frequency: 60000 ms
    2023-09-08T12:43:47.244008Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: WitnessGeneratorStatsReporter with frequency: 10000 ms
    2023-09-08T12:43:47.244016Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: GpuProverQueueMonitor with frequency: 10000 ms
    2023-09-08T12:43:47.244037Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: ProverStatsReporter with frequency: 5000 ms
    2023-09-08T12:43:47.244048Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: WaitingToQueuedWitnessJobMover with frequency: 30000 ms
    2023-09-08T12:43:47.244050Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: ProverJobRetryManager with frequency: 300000 ms
    2023-09-08T12:43:47.244030Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: L1BatchMetricsReporter with frequency: 10000 ms
    2023-09-08T12:43:47.244171Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: FriProverJobRetryManager with frequency: 30000 ms
    2023-09-08T12:43:47.244281Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: WaitingToQueuedFriWitnessJobMover with frequency: 40000 ms
    2023-09-08T12:43:47.244293Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: SchedulerCircuitQueuer with frequency: 40000 ms
    2023-09-08T12:43:47.244294Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: WitnessGeneratorStatsReporter with frequency: 10000 ms
    2023-09-08T12:43:47.244300Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: FriWitnessGeneratorJobRetryManager with frequency: 30000 ms
    2023-09-08T12:43:47.244296Z  INFO zksync_core::house_keeper::periodic_job: Starting periodic job: FriProverStatsReporter with frequency: 30000 ms
    Meanwhile the zksync-server log loops, waiting for new proofs:
    2023-09-08T12:43:48.092403Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: submit_pending_proofs finish
    2023-09-08T12:43:48.092417Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: manage proof start try_fetch_proof_to_send
    2023-09-08T12:43:48.092433Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: manage proof: process_resend receiver for start signal
    2023-09-08T12:43:48.445424Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: try_fetch_proof_to_send: check dest block:681 commit
    2023-09-08T12:44:08.927334Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: sequence_batch_info.block_number == 0 at batch:681
  4. Check whether the housekeeper task is actually being executed:
    async fn run_routine_task(&mut self) {
        // Requeue jobs that have exceeded processing_timeout and still have attempts left.
        let stuck_jobs = self
            .prover_connection_pool
            .access_storage()
            .await
            .prover_dal()
            .requeue_stuck_jobs(self.processing_timeout, self.max_attempts)
            .await;
        let job_len = stuck_jobs.len();
        for stuck_job in stuck_jobs {
            vlog::info!("re-queuing prover job {:?}", stuck_job);
        }
        // Debug lines added during this investigation; they produce the log output quoted below.
        // vlog::info!("server.prover.requeued_jobs{}", job_len as u64);
        // vlog::info!("server.prover.self.processing_timeout{}", self.processing_timeout.as_secs() as u64);
        // vlog::info!("server.prover.self.max_attempts{}", self.max_attempts as u64);
        metrics::counter!("server.prover.requeued_jobs", job_len as u64);
    }
    self.processing_timeout: 2700 s (45 minutes)
    self.max_attempts: 1

    Testing shows that run_routine_task is executed periodically as expected, but it never requeues anything:

    2023-09-08T13:13:17.988849Z  INFO zksync_core::house_keeper::prover_job_retry_manager: server.prover.requeued_jobs0
    2023-09-08T13:13:17.988864Z  INFO zksync_core::house_keeper::prover_job_retry_manager: server.prover.self.processing_timeout2700
    2023-09-08T13:13:17.988873Z  INFO zksync_core::house_keeper::prover_job_retry_manager: server.prover.self.max_attempts1
  5. Inspect the database operation behind run_routine_task:
    UPDATE prover_jobs
    SET status = 'queued', attempts = attempts + 1, updated_at = now(), processing_started_at = now()
    WHERE (status = 'in_progress' AND  processing_started_at <= now() - $1::interval AND attempts < $2)
    OR (status = 'in_gpu_proof' AND  processing_started_at <= now() - $1::interval AND attempts < $2)
    OR (status = 'failed' AND attempts < $2)
    RETURNING id, status, attempts

    Cross-checked against the data for batch 681.

Summary and analysis

zkSync itself does have self-recovery logic: housekeeper -> run -> run_routine_task -> requeue_stuck_jobs.
It issues the database query below, driven by the two configuration parameters self.processing_timeout and self.max_attempts:

self.processing_timeout: 2700 s (45 minutes)
self.max_attempts: 1
UPDATE prover_jobs
SET status = 'queued', attempts = attempts + 1, updated_at = now(), processing_started_at = now()
WHERE (status = 'in_progress' AND  processing_started_at <= now() - $1::interval AND attempts < $2)
OR (status = 'in_gpu_proof' AND  processing_started_at <= now() - $1::interval AND attempts < $2)
OR (status = 'failed' AND attempts < $2)
RETURNING id, status, attempts

A job's initial status is queued. When the job is successfully dispatched to a prover, the status changes to in_gpu_proof. If the job does not finish within processing_timeout, its status is reset to queued and it is dispatched again (possibly to a different prover).
Note: when the status changes to in_gpu_proof, attempts is incremented by 1.
Problem: because max_attempts defaults to 1, the condition attempts < max_attempts is already false after the first dispatch, so run_routine_task never requeues anything.
Fix: raise max_attempts to 5.
Also make sure processing_timeout exceeds the single-card proving time of the slowest supported GPU type; otherwise jobs get requeued while still running and compute is wasted.

PROVER_NON_GPU_MAX_ATTEMPTS=5
PROVER_NON_GPU_GENERATION_TIMEOUT_IN_SECS=3600
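
To make the interaction of these two settings concrete, here is a minimal, self-contained sketch of the requeue predicate from the SQL above (the standalone function and its parameter names are illustrative, not zkSync's actual types):

fn should_requeue(
    status: &str,
    attempts: u32,
    stuck_for_secs: u64,
    processing_timeout_secs: u64,
    max_attempts: u32,
) -> bool {
    match status {
        // Stuck jobs are requeued only if they timed out AND have attempts left.
        "in_progress" | "in_gpu_proof" => {
            stuck_for_secs >= processing_timeout_secs && attempts < max_attempts
        }
        // Failed jobs only need attempts left.
        "failed" => attempts < max_attempts,
        _ => false,
    }
}

fn main() {
    // Dispatching a job already sets attempts = 1, so with the default
    // max_attempts = 1 a job stuck in in_gpu_proof is never requeued...
    assert!(!should_requeue("in_gpu_proof", 1, 3600, 2700, 1));
    // ...while with max_attempts = 5 the same stuck job is picked up again.
    assert!(should_requeue("in_gpu_proof", 1, 3600, 2700, 5));
}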

Verifying the fix

After changing the environment variables:

2023-09-08T14:00:44.990770Z  INFO zksync_core::house_keeper::prover_job_retry_manager: re-queuing prover job StuckProverJobs { id: 1149, status: "queued", attempts: 2 }
  1. The corresponding in_gpu_proof jobs are updated to queued
  2. The connected provers start generating proofs

zkSync block revert

Failure scenario

Note: heights that have already been executed cannot be reverted; attempting it fails with Attempt to revert already executed blocks.
So for this test, force-stop the prover and keep sending transactions to create a scenario where verification lags behind.

Query the latest data

zk f cargo run --bin block_reverter --release -- print-suggested-values --json

The returned data contains several fields:

  • committed: latest committed height
  • verified: latest verified height
  • executed: latest locally executed height
  • last_executed_l1_batch_number: suggested height to revert to
  • nonce: nonce for the revert transaction
  • priority_fee: suggested fee
    2023-09-13T10:25:08.901399Z  INFO zksync_core::block_reverter: Last L1 batch numbers on contract: committed 1171, verified 1170, executed 1170
    {"last_executed_l1_batch_number":1170,"nonce":3867,"priority_fee":1000000000}

    At this point the latest on-chain height is committed: 1171.
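
The --json output shown above is convenient for scripting. A minimal parsing sketch, assuming serde and serde_json as dependencies (the SuggestedValues struct is hypothetical; its field names come from the sample output):

use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct SuggestedValues {
    last_executed_l1_batch_number: u64,
    nonce: u64,
    priority_fee: u64,
}

fn main() -> Result<(), serde_json::Error> {
    let raw = r#"{"last_executed_l1_batch_number":1170,"nonce":3867,"priority_fee":1000000000}"#;
    let values: SuggestedValues = serde_json::from_str(raw)?;
    assert_eq!(values.last_executed_l1_batch_number, 1170);
    println!("{:?}", values);
    Ok(())
}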

Testing the revert

First, stop zksync_server/zksync_external_node.

Contract rollback

Revert from 1171 to 1170:

cd $ZKSYNC_HOME && zk f cargo run --bin block_reverter --release -- send-eth-transaction --l1-batch-number 1170 --nonce 3871  --priority-fee-per-gas 1000000000

This sends a revertBlocks transaction to the contract; there is no need to call the contract manually.
Query again:

zk f cargo run --bin block_reverter --release -- print-suggested-values --json

Now the latest on-chain height is committed: 1170.

Local database rollback

Note: zksync_server/zksync_external_node must be stopped first; otherwise this fails on ./db/main/tree/LOCK.

cd $ZKSYNC_HOME && zk f cargo run --bin block_reverter --release -- rollback-db --l1-batch-number 1170 --rollback-postgres --rollback-tree --rollback-sk-cache

Additional notes

core/bin/zksync_core/src/bin/block_reverter.rs
Command

  • print-suggested-values
    Prints the suggested values to use.
  • send-eth-transaction
    Sends the revert transaction to L1.
  • rollback-db
    Reverts the internal database state to the previous block.
  • clear-failed-transactions
    Clears failed L1 transactions.

Test case

core/tests/revert-test/tests/revert-and-restart.test.ts
const executedProcess = await utils.exec(
    'cd $ZKSYNC_HOME && ' +
        'RUST_LOG=off cargo run --bin block_reverter --release -- print-suggested-values --json'
    // ^ Switch off logs to not pollute the output JSON
);
const suggestedValuesOutput = executedProcess.stdout;
const { lastL1BatchNumber, nonce, priorityFee } = parseSuggestedValues(suggestedValuesOutput);
expect(lastL1BatchNumber < blocksCommittedBeforeRevert, 'There should be at least one block for revert').to.be
    .true;

console.log(
    `Reverting with parameters: last unreverted L1 batch number: ${lastL1BatchNumber}, nonce: ${nonce}, priorityFee: ${priorityFee}`
);

console.log('Sending ETH transaction..');
await utils.spawn(
    `cd $ZKSYNC_HOME && cargo run --bin block_reverter --release -- send-eth-transaction --l1-batch-number ${lastL1BatchNumber} --nonce ${nonce} --priority-fee-per-gas ${priorityFee}`
);

console.log('Rolling back DB..');
await utils.spawn(
    `cd $ZKSYNC_HOME && cargo run --bin block_reverter --release -- rollback-db --l1-batch-number ${lastL1BatchNumber} --rollback-postgres --rollback-tree --rollback-sk-cache`
);

let blocksCommitted = await mainContract.getTotalBlocksCommitted();
expect(blocksCommitted.eq(lastL1BatchNumber), 'Revert on contract was unsuccessful').to.be.true;

After a zkSync block revert, the external node errors on startup

Background

After the zkSync block revert tested above, another node reported the following error while syncing:

2023-09-14T12:46:52.271426230Z 2023-09-14T12:46:52.271274Z  INFO zksync_core::sync_layer::fetcher: New batch: 3205. Timestamp: 1694603829
2023-09-14T12:46:52.271447030Z 2023-09-14T12:46:52.271296Z  INFO zksync_core::sync_layer::fetcher: New miniblock: 6891 / 8176
2023-09-14T12:46:52.385460894Z 2023-09-14T12:46:52.385324Z  INFO zksync_core::consistency_checker: Batch 3195 is consistent with L1
2023-09-14T12:46:52.408615713Z 2023-09-14T12:46:52.408488Z  INFO zksync_core::consistency_checker: Checking commit tx 0xa07f…353d for batch 3196
2023-09-14T12:46:52.526929804Z 2023-09-14T12:46:52.526783Z  INFO zksync_core::sync_layer::fetcher: New miniblock: 6892 / 8176
2023-09-14T12:46:52.726526535Z thread 'tokio-runtime-worker' panicked at 'index out of bounds: the len is 1 but the index is 1', /usr/src/zksync/core/bin/zksync_core/src/consistency_checker/mod.rs:111:27
2023-09-14T12:46:52.726566126Z stack backtrace:
2023-09-14T12:46:52.744782978Z    0:     0x5637f01373ea - std::backtrace_rs::backtrace::libunwind::trace::h79937bc171ada62c
2023-09-14T12:46:52.744804819Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
2023-09-14T12:46:52.744811549Z    1:     0x5637f01373ea - std::backtrace_rs::backtrace::trace_unsynchronized::h2292bca8571cb919
2023-09-14T12:46:52.744816889Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2023-09-14T12:46:52.744822060Z    2:     0x5637f01373ea - std::sys_common::backtrace::_print_fmt::h9c461f248e4ae90d
2023-09-14T12:46:52.744827090Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/sys_common/backtrace.rs:65:5
2023-09-14T12:46:52.744832010Z    3:     0x5637f01373ea - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::he9fe6bf1a39182e1
2023-09-14T12:46:52.744837660Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/sys_common/backtrace.rs:44:22
2023-09-14T12:46:52.746411727Z    4:     0x5637f015caee - core::fmt::write::h032658c119c720d7
2023-09-14T12:46:52.746431598Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/fmt/mod.rs:1208:17
2023-09-14T12:46:52.746438008Z    5:     0x5637f0131875 - std::io::Write::write_fmt::h299fc90dfae41c0d
2023-09-14T12:46:52.746443078Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/io/mod.rs:1682:15
2023-09-14T12:46:52.746448138Z    6:     0x5637f01371b5 - std::sys_common::backtrace::_print::heb70d25df9937e3f
2023-09-14T12:46:52.746453018Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/sys_common/backtrace.rs:47:5
2023-09-14T12:46:52.746473369Z    7:     0x5637f01371b5 - std::sys_common::backtrace::print::had745c0a76b8b521
2023-09-14T12:46:52.746492989Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/sys_common/backtrace.rs:34:9
2023-09-14T12:46:52.746513250Z    8:     0x5637f0138e0f - std::panicking::default_hook::{{closure}}::h1ea782cdfa2fd097
2023-09-14T12:46:52.746530751Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:267:22
2023-09-14T12:46:52.746567952Z    9:     0x5637f0138b4b - std::panicking::default_hook::h1cc3af63455a163c
2023-09-14T12:46:52.746574652Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:286:9
2023-09-14T12:46:52.746619113Z   10:     0x5637f013951c - std::panicking::rust_panic_with_hook::h5cafdc4b3bfd5528
2023-09-14T12:46:52.746638164Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:688:13
2023-09-14T12:46:52.746644624Z   11:     0x5637f01392b9 - std::panicking::begin_panic_handler::{{closure}}::hf31c60f40775892c
2023-09-14T12:46:52.746649704Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:579:13
2023-09-14T12:46:52.746690595Z   12:     0x5637f013789c - std::sys_common::backtrace::__rust_end_short_backtrace::h28a5c7be595826cd
2023-09-14T12:46:52.746696666Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/sys_common/backtrace.rs:137:18
2023-09-14T12:46:52.746702006Z   13:     0x5637f0138fc2 - rust_begin_unwind
2023-09-14T12:46:52.746712406Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:575:5
2023-09-14T12:46:52.746735127Z   14:     0x5637ee6f3ba3 - core::panicking::panic_fmt::h8fa27a0b37dd98b7
2023-09-14T12:46:52.746740467Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:64:14
2023-09-14T12:46:52.746760237Z   15:     0x5637ee6f3cf2 - core::panicking::panic_bounds_check::hd27fa6e100ea4568
2023-09-14T12:46:52.746792558Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:147:5
2023-09-14T12:46:52.746802909Z   16:     0x5637eead1f9a - zksync_core::consistency_checker::ConsistencyChecker::run::{{closure}}::h449bcdaefdc936fd
2023-09-14T12:46:52.746839150Z   17:     0x5637eeb35af5 - tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut::h91ab53e454932f64
2023-09-14T12:46:52.746864871Z   18:     0x5637eebe0dbd - tokio::runtime::task::core::Core<T,S>::poll::h795cb0b9385c1329
2023-09-14T12:46:52.746885391Z   19:     0x5637eeb4df5e - tokio::runtime::task::harness::Harness<T,S>::poll::hc2ab6dba6d5ba4ae
2023-09-14T12:46:52.746909382Z   20:     0x5637f00eefee - tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h6fb307488dc375ee
2023-09-14T12:46:52.746937683Z   21:     0x5637f00ee143 - tokio::runtime::scheduler::multi_thread::worker::Context::run::h175658cfae89590d
2023-09-14T12:46:52.746969794Z   22:     0x5637f00da9b9 - tokio::macros::scoped_tls::ScopedKey<T>::set::h2b771be14cbef94d
2023-09-14T12:46:52.746992634Z   23:     0x5637f00ede49 - tokio::runtime::scheduler::multi_thread::worker::run::ha39c9ec4dce0c89d
2023-09-14T12:46:52.747062136Z   24:     0x5637f00f8158 - tokio::runtime::task::core::Core<T,S>::poll::h6feb9b8dd5ca027e
2023-09-14T12:46:52.747090037Z   25:     0x5637f00ce6ff - tokio::runtime::task::harness::Harness<T,S>::poll::hb45689cdecd9b901
2023-09-14T12:46:52.747152079Z   26:     0x5637f00dc338 - tokio::runtime::blocking::pool::Inner::run::hc09fcd48a7633fbf
2023-09-14T12:46:52.747159409Z   27:     0x5637f00dd82a - std::sys_common::backtrace::__rust_begin_short_backtrace::hf8e4ebb56e2acb86
2023-09-14T12:46:52.747167950Z   28:     0x5637f00f050b - core::ops::function::FnOnce::call_once{{vtable.shim}}::h4834ddb4b804c368
2023-09-14T12:46:52.747235692Z   29:     0x5637f013c093 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hb77d8d72ebcf79c4
2023-09-14T12:46:52.747246432Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/boxed.rs:2000:9
2023-09-14T12:46:52.747251792Z   30:     0x5637f013c093 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hc08c3353e1568487
2023-09-14T12:46:52.747270633Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/boxed.rs:2000:9
2023-09-14T12:46:52.747282563Z   31:     0x5637f013c093 - std::sys::unix::thread::Thread::new::thread_start::h7168e596cd5e5ce6
2023-09-14T12:46:52.747294573Z                                at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/sys/unix/thread.rs:108:17
2023-09-14T12:46:52.747489009Z   32:     0x7f5cddf78fa3 - start_thread
2023-09-14T12:46:52.747671095Z   33:     0x7f5cddd2006f - clone
2023-09-14T12:46:52.747677365Z   34:                0x0 - <unknown>
2023-09-14T12:46:52.764575588Z 2023-09-14T12:46:52.764414Z  INFO zksync_core::sync_layer::fetcher: New batch: 3206. Timestamp: 1694603945
2023-09-14T12:46:52.764596038Z 2023-09-14T12:46:52.764439Z  INFO zksync_core::sync_layer::fetcher: New miniblock: 6893 / 8176
2023-09-14T12:46:53.009199118Z 2023-09-14T12:46:53.009043Z  INFO zksync_core::sync_layer::fetcher: New miniblock: 6894 / 8176
2023-09-14T12:46:53.066490673Z 2023-09-14T12:46:53.066347Z  INFO zksync_core::sync_layer::batch_status_updater: Batch 3202: committed
2023-09-14T12:46:53.066924736Z 2023-09-14T12:46:53.066789Z  INFO zksync_core::sync_layer::batch_status_updater: Commit status change: number 3202, hash 0xa07f…353d, happened at 2023-09-07 10:46:55.753857 UTC
2023-09-14T12:46:53.109595076Z 2023-09-14T12:46:53.109454Z  INFO zksync_core::sync_layer::external_io: Sealing the batch
2023-09-14T12:46:53.109620647Z 2023-09-14T12:46:53.109471Z DEBUG zksync_core::state_keeper::keeper: L1 batch #3203 should be sealed unconditionally as per sealing rules
2023-09-14T12:46:53.113694988Z 2023-09-14T12:46:53.113561Z  INFO zksync_core::state_keeper::io::seal_logic: Sealing miniblock 6888 (L1 batch 3203) with 0 (0 L2 + 0 L1) txs, 1 events, 13 reads, 4 writes
2023-09-14T12:46:53.157928775Z 2023-09-14T12:46:53.157781Z DEBUG zksync_core::state_keeper::io::seal_logic: miniblock execution stage insert_storage_logs took 42.366601ms with count Some(4)
2023-09-14T12:46:53.161347507Z 2023-09-14T12:46:53.161215Z DEBUG zksync_core::state_keeper::io::seal_logic: sealed miniblock 6888 in 47.653779ms
2023-09-14T12:46:53.161386708Z 2023-09-14T12:46:53.161237Z DEBUG zksync_core::state_keeper::io::seal_logic: L1 batch execution stage fictive_miniblock took 47.68931ms with count None
2023-09-14T12:46:53.161400299Z 2023-09-14T12:46:53.161341Z  INFO zksync_core::state_keeper::io::seal_logic: Sealing L1 batch 3203 with 1 (1 L2 + 0 L1) txs, 1 l2_l1_logs, 4 events, 90 reads (21 deduped), 18 writes (5 deduped)
2023-09-14T12:46:53.209996525Z 2023-09-14T12:46:53.209850Z DEBUG zksync_core::state_keeper::io::seal_logic: L1 batch execution stage insert_protective_reads took 44.407542ms with count Some(21)
2023-09-14T12:46:53.211452098Z 2023-09-14T12:46:53.211323Z DEBUG zksync_core::state_keeper::io::seal_logic: sealed l1 batch 3203 in 98.10967ms
2023-09-14T12:46:53.211476579Z 2023-09-14T12:46:53.211362Z  INFO zksync_core::sync_layer::external_io: Batch 3203 is sealed
2023-09-14T12:46:53.211484819Z 2023-09-14T12:46:53.211377Z DEBUG zksync_core::sync_layer::external_io: Waiting for the new batch params
2023-09-14T12:46:53.211490459Z 2023-09-14T12:46:53.211385Z  INFO zksync_core::sync_layer::external_io: Getting previous L1 batch hash
2023-09-14T12:46:53.240026509Z 2023-09-14T12:46:53.239805Z  INFO zksync_core::sync_layer::fetcher: New batch: 3207. Timestamp: 1694604066
2023-09-14T12:46:53.240053310Z 2023-09-14T12:46:53.239824Z  INFO zksync_core::sync_layer::fetcher: New miniblock: 6895 / 8176
2023-09-14T12:46:53.263598861Z 2023-09-14T12:46:53.263459Z  INFO zksync_core::metadata_calculator::updater: Loading blocks with numbers 3203..=3203 to update Merkle tree
2023-09-14T12:46:53.267278250Z 2023-09-14T12:46:53.267144Z  INFO zksync_core::metadata_calculator::updater: Processing L1 batches #3203..=3203 with 26 total logs
2023-09-14T12:46:53.267494317Z 2023-09-14T12:46:53.267361Z  INFO zksync_merkle_tree::domain: Extending Merkle tree with batch #3203 with 26 ops in full mode
2023-09-14T12:46:53.273107584Z 2023-09-14T12:46:53.272966Z  INFO zksync_merkle_tree::domain: Processed batch #3203; root hash is 0x7f78…d4a0, 9078 leaves in total, 1 initial writes, 4 repeated writes
2023-09-14T12:46:53.275036721Z 2023-09-14T12:46:53.274902Z  INFO zksync_core::metadata_calculator::updater: Saved witnesses for L1 batch #3203 to object storage at `merkel_tree_paths_3203.bin`
2023-09-14T12:46:53.276841885Z 2023-09-14T12:46:53.276649Z  INFO zksync_core::metadata_calculator::updater: Updated metadata for L1 batch #3203 in Postgres
2023-09-14T12:46:53.276882016Z 2023-09-14T12:46:53.276733Z  INFO zksync_merkle_tree::domain: Flushing L1 batches #[3203] to RocksDB
2023-09-14T12:46:53.277228036Z 2023-09-14T12:46:53.277107Z  INFO zksync_core::metadata_calculator::metrics: L1 batches #3203..=3203 processed in tree
2023-09-14T12:46:53.314666571Z 2023-09-14T12:46:53.314511Z  INFO zksync_core::sync_layer::external_io: Previous L1 batch hash: 57655874197922256061346220318188548076559386516802029495849130083054353765536
2023-09-14T12:46:53.340176950Z 2023-09-14T12:46:53.340044Z DEBUG zksync_state::rocksdb: loading storage for l1 batch number 3203
2023-09-14T12:46:53.340206721Z 2023-09-14T12:46:53.340091Z DEBUG zksync_state::rocksdb: loading state changes for l1 batch 3203
2023-09-14T12:46:53.341542481Z 2023-09-14T12:46:53.341421Z DEBUG zksync_state::rocksdb: loading factory deps for l1 batch 3203
2023-09-14T12:46:53.342740826Z 2023-09-14T12:46:53.342505Z  INFO zksync_core::state_keeper::batch_executor: Secondary storage for batch 3204 initialized, size is 9101
2023-09-14T12:46:53.342764707Z 2023-09-14T12:46:53.342543Z DEBUG zksync_core::sync_layer::external_io: Waiting for the new tx, next action is Some(Tx(Transaction(0x5f4f2a2fb69447e75db65b222e50f30a0b439c882223ce6e890a8cf5c46d55ff)))
2023-09-14T12:46:53.342772307Z 2023-09-14T12:46:53.342578Z  INFO zksync_core::state_keeper::batch_executor: Starting executing batch #3204
2023-09-14T12:46:53.352470206Z 2023-09-14T12:46:53.352342Z  INFO zksync_core::sync_layer::external_io: Sealing miniblock
2023-09-14T12:46:53.352495827Z 2023-09-14T12:46:53.352360Z DEBUG zksync_core::state_keeper::keeper: Miniblock #6889 (L1 batch #3204) should be sealed as per sealing rules
2023-09-14T12:46:53.355504267Z 2023-09-14T12:46:53.355373Z  INFO zksync_core::state_keeper::io::seal_logic: Sealing miniblock 6889 (L1 batch 3204) with 1 (1 L2 + 0 L1) txs, 3 events, 77 reads, 14 writes
2023-09-14T12:46:53.401962459Z 2023-09-14T12:46:53.401823Z DEBUG zksync_core::state_keeper::io::seal_logic: miniblock execution stage insert_storage_logs took 42.33107ms with count Some(14)
2023-09-14T12:46:53.405522575Z 2023-09-14T12:46:53.405396Z DEBUG zksync_core::state_keeper::io::seal_logic: sealed miniblock 6889 in 50.024698ms
2023-09-14T12:46:53.406345130Z 2023-09-14T12:46:53.406143Z  INFO zksync_core::sync_layer::external_io: Miniblock 6890 is sealed
2023-09-14T12:46:53.406393921Z 2023-09-14T12:46:53.406163Z DEBUG zksync_core::state_keeper::keeper: Initialized new miniblock #6890 (L1 batch #3204) with timestamp 2023-09-13 11:16:06 UTC
2023-09-14T12:46:53.406401451Z 2023-09-14T12:46:53.406180Z DEBUG zksync_core::sync_layer::external_io: Waiting for the new tx, next action is Some(SealBatch)
2023-09-14T12:46:53.472626763Z 2023-09-14T12:46:53.472431Z  INFO zksync_core::sync_layer::fetcher: New miniblock: 6896 / 8176
2023-09-14T12:46:53.708788983Z 2023-09-14T12:46:53.708623Z  INFO zksync_core::sync_layer::fetcher: New batch: 3208. Timestamp: 1694604126
2023-09-14T12:46:53.708815224Z 2023-09-14T12:46:53.708642Z  INFO zksync_core::sync_layer::fetcher: New miniblock: 6897 / 8176
2023-09-14T12:46:53.903700525Z 2023-09-14T12:46:53.903557Z  INFO zksync_external_node: initialized ETH-TxManager in 4.725550326s
2023-09-14T12:46:53.913371443Z 2023-09-14T12:46:53.913245Z  INFO zksync_external_node: initializing Opside send plug
2023-09-14T12:46:53.936266844Z 2023-09-14T12:46:53.936049Z  INFO zksync_external_node: initialized Opside send plug4.758046524s
2023-09-14T12:46:53.936291305Z 2023-09-14T12:46:53.936056Z  INFO zksync_external_node: initializing Opside manage plug
2023-09-14T12:46:53.936298395Z 2023-09-14T12:46:53.936070Z  INFO zksync_core::eth_sender::opside_send_plug::send_proof_plug: start handle_caches_event loop
2023-09-14T12:46:53.936304385Z 2023-09-14T12:46:53.936109Z  INFO zksync_core::eth_sender::opside_send_plug::send_proof_plug: start handle_monit_tx_event_loop loop
2023-09-14T12:46:53.936309966Z 2023-09-14T12:46:53.936112Z  INFO zksync_core::eth_sender::opside_send_plug::send_proof_plug: handle_chans_event: start handle_chans_event loop
2023-09-14T12:46:53.943476709Z 2023-09-14T12:46:53.943336Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: manage proof start try_fetch_proof_to_send
2023-09-14T12:46:53.943501999Z 2023-09-14T12:46:53.943373Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: manage proof: process_resend wait for start signal
2023-09-14T12:46:53.943512100Z 2023-09-14T12:46:53.943444Z  INFO zksync_core::eth_sender::opside_send_plug::send_proof_plug: start into handle history proof txs
2023-09-14T12:46:53.944207140Z 2023-09-14T12:46:53.943940Z  INFO zksync_utils::wait_for_tasks: One of the tokio actors unexpectedly finished with error: index out of bounds: the len is 1 but the index is 1
2023-09-14T12:46:53.944242241Z 2023-09-14T12:46:53.943968Z  INFO zksync_storage::db: Waiting for all the RocksDB instances to be dropped, 2 remaining
2023-09-14T12:46:53.944249652Z 2023-09-14T12:46:53.943986Z  INFO zksync_core::metadata_calculator::updater: Stop signal received, metadata_calculator is shutting down
2023-09-14T12:46:53.944290583Z 2023-09-14T12:46:53.944210Z  INFO zksync_core::api_server::web3: Stop signal received, web3 HTTP JSON RPC API is shutting down
2023-09-14T12:46:53.944302103Z 2023-09-14T12:46:53.944224Z  INFO zksync_core::api_server::web3: Stop signal received, WS JSON RPC API is shutting down
2023-09-14T12:46:53.945073626Z 2023-09-14T12:46:53.944947Z  INFO zksync_storage::db: Waiting for all the RocksDB instances to be dropped, 1 remaining
2023-09-14T12:46:53.945101567Z 2023-09-14T12:46:53.945013Z  INFO zksync_core::sync_layer::fetcher: New miniblock: 6898 / 8176
2023-09-14T12:46:53.945111027Z 2023-09-14T12:46:53.945033Z  INFO zksync_core::sync_layer::fetcher: Stop signal received, exiting the fetcher routine
2023-09-14T12:46:53.949720505Z 2023-09-14T12:46:53.949609Z  INFO zksync_core::eth_sender::opside_send_plug::send_proof_plug: finish handle history proof txs
2023-09-14T12:46:53.949748195Z 2023-09-14T12:46:53.949638Z  INFO zksync_core::eth_sender::opside_send_plug::send_proof_plug: Stop signal received, eth_tx_aggregator is shutting down
2023-09-14T12:46:54.135835514Z 2023-09-14T12:46:54.135694Z  INFO zksync_core::api_server::web3::pubsub_notifier: Stop signal received, pubsub_tx_notifier is shutting down
2023-09-14T12:46:54.136480683Z 2023-09-14T12:46:54.136390Z  INFO zksync_core::api_server::web3::pubsub_notifier: Stop signal received, pubsub_block_notifier is shutting down
2023-09-14T12:46:54.141903295Z 2023-09-14T12:46:54.141774Z  INFO zksync_core::api_server::web3::pubsub_notifier: Stop signal received, pubsub_logs_notifier is shutting down
2023-09-14T12:46:54.218539427Z 2023-09-14T12:46:54.218338Z  INFO zksync_core::eth_sender::eth_tx_manager: Stop signal received, eth_tx_manager is shutting down
2023-09-14T12:46:54.358294949Z 2023-09-14T12:46:54.358130Z  INFO zksync_core::l1_gas_price::main_node_fetcher: Stop signal received, MainNodeGasPriceFetcher is shutting down
2023-09-14T12:46:54.417717566Z 2023-09-14T12:46:54.417570Z  INFO zksync_core::state_keeper::keeper: Stop signal received, state keeper is shutting down
2023-09-14T12:46:54.417754358Z 2023-09-14T12:46:54.417607Z  INFO zksync_core::state_keeper::batch_executor: State keeper exited with an unfinished batch
2023-09-14T12:46:54.420950213Z 2023-09-14T12:46:54.420815Z  INFO zksync_storage::db: All the RocksDB instances are dropped
2023-09-14T12:46:55.500606305Z 2023-09-14T12:46:55.500432Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: get proofBatchNumFinal == 0 &&  proofHashBatchNumFinal == 0 not do commit pending proof
2023-09-14T12:46:55.500631526Z 2023-09-14T12:46:55.500460Z  INFO zksync_core::eth_sender::opside_send_plug::manage_proof_plug: manage proof start try_fetch_proof_to_

Key error message

2023-09-14T12:46:52.726526535Z thread 'tokio-runtime-worker' panicked at 'index out of bounds: the len is 1 but the index is 1', /usr/src/zksync/core/bin/zksync_core/src/consistency_checker/mod.rs:111:27

Relevant code analysis

    async fn check_commitments(&self, batch_number: L1BatchNumber) -> Result<bool, error::Error> {
        let mut storage = self.db.access_storage().await;

        let storage_block = storage
            .blocks_dal()
            .get_storage_block(batch_number)
            .await
            .unwrap_or_else(|| panic!("Block {} not found in the database", batch_number));

        let commit_tx_id = storage_block
            .eth_commit_tx_id
            .unwrap_or_else(|| panic!("Block commit tx not found for block {}", batch_number))
            as u32;

        let block_metadata = storage
            .blocks_dal()
            .get_block_with_metadata(storage_block)
            .await
            .unwrap_or_else(|| {
                panic!(
                    "Block metadata for block {} not found in the database",
                    batch_number
                )
            });

        let commit_tx_hash = storage
            .eth_sender_dal()
            .get_confirmed_tx_hash_by_eth_tx_id(commit_tx_id)
            .await
            .unwrap_or_else(|| {
                panic!(
                    "Commit tx hash not found in the database. Commit tx id: {}",
                    commit_tx_id
                )
            });

        vlog::info!(
            "Checking commit tx {} for batch {}",
            commit_tx_hash,
            batch_number.0
        );

        // we can't get tx calldata from db because it can be fake
        let commit_tx = self
            .web3
            .eth()
            .transaction(TransactionId::Hash(commit_tx_hash))
            .await?
            .expect("Commit tx not found on L1");

        let commit_tx_status = self
            .web3
            .eth()
            .transaction_receipt(commit_tx_hash)
            .await?
            .expect("Commit tx receipt not found on L1")
            .status;

        assert_eq!(
            commit_tx_status,
            Some(1.into()),
            "Main node gave us a failed commit tx"
        );

        let commitments = self
            .contract
            .function("commitBlocks")
            .unwrap()
            .decode_input(&commit_tx.input.0[4..])
            .unwrap()
            .pop()
            .unwrap()
            .into_array()
            .unwrap();

        // Commit transactions usually publish multiple commitments at once, so we need to find
        // the one that corresponds to the batch we're checking.
        let first_batch_number = match &commitments[0] {
            ethabi::Token::Tuple(tuple) => tuple[0].clone().into_uint().unwrap().as_usize(),
            _ => panic!("ABI does not match the commitBlocks() function on the zkSync contract"),
        };
        let commitment = &commitments[batch_number.0 as usize - first_batch_number];

        Ok(commitment == &block_metadata.l1_commit_data())
    }

check_commitments, given the specified L1BatchNumber, mainly does the following:

  • fetches the corresponding Block record from l1_batches and reads its eth_commit_tx_id field -> commit_tx_id
  • assembles block_metadata from factory_deps and related data
  • looks up commit_tx_id in the eth_txs_history table -> commit_tx_hash -> 0xa07f1bfcb6826c3db3b7d6cc3d69db47e3ceb17008c15dcd3bd1e7ceb10b353d
  • queries commit_tx_hash via the eth_getTransactionByHash RPC -> commit_tx
  • queries commit_tx_hash via the eth_getTransactionReceipt RPC -> commit_tx_status
  • checks that commit_tx_status indicates success
  • decodes commit_tx.input with the commitBlocks ABI -> commitments
  • reads commitments[0] -> first_batch_number
  • indexes into commitments at offset batch_number - first_batch_number

At this point it panics: 'index out of bounds: the len is 1 but the index is 1', i.e. the computed offset is 1.
Since batch_number == 3196, first_batch_number must therefore be 3195, yet commitments contains only a single element, so the lookup at offset 1 falls out of bounds.
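
The lookup would be easier to diagnose with a bounds-checked access. A self-contained sketch follows (u64 stands in for the real ethabi::Token values; the diagnostic message assumes, plausibly after the revert performed earlier, that the local DB still references a stale pre-revert commit tx):

// Bounds-checked stand-in for `&commitments[batch_number - first_batch_number]`.
fn find_commitment(commitments: &[u64], batch_number: usize, first_batch_number: usize) -> Option<&u64> {
    batch_number
        .checked_sub(first_batch_number) // also guards against underflow
        .and_then(|index| commitments.get(index))
}

fn main() {
    // Reproduces the failing case: one commitment, first_batch_number = 3195,
    // batch 3196 requested -> index 1 is out of bounds. Instead of panicking
    // we get None and can surface a clear error.
    let commitments = vec![0xa07f_u64];
    match find_commitment(&commitments, 3196, 3195) {
        Some(c) => println!("found commitment {:#x}", c),
        None => eprintln!(
            "commit tx covers {} batch(es) starting at 3195, but batch 3196 was requested; \
             the local DB may still reference a stale (pre-revert) commit tx",
            commitments.len()
        ),
    }
}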

// TODO

zkSync: optimizing L1 transaction submission and processing

Problem description

For a long time, no proof data had been submitted to L1. After that issue was resolved, we found that the sequencer sends commitBlocks far faster than proof hashes and proveBlocks are submitted, so the gap between last_verified_batch and sequenced_batch keeps widening.

Root-cause analysis

Because submitting proof hashes and proveBlocks involves a transaction-confirmation waiting period, first review the transaction status handling logic, focusing on how to speed up each stage and how to lower the rate at which the pressure-generating data (commitBlocks) is produced, so that the pipeline reaches a stable equilibrium.

Lowering the rate at which pressure data (commitBlocks) is produced

Address this first to slow the growth of the sequencer's submission backlog:

CHAIN_STATE_KEEPER_TRANSACTION_SLOTS=500
CHAIN_STATE_KEEPER_BLOCK_COMMIT_DEADLINE_MS=300000

With these settings, a commit is sent once every 5 minutes (300000 ms).
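
A minimal sketch of what these two variables control, under the assumption that a batch is sealed either when its transaction slots fill up or when the commit deadline elapses, whichever comes first (the real state keeper has additional sealing criteria):

use std::time::Duration;

// Illustrative model of the two sealing triggers configured above.
const TRANSACTION_SLOTS: usize = 500; // CHAIN_STATE_KEEPER_TRANSACTION_SLOTS
const BLOCK_COMMIT_DEADLINE: Duration = Duration::from_millis(300_000); // CHAIN_STATE_KEEPER_BLOCK_COMMIT_DEADLINE_MS

fn should_seal_batch(txs_in_batch: usize, batch_age: Duration) -> bool {
    txs_in_batch >= TRANSACTION_SLOTS || batch_age >= BLOCK_COMMIT_DEADLINE
}

fn main() {
    // Under light load the 5-minute deadline seals the batch, capping
    // commitBlocks at one every 5 minutes.
    assert!(should_seal_batch(3, Duration::from_secs(300)));
    assert!(!should_seal_batch(3, Duration::from_secs(60)));
    // Under heavy load the batch seals as soon as the 500 slots fill.
    assert!(should_seal_batch(500, Duration::from_secs(12)));
}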