
Created: May 6, 2026

Answered with GPT-5.5 Thinking

Question

LMCAN_v2.pdf

附件是根据审稿人的意见做出修改的版本,审稿人主要意见是创新点不足,提出的组件之间缺少关联性。为了说明论文提出的是一个完整的框架,补充了图1和图2,但是图1和图2看起来并不能说明问题所在。如何串联创新点,在论文哪些段落需要做出修改,写出修改后的英文段落,摘要和贡献点的总结必须重写。还有哪些部分必须修改


Answer

I have read the revised manuscript in the attachment. It now covers the 1×1 pixel token, multiscale convolution, Head-Domain Attention, CAB, and the lightweight complexity analysis, but the main problem remains: the novelties still read like a parallel stack rather than a complete framework derived from a single underlying conflict. The Abstract, the Introduction contributions, the final paragraph of Related Work, the Motivation, the opening of the Methodology, and the Ablation all need to change together; adding Fig. 1 and Fig. 2 alone is not enough. In the attachment, Fig. 1 mainly compares conventional MHA with LMCAN, Fig. 2 mainly illustrates the spectral-fidelity advantage of 1×1 tokens, Fig. 3 is the overall network architecture, and Fig. 4 is the CAB mechanism; consequently, Fig. 1 and Fig. 2 cannot by themselves answer the reviewer's question of why these components must be combined into one framework. (LMCAN_v2.pdf)

I. Recommended main thread for linking the novelties

Do not present the novelty as:

multiscale convolution + head-domain attention + center attention + lightweight

but rather as:

HS–MS fusion requires pixel-level spectral fidelity. Therefore, LMCAN keeps 1×1 pixel tokens instead of patch merging. However, pixel tokens reduce the effective receptive field under window attention. To resolve this spectral-fidelity/spatial-context trade-off, LMCAN first injects multiscale receptive fields before attention and assigns them to different heads. Since these scale-aware heads would otherwise remain independent in conventional MHA, Head-Domain Attention is introduced to model cross-scale head correlations. Finally, because window attention still misses neighborhoods across window boundaries, CAB performs center-based local compensation.

That is:

1×1 pixel tokens preserve spectra → the receptive field becomes insufficient → pre-attention multiscale heads enlarge the receptive field → HAB makes the multiscale heads interact → CAB repairs the neighborhoods missing at window boundaries → together forming a unified shallow multiscale attention framework for HS–MS fusion.

This logic must appear consistently and repeatedly across the Abstract, Introduction, Related Work, Motivation, Methodology, and Ablation. The current manuscript already has a "not a simple combination" paragraph after the contributions in the Introduction, but it reads like an assertion and lacks the closed loop of problem → derivation → corresponding design → evidence. (LMCAN_v2.pdf)


II. Paragraphs that must be revised

| Location | Why it must change | Revision direction |
| --- | --- | --- |
| Abstract | The abstract now lists the modules, but why they are interrelated is still understated | Reframe around one core conflict: spectral fidelity vs. spatial context |
| Last two paragraphs of the Introduction | The contributions read as a list of modules | Rewrite as "one framework, three coupled designs" |
| Contributions | Must be rewritten | Each contribution should state which problem raised by the previous design it resolves |
| End of Related Work B | A gap is stated, but the wording is generic and "some studies…" is not pointed enough | Make explicit that existing methods lack a unified "pixel-token-preserving + cross-scale-head interaction + boundary compensation" design |
| Motivation | The derivation after Fig. 2 goes in the right direction, but Fig. 2 must be tied explicitly to the three modules | Write the Motivation as design principles, not separate explanations of the 1×1 token and CAB |
| Opening of Methodology A | Saying the network "consists of three components" invites the reviewer to read it as an assembly | Change to "the architecture implements the above design principle" |
| Openings of the MMH / CAB subsections | The interfaces must be explicit: MSC produces scale-aware heads, HAB handles head-domain communication, CAB is boundary compensation after MMH | Do not present CAB as an independent patch |
| Fig. 1 / Fig. 2 captions and in-text references | The current figures cannot adequately convey the complete framework | Make Fig. 1 a "problem-to-solution framework" and Fig. 2 quantitative support for spectral fidelity |
| Ablation Study | The current ablation gaps are very small, so the reviewer may find the contribution of each novelty insufficient | Add cumulative ablation, cross-scale/head-affinity visualization, and boundary-region metrics |
| Conclusion | Still a list of modules | Restate the unified framework and the closed design loop |

III. English paragraphs ready for direct substitution

1. Rewritten Abstract

This can directly replace the current abstract. The current abstract mentions scale-specific features, Head-Domain Attention, and CAB, but does not yet frame them explicitly as a unified framework driven by one underlying conflict. (LMCAN_v2.pdf)

Abstract— Hyperspectral–multispectral (HS–MS) image fusion aims to reconstruct a high-resolution hyperspectral image by combining the rich spectral signatures of low-resolution hyperspectral observations with the fine spatial structures of high-resolution multispectral images. A key challenge in Transformer-based fusion is the trade-off between spectral fidelity and spatial context modeling: patch merging and deep hierarchical designs enlarge the receptive field but may mix pixel-wise spectral signatures and increase computational cost, whereas 1 × 1 pixel tokens preserve spectra but limit contextual modeling within window attention. To address this trade-off, we propose LMCAN, a Lightweight Multi-Scale Centralized Attention Network that formulates HS–MS fusion as a unified pixel-token-preserving multiscale attention framework. Specifically, LMCAN first introduces pre-attention multiscale convolution to generate scale-specific features while maintaining pixel-level spectral representations. These features are explicitly assigned to different attention heads, so that each head models dependencies under a distinct receptive field. To avoid independent processing among scale-aware heads, a Head-Domain Attention module is further developed to learn adaptive head-to-head correlations and enable cross-scale communication within a shallow attention layer. Finally, a Center Attention Block is designed as a boundary-compensation mechanism to recover neighborhood interactions missed by fixed window partitioning. The proposed design avoids hierarchical patch merging, preserves spectral integrity, and achieves multiscale spatial–spectral interaction with low computational complexity. Experiments on CAVE, Harvard, Chikusei, and Pavia datasets demonstrate that LMCAN achieves superior reconstruction accuracy and efficiency compared with representative conventional, CNN-based, and Transformer-based fusion methods.


2. Rewritten Introduction problem paragraph

Suggested replacement for the Introduction paragraph spanning "However, most Transformer-based fusion models…" through "To overcome these challenges…". The current text already identifies deep stacking, patch merging, cross-scale interaction, and window-boundary problems, but the causal links among them are still too weak. (LMCAN_v2.pdf)

Existing Transformer-based HS–MS fusion methods usually enlarge the receptive field by deep stacking, shifted-window operations, or hierarchical patch merging. Although these strategies are effective for natural image restoration, they are not fully aligned with the spectral-fidelity requirement of HS–MS fusion. In hyperspectral data, each spatial pixel corresponds to a high-dimensional spectral signature; therefore, merging neighboring pixels into larger tokens may introduce artificial spectral mixing. A straightforward solution is to preserve 1 × 1 pixel tokens, but this choice narrows the effective receptive field of window-based attention and weakens multiscale spatial context modeling. This reveals a fundamental dilemma: HS–MS fusion requires pixel-level spectral preservation and multiscale spatial interaction simultaneously, while conventional hierarchical Transformers usually improve one at the expense of the other. Moreover, standard multi-head attention treats different heads as independent subspaces and fuses them only by concatenation and projection, which is insufficient for explicit cross-scale communication. Fixed window partitioning further causes missing neighborhood interactions near window boundaries, which is harmful to edge and texture reconstruction.


3. Rewritten transition paragraph before contributions

Place this before the contributions, replacing the existing "To overcome these challenges, we propose..." paragraph.

To address the above dilemma, we propose LMCAN, a Lightweight Multi-Scale Centralized Attention Network for HS–MS image fusion. The central idea of LMCAN is not to independently add multiscale convolution, head interaction, and local refinement, but to organize them into a single pixel-token-preserving attention framework. First, LMCAN keeps the 1 × 1 token representation to avoid spectral mixing. Second, it injects multiscale receptive fields before attention and assigns scale-specific features to different heads, thereby compensating for the limited receptive field caused by pixel tokens without using patch merging. Third, Head-Domain Attention explicitly models correlations among these scale-aware heads, enabling cross-scale information exchange within a shallow structure. Finally, Center Attention is introduced after window attention to compensate for boundary-deficient neighborhoods. In this way, LMCAN jointly addresses spectral preservation, multiscale context modeling, cross-scale interaction, and boundary reconstruction under a lightweight design.


4. Rewritten Contributions

Must be rewritten. Do not phrase the contributions as "we propose module A, module B, module C"; phrase them as "we propose one unified mechanism, and the modules are stages of that mechanism". The current contributions are reasonably clear, but they still read as four parallel modules. (LMCAN_v2.pdf)

The main contributions of this work are summarized as follows:

  1. A unified pixel-token-preserving multiscale attention framework for HS–MS fusion. We identify the spectral-fidelity/spatial-context trade-off in Transformer-based HS–MS fusion and propose a shallow attention framework that preserves 1 × 1 pixel tokens while introducing multiscale contextual modeling without hierarchical patch merging.

  2. Scale-aware head construction before attention. Instead of splitting homogeneous feature channels as in conventional multi-head attention, LMCAN first extracts scale-specific features using parallel dilated convolutions and then assigns them to different heads. This design makes each head correspond to a distinct receptive field, allowing multiscale spatial–spectral modeling while maintaining pixel-level spectral signatures.

  3. Head-Domain Attention for explicit cross-scale communication. To overcome the independence of conventional attention heads, we design a Head-Domain Attention module that learns adaptive head-to-head affinities among scale-aware heads. This enables information exchange across receptive fields before final feature projection and strengthens the coupling between multiscale feature extraction and attention-based fusion.

  4. Center Attention for boundary-aware neighborhood compensation. We introduce a Center Attention Block that uses each reconstruction pixel as the query center and aggregates information from its spatial neighborhood. This mechanism compensates for the missing local interactions caused by fixed window partitioning and improves edge and boundary reconstruction without sacrificing spectral fidelity.

  5. Comprehensive validation of accuracy–efficiency trade-off. Experiments on CAVE, Harvard, Chikusei, and Pavia datasets, together with complexity analysis, efficiency evaluation, and ablation studies, demonstrate that LMCAN achieves competitive or superior fusion performance with significantly lower model complexity than representative CNN-based and Transformer-based methods.


5. Rewritten Related Work gap paragraph

Suggested replacement for the final paragraph of Related Work B. The current manuscript already states that "two issues remain insufficiently addressed", but the paragraph can be more focused. (LMCAN_v2.pdf)

Despite the progress of Transformer-based HS–MS fusion, existing methods still lack a unified design that simultaneously satisfies spectral fidelity, multiscale context modeling, and efficient feature interaction. Hierarchical Transformers usually obtain multiscale representations through repeated stacking or patch merging, which increases channel dimensionality and may compromise pixel-wise spectral signatures. Some recent studies introduce convolutional priors, inter-head communication, or local neighborhood attention, but these mechanisms are typically used as independent enhancements rather than as coupled components derived from the specific requirements of HS–MS fusion. In contrast, LMCAN starts from the need to preserve 1 × 1 pixel tokens and then addresses the resulting limitations in a progressive manner: multiscale receptive fields are injected before attention, scale-aware heads are explicitly correlated by Head-Domain Attention, and boundary-deficient neighborhoods are compensated by Center Attention. Therefore, the novelty of LMCAN lies in a unified shallow multiscale attention framework tailored to pixel-level spectral preservation, rather than in a simple combination of existing modules.


6. Rewritten Motivation section opening

Suggested replacement for the first half of the Motivation. Fig. 2 currently shows that 1×1 tokens preserve spectra better, but it must be turned explicitly into the starting point of the whole framework. (LMCAN_v2.pdf)

The motivation of LMCAN comes from a central requirement of HS–MS fusion: the reconstructed HR-HSI should preserve the spectral signature of each spatial pixel while recovering fine spatial details. Conventional Transformer designs often use K × K patch tokens or patch merging to reduce computational cost and enlarge the receptive field. However, for hyperspectral imagery, a patch may contain pixels with different material compositions and thus different spectral responses. Aggregating such pixels into a single token inevitably mixes their spectral signatures, which may degrade spectral reconstruction. As illustrated in Fig. 2, the 1 × 1 pixel-token strategy better preserves spectral profiles than conventional patch-based tokenization, making it more suitable for HS–MS fusion.

However, preserving 1 × 1 pixel tokens introduces a new difficulty: the receptive field of window-based attention becomes limited, and long-range or multiscale spatial dependencies cannot be sufficiently captured by a shallow Transformer block. Directly adopting hierarchical patch merging would reintroduce spectral mixing and increase the model complexity. Therefore, LMCAN follows a different design principle: multiscale context should be introduced before attention without changing the pixel-token representation. This motivates the proposed multiscale convolution module, which assigns different receptive fields to different channel groups and converts conventional homogeneous attention heads into scale-aware heads.

Once scale-aware heads are constructed, another limitation arises. In standard multi-head attention, different heads are processed independently and interact only after concatenation and output projection. This is insufficient for HS–MS fusion, because features extracted at different receptive fields should communicate adaptively during fusion. We therefore introduce Head-Domain Attention to learn head-to-head correlations among scale-aware heads. Finally, since fixed window partitioning still misses some spatially adjacent pixels at window boundaries, Center Attention is employed as a boundary-compensation mechanism. These three designs are thus not independent additions, but consecutive solutions to the same spectral-fidelity-preserving multiscale modeling problem.


7. Rewritten Methodology A opening

Suggested replacement for the text near "The network primarily consists of three components..." in the Network Architecture section. As written, that sentence reinforces the impression of assembled components. (LMCAN_v2.pdf)

The architecture of LMCAN is designed to implement the above pixel-token-preserving multiscale attention principle. Given the upsampled LR-HSI and the HR-MSI, the network first forms a spatially aligned input that contains both complete spectral information and high-resolution spatial structures. Instead of using patch merging or hierarchical token downsampling, LMCAN keeps the spatial resolution unchanged throughout the feature interaction stage. The multiscale convolution module first converts the input into scale-aware pixel-token features. These features are then fed into the Feature Interaction Module, where MMH performs token-domain attention within each scale-aware head and Head-Domain Attention exchanges information across heads. After several MMH blocks, CAB further compensates for neighborhood information missed by fixed window partitioning. Therefore, the three modules play complementary roles within one framework: multiscale convolution constructs scale-aware heads, Head-Domain Attention couples these heads, and CAB restores boundary-local context.


8. Rewritten MMH explanatory paragraph

The MMH equations themselves are sound, but the paragraph below should be inserted before them to stress the interface between MSC and HAB. The Head-Domain Attention text already presents the head-to-head affinity and the formulas distinguishing it from conventional MHA, but the reviewer should be told earlier why it is not a minor tweak of ordinary head attention. (LMCAN_v2.pdf)

The MMH block is the core component that transforms multiscale convolutional features into an explicit cross-scale attention representation. Different from conventional MHA, where all heads are obtained by linearly splitting the same token sequence into homogeneous subspaces, the heads in MMH are generated from scale-specific features. Therefore, each head has a clear physical meaning: it describes spatial–spectral dependencies under a specific receptive field. Token-domain self-attention is first performed within each head to model dependencies at that scale. Then, Head-Domain Attention is applied across head descriptors to learn how different receptive fields should be adaptively combined. This design enables scale-specific modeling and cross-scale communication within the same shallow attention layer.
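The MSC → MMH → HAB interface described above can be sketched in plain NumPy. This is only a hedged illustration of the mechanism, not the paper's implementation: the identity projections, the mean-pooled head descriptors, and names such as `mmh_sketch` and `num_heads` are assumptions made for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mmh_sketch(tokens, num_heads=3):
    """Toy MMH: per-head token attention, then head-domain attention.

    tokens: (N, C) pixel tokens; the channels are split into `num_heads`
    scale-aware groups, standing in for the dilated-conv branches of MSC.
    """
    N, C = tokens.shape
    d = C // num_heads
    heads = tokens.reshape(N, num_heads, d).transpose(1, 0, 2)  # (H, N, d)

    # 1) Token-domain self-attention inside each scale-aware head.
    out = np.empty_like(heads)
    for h in range(num_heads):
        q = k = v = heads[h]                        # identity projections for brevity
        attn = softmax(q @ k.T / np.sqrt(d))        # (N, N) token affinities
        out[h] = attn @ v

    # 2) Head-domain attention: affinities among head descriptors.
    desc = out.mean(axis=1)                         # (H, d) one descriptor per head
    affinity = softmax(desc @ desc.T / np.sqrt(d))  # (H, H) head-to-head weights
    mixed = np.einsum('gh,hnd->gnd', affinity, out) # cross-scale communication
    return mixed.transpose(1, 0, 2).reshape(N, C), affinity
```

The returned `affinity` matrix is exactly the kind of H×H head-to-head map that could be visualized to support the cross-scale-interaction claim in the ablation section.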


9. Rewritten CAB explanatory paragraph

The current definition of CAB is fairly complete, but the text should stress that CAB is not "yet another attention module": it is a boundary-compensation mechanism for window attention. (LMCAN_v2.pdf)

CAB is introduced as a boundary-compensation step after MMH rather than as an independent refinement module. Although MMH provides multiscale interaction within window attention, fixed window partitioning may still prevent spatially adjacent pixels from attending to each other when they fall into different windows. This problem is particularly harmful in HS–MS fusion, where the reconstruction of a target pixel depends not only on global context but also on its immediate local neighborhood. CAB addresses this limitation by treating each reconstruction pixel as the query center and sampling keys and values from a center-aligned neighborhood. In this way, each pixel can directly aggregate the most relevant surrounding information, including boundary-crossing neighbors, while the 1 × 1 spectral representation remains unchanged.
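The boundary-compensation role of CAB can likewise be sketched. This is a minimal toy version under stated assumptions: dot-product weighting, an edge-padded (2r+1)×(2r+1) neighborhood, and the hypothetical name `center_attention_sketch`; it is not the paper's implementation.

```python
import numpy as np

def center_attention_sketch(feat, radius=1):
    """Toy CAB: each pixel is a query; keys/values come from its
    center-aligned (2*radius+1)^2 neighborhood, so neighbors lying
    across window borders remain visible.  feat: (H, W, C)."""
    H, W, C = feat.shape
    pad = np.pad(feat, ((radius, radius), (radius, radius), (0, 0)), mode='edge')
    out = np.empty_like(feat)
    for i in range(H):
        for j in range(W):
            q = feat[i, j]                                    # (C,) query center
            nb = pad[i:i + 2*radius + 1, j:j + 2*radius + 1]  # local neighborhood
            kv = nb.reshape(-1, C)                            # (K, C) keys = values
            w = np.exp(kv @ q / np.sqrt(C))                   # unnormalized weights
            out[i, j] = (w / w.sum()) @ kv                    # weighted aggregation
    return out
```

Because the aggregation is centered on each pixel rather than on a fixed window grid, a pixel at a window border still attends to its true spatial neighbors, which is the point the rewritten CAB paragraph makes.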


10. Rewritten Fig. 1 / Fig. 2 captions and in-text references

Fig. 1 should be redrawn; keeping the current figure as Fig. 1 is not recommended

The current Fig. 1 is only "Conventional MHA vs. LMCAN", whereas the reviewer cares about the "complete framework" and the links between components. Redraw Fig. 1 as a problem-to-solution framework:

Top row:
HS–MS fusion requirement → preserve pixel spectra → use 1×1 tokens

Middle row:
Problem caused by 1×1 tokens → limited receptive field → pre-attention multiscale heads

Bottom row:
Problem caused by independent heads → HAB cross-scale communication → problem caused by window boundaries → CAB boundary compensation

Output (right):
spectral fidelity + multiscale spatial context + low complexity

New caption:

Fig. 1. Design rationale of LMCAN. The proposed framework is derived from the spectral-fidelity requirement of HS–MS fusion. LMCAN first preserves 1 × 1 pixel tokens to avoid spectral mixing, then injects multiscale receptive fields before attention to compensate for the limited context of pixel tokens. Head-Domain Attention explicitly models cross-scale correlations among scale-aware heads, while Center Attention compensates for missing boundary neighborhoods caused by fixed window partitioning. These coupled designs form a unified shallow multiscale attention framework rather than a simple combination of independent modules.

Fig. 2 can be kept, but its quantitative content needs strengthening

The direction of the current Fig. 2 is right, but its curves and annotations need to be clearer: add the numerical "average SAM / mean spectral error for n×n tokens vs. 1×1 tokens", ideally as a bar plot or a table embedded in the figure. At present Fig. 2 only shows that "1×1 preserves spectra better" without any direct connection to HAB/CAB. The in-text reference can be changed to:

As shown in Fig. 2, patch-based tokenization introduces larger spectral deviations because each token aggregates multiple spatial pixels with potentially different spectral signatures. This observation motivates the use of 1 × 1 pixel tokens in LMCAN. However, this choice reduces the receptive field of window attention, which further motivates the subsequent multiscale head construction and head-domain interaction. Therefore, Fig. 2 provides the empirical basis for the first design decision of LMCAN, while Fig. 1 summarizes how the remaining modules are derived from this decision.


11. Rewritten Ablation Study section

The gaps in the current Table VI are very small: the full model reaches 49.04 dB while Exp. I/II/III reach 48.92, 48.99, and 48.86 dB, so the reviewer may conclude that the modules' contributions are insignificant. (LMCAN_v2.pdf) Add a cumulative ablation instead of relying only on a replacement ablation.

Suggested structure for the new table:

| Variant | 1×1 token | MSC | HAB | CAB | PSNR | SAM | ERGAS | SSIM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline W-MSA | ✓ | × | × | × | | | | |
| + MSC | ✓ | ✓ | × | × | | | | |
| + MSC + HAB | ✓ | ✓ | ✓ | × | | | | |
| + MSC + HAB + CAB | ✓ | ✓ | ✓ | ✓ | | | | |

Suggested replacement for the ablation introduction:

To verify that the proposed modules are coupled components of a unified framework rather than independent enhancements, we conduct cumulative and replacement ablation studies on the CAVE dataset. Starting from a pixel-token window-attention baseline, we progressively add the multiscale convolution module, Head-Domain Attention, and Center Attention. This setting directly evaluates the proposed design logic: MSC compensates for the limited receptive field of 1 × 1 tokens, HAB enables communication among the resulting scale-aware heads, and CAB recovers boundary-local neighborhoods missed by fixed window partitioning.

Suggested replacement for the ablation analysis:

The cumulative results show that each component improves the fusion performance in a progressive manner. Adding MSC improves spatial context modeling by assigning different receptive fields to different heads. Introducing HAB further improves spectral–spatial reconstruction because the scale-aware heads are no longer processed independently; instead, their correlations are adaptively learned in the head domain. Finally, adding CAB improves boundary-sensitive reconstruction by allowing each target pixel to aggregate information from its center-aligned neighborhood. These results confirm that the performance gain of LMCAN comes from the coordinated interaction among the three designs, rather than from a simple increase in model capacity.

Two kinds of supplementary experiments are also recommended:

  1. Boundary-region evaluation: compute PSNR/SAM/ERGAS separately on an edge mask or gradient mask to demonstrate the necessity of CAB.
  2. Head-affinity visualization: show the H×H affinity matrix learned by HAB to demonstrate that the different scale-aware heads really interact adaptively, rather than the novelty existing only in the formulas.
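The first supplementary experiment could be implemented along these lines. A minimal sketch, assuming a gradient-magnitude edge mask and single-band images normalized to [0, 1]; `boundary_psnr` and `grad_thresh` are illustrative names, not from the paper.

```python
import numpy as np

def boundary_psnr(ref, fused, grad_thresh=0.1):
    """Hedged sketch of the suggested boundary-region evaluation:
    PSNR computed only on pixels whose reference gradient magnitude
    exceeds a threshold (one possible edge-mask definition).
    ref, fused: (H, W) single-band images in [0, 1]."""
    gy, gx = np.gradient(ref)
    mask = np.sqrt(gx**2 + gy**2) > grad_thresh   # edge / boundary mask
    if not mask.any():
        return float('inf')                       # no boundary pixels found
    mse = np.mean((ref[mask] - fused[mask])**2)
    return float('inf') if mse == 0 else 10 * np.log10(1.0 / mse)
```

The same masking idea extends to SAM and ERGAS by restricting their pixel sums to `mask`; reporting these masked metrics with and without CAB would give the reviewer direct evidence for the boundary-compensation claim.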

12. Rewritten Conclusion

Suggested replacement for the current Conclusion, which is still a module-by-module list ("first MSC, then MMH, then CAB"). (LMCAN_v2.pdf)

In this paper, we proposed LMCAN, a lightweight pixel-token-preserving multiscale attention framework for HS–MS image fusion. The proposed method is motivated by the trade-off between spectral fidelity and spatial context modeling in Transformer-based fusion. By preserving 1 × 1 pixel tokens, LMCAN avoids spectral mixing caused by patch merging. To compensate for the limited receptive field of pixel tokens, multiscale receptive fields are introduced before attention and assigned to different heads. Head-Domain Attention further enables adaptive cross-scale communication among these scale-aware heads, while Center Attention compensates for missing local neighborhoods near window boundaries. In this way, LMCAN integrates spectral preservation, multiscale spatial modeling, cross-scale interaction, and boundary refinement into a unified shallow architecture. Extensive experiments on simulated and remote sensing datasets demonstrate that LMCAN achieves a favorable balance between reconstruction accuracy, spectral fidelity, and computational efficiency.


IV. Other parts that must also be revised

  1. Unify the naming: SL-MMHA, MMH, HAB, Head-Domain Attention, Center Attention, and Centralized Attention all appear in the paper. Recommended unification:
    MSC → MMH → HDA/HAB → CAB. The "Centralized Attention" in the title and the "Center Attention Block" in the body must also be reconciled explicitly, otherwise the reviewer will find the terminology inconsistent.

  2. Clarify the relationship between H and the three dilation branches: MSC says the features are split into three heads with dilation = 1, 2, 3, while the MMH formulation uses a general H heads. State explicitly:
    "For simplicity, we use three scale-aware groups in implementation; the formulation can be generalized to H heads."

  3. Check the scaling term in Eq. 25: write the softmax scaling of head-domain attention as sqrt(d_a); avoid the easily misread sqrt(H)·d_a or any other ambiguous form.

  4. Avoid the symbol clash on M: earlier in the paper M denotes the window size, while in CAB M denotes the number of heads. Use S for the window size and H_c or M_h for the head number.

  5. Move Fig. 3 forward or renumber it as Fig. 2: the overall network diagram currently appears only as Fig. 3, while the review asks for the "complete framework". Make the new design-rationale figure Fig. 1, the overall architecture Fig. 2, and the spectral-fidelity comparison Fig. 3 or supplementary material.

  6. Provide experimental evidence that it is "not a simple combination": words alone are not enough. Support the claim with the cumulative ablation, the head-affinity visualization, and the boundary-mask evaluation.

  7. Tone down the experimental claims: for example, be cautious with "state-of-the-art performance across all datasets", because on Pavia the SSIM ties several methods at 0.996 and on CAVE the SAM ties PSRT. Write "competitive or superior" instead.

  8. Language details:
    Change "Another way" to "In addition" or "Recent studies also show".
    Standardize "Cave dataset" to "CAVE dataset".
    Standardize "Table 4" to "Table IV".
    Use either "multiscale" or "multi-scale" consistently.
    Change "LMCAN(Ours)" in figure captions to "LMCAN (Ours)".

  9. Compare MSST-Net, PSRT, PanFormer, HyperTransformer, etc. more pointedly in Related Work: do not merely list the methods; state explicitly what each fails to address: patch merging, head independence, boundary neighborhoods, pixel-token spectral preservation.

  10. State explicitly in the response to reviewers:
    "We have revised the Abstract, Introduction, Related Work, Motivation, Methodology, Fig. 1, Fig. 2, and Ablation Study to clarify that LMCAN is a unified framework derived from the spectral-fidelity/spatial-context trade-off, rather than a simple combination of independent components."
