Road Scene Semantic Segmentation Network Based on Multi-Scale Transformer Features
DOI:
Authors: Peng Yang, Wu Wenhuan, Zhang Haokun
Affiliation:

School of Intelligent and Connected Vehicle, Hubei University of Automotive Technology, Shiyan 442002, China

Author Biography:

Peng Yang (2000—), male, master's student; research interests: computer vision and semantic segmentation. E-mail: 1172390843@qq.com.

Corresponding Author:

CLC Number:

TP391.41; U491.1

Fund Projects:

Joint Fund of the Hubei Provincial Natural Science Foundation (2025AFD239); Doctoral Research Start-up Fund of Hubei University of Automotive Technology (BK202347)




    Abstract:

    Images in road scenes are usually complex in content, with large differences in scale and shape between objects, and lighting and shadow can make scenes difficult to recognize. Existing semantic segmentation methods, however, often fail to effectively extract and fully fuse multi-scale semantic features, resulting in poor generalization and robustness. To address these issues, this study proposes a semantic segmentation network that fuses multi-scale Transformer features. First, a CSWin Transformer is employed to extract semantic features at multiple scales, and a feature refinement module (FRM) is introduced to enhance the semantic discrimination of the deep, small-scale features. Second, an attention aggregation module (AAM) is adopted to aggregate the features at each scale separately. Finally, the enhanced multi-scale features are fused to further strengthen their semantic expressiveness and thereby improve segmentation performance. Experimental results show that the proposed network achieves an accuracy of 82.3% on the Cityscapes dataset, outperforming SegNeXt and ConvNeXt by 2.2 and 1.2 percentage points, respectively, and an accuracy of 47.4% on the highly challenging ADE20K dataset, surpassing SegNeXt and ConvNeXt by 3.2 and 1.8 percentage points, respectively. The proposed multi-scale Transformer feature fusion model not only attains high semantic segmentation accuracy, correctly predicting the pixel-level semantic categories of road scene images, but also exhibits strong generalization and robustness.
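    The abstract fixes the overall pipeline (CSWin Transformer backbone, FRM on the deepest features, per-scale AAM, then fusion) but not the internals of FRM or AAM. The sketch below is a minimal PyTorch illustration of how such a pipeline could be wired, not the authors' implementation: the channel-attention FRM, spatial-attention AAM, channel widths, and the names MultiScaleSegNet and DummyBackbone are all hypothetical stand-ins.

    # Minimal sketch only: FRM/AAM internals are not given in the abstract, so the
    # channel-attention (FRM) and spatial-attention (AAM) blocks below are
    # hypothetical stand-ins, as are all channel sizes and module names.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FRM(nn.Module):
        # Assumed refinement: reweight channels of the deepest feature map.
        def __init__(self, c):
            super().__init__()
            self.fc = nn.Sequential(nn.Linear(c, c // 4), nn.ReLU(),
                                    nn.Linear(c // 4, c), nn.Sigmoid())

        def forward(self, x):                       # x: (B, C, H, W)
            w = self.fc(x.mean(dim=(2, 3)))         # global average pool -> channel weights
            return x * w[:, :, None, None]

    class AAM(nn.Module):
        # Assumed aggregation: reweight each spatial location within one scale.
        def __init__(self, c):
            super().__init__()
            self.conv = nn.Conv2d(c, 1, kernel_size=1)

        def forward(self, x):
            return x * torch.sigmoid(self.conv(x))

    class MultiScaleSegNet(nn.Module):
        def __init__(self, backbone, chans=(64, 128, 256, 512), num_classes=19):
            super().__init__()
            self.backbone = backbone                # e.g. a CSWin Transformer yielding 4 feature maps
            self.frm = FRM(chans[-1])               # step 1: refine deep, small-scale features
            self.aam = nn.ModuleList([AAM(c) for c in chans])            # step 2: per-scale aggregation
            self.proj = nn.ModuleList([nn.Conv2d(c, 128, 1) for c in chans])
            self.head = nn.Sequential(nn.Conv2d(128 * len(chans), 128, 1), nn.ReLU(),
                                      nn.Conv2d(128, num_classes, 1))

        def forward(self, img):
            feats = list(self.backbone(img))        # feature maps at strides 4/8/16/32
            feats[-1] = self.frm(feats[-1])
            feats = [p(a(f)) for p, a, f in zip(self.proj, self.aam, feats)]
            size = feats[0].shape[-2:]              # step 3: fuse all scales at the finest resolution
            fused = torch.cat([F.interpolate(f, size, mode="bilinear", align_corners=False)
                               for f in feats], dim=1)
            logits = self.head(fused)
            return F.interpolate(logits, img.shape[-2:], mode="bilinear", align_corners=False)

    # Smoke test with a dummy convolutional backbone standing in for CSWin Transformer.
    class DummyBackbone(nn.Module):
        def __init__(self, chans=(64, 128, 256, 512)):
            super().__init__()
            strides, prev, stages = (4, 2, 2, 2), 3, []
            for c, s in zip(chans, strides):
                stages.append(nn.Conv2d(prev, c, 3, stride=s, padding=1))
                prev = c
            self.stages = nn.ModuleList(stages)

        def forward(self, x):
            outs = []
            for stage in self.stages:
                x = stage(x)
                outs.append(x)
            return outs

    model = MultiScaleSegNet(DummyBackbone())
    print(model(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 19, 256, 256])

    Here all four scales are projected to a common width and fused by simple concatenation before the segmentation head; the paper's actual fusion strategy may differ.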

Cite This Article

PENG Yang, WU Wenhuan, ZHANG Haokun. Road scene semantic segmentation network based on multi-scale Transformer features[J]. Journal of East China Jiaotong University, 2025, 42(2): 110-118.

History
  • Received: 2024-09-14
  • Published online: 2025-05-16