Let's Build MobileNet v1 & v2 with PyTorch

16 min read · Sep 30, 2024

This article continues the series "Building Models with PyTorch". Earlier installments built LeNet5, AlexNet, VGG-16, Inception, ResNet, and DenseNet. This one covers another very popular model, MobileNet, which Google introduced in 2017. Its design goal is fast inference with good accuracy in a small model, because it is meant to run on small devices such as mobile phones and embedded devices.

MobileNet comes in three versions. This article covers V1 and V2.

MobileNet V1

Some claim that this model is up to 10 times faster than VGG16 and also smaller; we will check whether that holds. The model has three key components:

It uses an architecture called Depth-wise Separable Convolution, which splits a convolution into two parts: a depthwise convolution (a 3x3 convolution that extracts features) and a pointwise convolution (a 1x1 convolution that combines those features and sets the number of output channels).
It uses ReLU6 instead of ReLU. ReLU6 is ReLU with the output capped at 6. The designers were targeting embedded devices, where the model may have to be shrunk further (for example by running it in low precision), and the bounded range of ReLU6 makes that shrinking easier.
It adds two hyper-parameters, Width Multiplier and Resolution Multiplier, for trading accuracy against efficiency. The Width Multiplier (α) controls the width of the network and makes it thinner; the Resolution Multiplier (ρ) scales the input image so that less computation is needed. Both are covered in detail below.
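To make the ReLU6 component concrete, here is a minimal pure-Python sketch of the activation. The function name `relu6` is only an illustration; in PyTorch the same behavior is provided by `nn.ReLU6`.

```python
def relu6(x: float) -> float:
    """ReLU6: clamp the activation to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


# Negative values become 0, values above 6 are capped at 6.
for value in (-2.0, 0.5, 6.0, 9.5):
    print(value, "->", relu6(value))
```

Because the output never exceeds 6, the range of activations is known in advance, which is the property the text refers to.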

Standard Convolution

First, a quick review of standard convolution. As the figure below shows, a kernel (or filter) is applied to every channel of the input. In the figure, a 3x3 kernel (orange) performs a 2D convolution on the input. Although each kernel slice is only 3x3x1, applying it across all n channels is equivalent to performing a 3D convolution with a 3x3xn kernel. The result is a single value (red), because the operation is a dot product between the orange kernel and the purple data.

This operation mixes two things: spatial filtering, where the window in each channel is multiplied with the kernel so that the kernel acts as a filter for the important information, and linear combination, where the per-channel results are merged into a single value.
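The two roles can be seen in a small pure-Python sketch that computes one output value of a standard convolution. The function name `conv_output_value` and the toy numbers are assumptions made for illustration only.

```python
def conv_output_value(patch, kernel):
    """Compute one output value of a standard convolution.

    patch and kernel are lists of M channels; each channel is a
    k x k list of rows.
    """
    per_channel = []
    for channel_patch, channel_kernel in zip(patch, kernel):
        # Spatial filtering: dot product of the k x k window with
        # the matching k x k slice of the kernel.
        filtered = sum(
            p * w
            for patch_row, kernel_row in zip(channel_patch, channel_kernel)
            for p, w in zip(patch_row, kernel_row)
        )
        per_channel.append(filtered)
    # Linear combination: the per-channel results are summed into one value.
    return sum(per_channel), per_channel


# Toy input: 2 channels, 3x3 window.
patch = [[[1] * 3 for _ in range(3)], [[2] * 3 for _ in range(3)]]
kernel = [[[1] * 3 for _ in range(3)], [[0.5] * 3 for _ in range(3)]]
total, per_channel = conv_output_value(patch, kernel)
print(per_channel, total)
```

In a standard convolution both steps happen inside the same kernel, which is why they cannot be tuned separately.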

Depth-wise Separable Convolution

A standard convolution cannot separate these two operations, spatial filtering and linear combination. This led to a family of methods called factorized convolution, of which Depthwise Separable Convolution is one. It performs the spatial filtering and the linear combination as separate steps, as shown in the figure.

Depthwise Separable Convolution is a CNN building block designed to be lightweight (few computations) while still performing well. It splits the standard convolution into two steps: depthwise convolution, which acts as the spatial filter, and pointwise convolution, which performs the linear combination, as in the figure above.

Depth-wise Convolution

Depthwise convolution applies a single convolution kernel to each input channel. The figure shows a k x k kernel; in practice 3x3 is the usual choice. The key point is that, unlike standard convolution, where every kernel spans all input channels, depthwise convolution gives each channel its own kernel and applies it only once.

This cuts the computation cost and the number of parameters significantly, which suits resource-constrained environments, with only a small loss of accuracy.

Now let's compare the computation cost of standard convolution and depthwise convolution.

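In the figure above, an input feature map of size Df * Df * M is convolved with N kernels of size Dk * Dk * M, where Df is the spatial size of the input feature map, Dk is the size of the convolution kernel, M is the number of input channels, and N is the number of output channels.

The computational cost of the whole operation is Dk * Dk * M * N * Df * Df.

In a standard convolution, the computational cost therefore depends on the number of input channels, the number of output channels, the size of the input feature map, and the size of the convolution kernel, all multiplied together.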

In depthwise convolution, shown in the figure below, a separate kernel is created for each channel, so each channel has exactly one kernel, and that kernel is applied only to its own channel. There are therefore M kernels, equal to the number of input channels.

The design rests on the assumption that a standard convolution carries many weights (parameters) that go unused, because a given kernel may work well for only some channels, which is why many kernels are needed to capture the features of the different channels. Giving each channel a kernel dedicated to its own data makes the kernel parameters more meaningful.

So the computational cost of the whole operation is Dk * Dk * M * Df * Df.

Compared with standard convolution, the computation drops considerably (by a factor of N) while the features are still preserved.

In short, depthwise convolution lowers the computational cost substantially by processing each channel separately, which significantly reduces the number of computations and parameters while retaining the ability to learn the important features.
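The comparison can be checked with a few lines of Python. This is a minimal sketch using the symbols from the text (Dk, M, N, Df); the function names and the layer sizes are hypothetical values chosen for illustration.

```python
def standard_conv_cost(dk: int, m: int, n: int, df: int) -> int:
    """Multiply-adds of a standard convolution: Dk * Dk * M * N * Df * Df."""
    return dk * dk * m * n * df * df


def depthwise_conv_cost(dk: int, m: int, df: int) -> int:
    """Multiply-adds of a depthwise convolution: Dk * Dk * M * Df * Df."""
    return dk * dk * m * df * df


# Hypothetical layer: 3x3 kernel, 512 input and output channels, 14x14 map.
dk, m, n, df = 3, 512, 512, 14
std = standard_conv_cost(dk, m, n, df)
dw = depthwise_conv_cost(dk, m, df)
print("standard:", std)
print("depthwise:", dw)
print("reduction factor:", std // dw)  # equals N
```

The reduction factor equals N because the depthwise step does not combine channels at all; that job is left to the pointwise convolution.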

Point-wise Convolution

Depthwise convolution only extracts features from each input channel; nothing yet combines those features into more complex ones. An additional layer, pointwise convolution, takes on that job: it uses a 1x1 convolution to combine the outputs of the depthwise convolution and, at the same time, sets the number of output channels.

Let's look at how pointwise convolution works, as shown in the figure below.

A 1x1 convolution is applied to the input, and its computational cost can be calculated as follows:

Pointwise convolution cost: M * N * Df * Df

That is the size of the data (Df * Df) multiplied by the number of channels (M) and by N. Adding the costs of the depthwise and pointwise convolutions gives:

Depthwise separable convolutions cost: Dk * Dk * M * Df * Df + M * N * Df * Df

Compared with the computational cost of a standard convolution, the computation is reduced to the fraction (Dk * Dk * M * Df * Df + M * N * Df * Df) / (Dk * Dk * M * N * Df * Df) = 1/N + 1/Dk².

For example, with N = 1024 and Dk = 3 the ratio is about 0.112, which means a standard convolution needs roughly 9 times as many multiply-adds as a depthwise separable convolution.
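The worked example with N = 1024 and Dk = 3 can be reproduced with a short sketch. The function names are illustrative, and the values of M and Df are assumptions; they cancel out of the ratio.

```python
def standard_conv_cost(dk: int, m: int, n: int, df: int) -> int:
    """Dk * Dk * M * N * Df * Df"""
    return dk * dk * m * n * df * df


def separable_conv_cost(dk: int, m: int, n: int, df: int) -> int:
    """Depthwise cost plus pointwise cost."""
    depthwise = dk * dk * m * df * df
    pointwise = m * n * df * df
    return depthwise + pointwise


def cost_ratio(dk: int, n: int) -> float:
    """Closed form of separable / standard: 1/N + 1/Dk^2."""
    return 1 / n + 1 / (dk * dk)


dk, n = 3, 1024
m, df = 512, 7  # assumed values; the ratio does not depend on them
measured = separable_conv_cost(dk, m, n, df) / standard_conv_cost(dk, m, n, df)
print(round(cost_ratio(dk, n), 3))   # closed form
print(round(measured, 3))            # same ratio from the full formulas
print(round(1 / cost_ratio(dk, n), 1))  # how many times cheaper
```

For large N the 1/N term is negligible, so the saving is dominated by 1/Dk², which is why a 3x3 kernel gives a saving of close to 9 times.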

The figure below shows the overall operation of depth-wise separable convolution.

Concrete example of depth-wise separable convolutions (source)

Now that the idea is clear, let's look at the code.

import torch
import torch.nn as nn

class DepthWiseSeperable(nn.Module):

    def __init__(self, in_channels , out_channels , stride ):
        """
        DepthWiseSeperable block of MobileNet which performs the following operations:
        (a) depthwise convolution by applying a separate filter for each channel
        (b) pointwise convolutions are applied which combine the filtered result by implementing 1 × 1 convolution
        
            Note:
                1. groups = in_channels used for depthwise convolution
                2. in_channels and out_channels are same for depthwise convolution
                3. bias = False due to the usage of BatchNorm 
                4. To generate same height and width of output feature map as the input feature map, following should be padding for
                    * 1x1 conv : p=0
                    * 3x3 conv : p=1
                    * 5x5 conv : p=2

        Args:
          in_channels (int) : number of input channels
          out_channels (int) : number of output channels 
          stride (int) : stride used for depthwise convolution

        Attributes:
            Depthwise seperable convolutional block

        """

        super(DepthWiseSeperable,self).__init__()
        
        # groups used here
        self.depthwise = nn.Conv2d(in_channels = in_channels , out_channels = in_channels , stride = stride , padding = 1, kernel_size = 3 , groups=in_channels , bias = False)
        self.bn1 = nn.BatchNorm2d(in_channels)

        self.pointwise = nn.Conv2d(in_channels = in_channels , out_channels = out_channels , stride = 1 , padding = 0, kernel_size = 1, bias = False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        self.relu = nn.ReLU6()

    def forward(self,x):

        x = self.depthwise(x)
        x = self.bn1(x)
        x = self.relu(x)
        
        x = self.pointwise(x)
        x = self.bn2(x)
        x = self.relu(x)
        
        return x

The code shows that the block consists of just two steps:

depthwise(x): a 3x3 convolution followed by batch normalization and ReLU6. Note the argument groups=in_channels in the line self.depthwise = nn.Conv2d: it gives each channel its own kernel, which is the heart of depthwise convolution. in_channels and out_channels are also equal, another characteristic of depthwise convolution.
pointwise(x): a 1x1 convolution followed by batch normalization and ReLU6.

That is all there is to it. The design favors simplicity, because the goal is fast processing and a small model. The heritage is still visible: the 3x3 convolution as the main building block comes from VGG and the 1x1 convolution from Inception, with depthwise convolution added on top.

The DepthWiseSeperable block is shown in detail in the figure below.

Building MobileNet-v1

With the main building block, the depthwise separable convolution, in place, we can now assemble the network.

As the figure shows, MobileNet v1 consists of the following stages:

One 3x3 convolution to extract basic features
Three groups of two DS (depthwise separable) convolution blocks
One group of six DS convolution blocks
One final DS convolution block
Average pooling followed by a fully connected layer and softmax for classification

The details are given in the table:

Start with a 3x3 convolution, stride 2, 32 channels, on a 224 x 224 x 3 image; the output is 112 x 112 x 32
Depthwise 3x3, stride 1, 32 channels; output 112 x 112 x 32
Pointwise 1x1, 64 channels; output 112 x 112 x 64
Depthwise 3x3, stride 2, 64 channels; output 56 x 56 x 64
Pointwise 1x1, 128 channels; output 56 x 56 x 128
Depthwise 3x3, stride 1, 128 channels; output 56 x 56 x 128
Pointwise 1x1, 128 channels; output 56 x 56 x 128
Depthwise 3x3, stride 2, 128 channels; output 28 x 28 x 128
Pointwise 1x1, 256 channels; output 28 x 28 x 256
Depthwise 3x3, stride 1, 256 channels; output 28 x 28 x 256
Pointwise 1x1, 256 channels; output 28 x 28 x 256
Depthwise 3x3, stride 2, 256 channels; output 14 x 14 x 256
Pointwise 1x1, 512 channels; output 14 x 14 x 512
Depthwise 3x3, stride 1, 512 channels, followed by pointwise 1x1, 512 channels, repeated 5 times; output 14 x 14 x 512
Depthwise 3x3, stride 2, 512 channels; output 7 x 7 x 512
Pointwise 1x1, 1024 channels; output 7 x 7 x 1024
Depthwise 3x3, stride 1, 1024 channels (the spatial size stays at 7 x 7, matching the code below); output 7 x 7 x 1024
Pointwise 1x1, 1024 channels; output 7 x 7 x 1024
Average pooling, fully connected layer, and softmax for classification

That makes 28 layers in total. Most of the work happens in the pointwise convolutions.
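The output sizes and the layer count can be traced with a small pure-Python sketch. The block list below follows the 13 depthwise separable blocks of the MobileNetV1 class later in this article; the helper names are assumptions, and padding 1 is assumed for every 3x3 convolution, as in the code.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-size formula for a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


# (in_channels, out_channels, stride) of each depthwise separable block.
DS_BLOCKS = (
    [(32, 64, 1), (64, 128, 2), (128, 128, 1), (128, 256, 2),
     (256, 256, 1), (256, 512, 2)]
    + [(512, 512, 1)] * 5
    + [(512, 1024, 2), (1024, 1024, 1)]
)


def trace_shapes(input_size: int = 224):
    """Return the (H, W, C) shape after the first conv and after each block."""
    size = conv_out_size(input_size, kernel=3, stride=2, padding=1)
    shapes = [(size, size, 32)]
    for _in_ch, out_ch, stride in DS_BLOCKS:
        # The 3x3 depthwise conv may change the spatial size;
        # the 1x1 pointwise conv only changes the channel count.
        size = conv_out_size(size, kernel=3, stride=stride, padding=1)
        shapes.append((size, size, out_ch))
    return shapes


shapes = trace_shapes()
for shape in shapes:
    print(shape)

# 1 initial conv + 2 convs per DS block + 1 fully connected layer
layer_count = 1 + 2 * len(DS_BLOCKS) + 1
print("layers:", layer_count)
```

Counting the depthwise and pointwise convolutions as separate layers is what gives the total of 28.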

Width Multiplier: Thinner Models

The MobileNet described so far is already small: with 224x224 input images it has only 4.2 million parameters, tiny compared with the 138 million of VGG16, so it needs little processing power. The designers nevertheless added two parameters that shrink it further for even smaller systems: the Width Multiplier and the Resolution Multiplier.

Despite the name, the Width Multiplier acts more like a divisor, because it multiplies by a fraction between 0 and 1. It is a mechanism for tuning model complexity and speed by reducing the number of parameters and the model size. It is denoted α (alpha) and takes values between 0 and 1, for example 0.25, 0.5, 0.75, or 1.0.

The principle is to reduce the number of kernels in every convolutional layer by the given proportion, which reduces both the computation and the memory use.

In the MobileNet V1 architecture, each layer has a fixed number of kernels when α = 1 (the full, normal count). When α is less than 1, the number of kernels in each layer shrinks in proportion; with α = 0.5, for instance, every layer uses half the full number.

For example, a layer with 256 kernels is left with 128 when α = 0.5. Fewer parameters have to be learned and stored, so the model runs faster and uses less power on constrained devices such as smartphones and IoT devices. Reducing the number of kernels can cost some accuracy, though, especially on complex tasks.

Let's look at the effect of α. With α included, the computational cost becomes the expression shown in the figure below.

Computational Cost: Depthwise separable convolution with width multiplier: Dk * Dk * αM * Df * Df + αM * αN * Df * Df

Although α can take any value between 0 and 1, the usual choices are 1, 0.75, 0.5, and 0.25. α = 1 gives the baseline MobileNet and α < 1 gives a reduced MobileNet. The Width Multiplier reduces the computational cost by roughly a factor of α².
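The "roughly α²" claim can be checked numerically. The sketch below applies α to both channel counts before computing the depthwise separable cost; the function names and the layer sizes are assumptions chosen for illustration.

```python
def scaled_channels(channels: int, alpha: float) -> int:
    """Number of channels left after applying the width multiplier."""
    return int(channels * alpha)


def separable_cost(dk: int, m: int, n: int, df: int, alpha: float = 1.0) -> int:
    """Dk*Dk*(alpha*M)*Df*Df + (alpha*M)*(alpha*N)*Df*Df"""
    m_a = scaled_channels(m, alpha)
    n_a = scaled_channels(n, alpha)
    return dk * dk * m_a * df * df + m_a * n_a * df * df


print(scaled_channels(256, 0.5))  # the 256 -> 128 example from the text

# Hypothetical layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
full = separable_cost(3, 512, 512, 14, alpha=1.0)
half = separable_cost(3, 512, 512, 14, alpha=0.5)
ratio = half / full
print(round(ratio, 4), "vs alpha^2 =", 0.5 ** 2)
```

The ratio is slightly above α² because the depthwise term scales with α rather than α²; the pointwise term dominates, so the overall effect stays close to quadratic.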

Because the Width Multiplier affects the DepthWiseSeperable class, the class has to be updated as follows:

class DepthWiseSeperable(nn.Module):
    def __init__(self, in_channels, out_channels, stride, width_multiplier=1.0):
        """
        DepthWiseSeperable block of MobileNet with support for Width Multiplier.
        
        Args:
          in_channels (int): Number of input channels.
          out_channels (int): Number of output channels.
          stride (int): Stride used for depthwise convolution.
          width_multiplier (float): Width multiplier to scale number of channels.
        """
        super(DepthWiseSeperable, self).__init__()

        # Adjust the number of input and output channels using the width multiplier
        in_channels = int(in_channels * width_multiplier)
        out_channels = int(out_channels * width_multiplier)
        
        # Depthwise convolution (with groups = in_channels)
        self.depthwise = nn.Conv2d(in_channels=in_channels, out_channels=in_channels, 
                                   stride=stride, padding=1, kernel_size=3, 
                                   groups=in_channels, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)

        # Pointwise convolution
        self.pointwise = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, 
                                   stride=1, padding=0, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # ReLU6, as in the original DepthWiseSeperable block above
        self.relu = nn.ReLU6()

    def forward(self, x):
        x = self.depthwise(x)
        x = self.bn1(x)
        x = self.relu(x)
        
        x = self.pointwise(x)
        x = self.bn2(x)
        x = self.relu(x)
        
        return x

Width Multiplier (width_multiplier): multiplied with in_channels and out_channels to reduce the number of kernels in each layer according to the chosen value.

Resolution Multiplier: Reduced Representation

The second parameter in MobileNet V1 is the Resolution Multiplier, a mechanism that shrinks the input image fed to the model by lowering its resolution. This reduces the computation in the convolution layers as well, so the model runs faster and uses less memory, which helps on power- or memory-constrained targets such as mobile or IoT devices.

The Resolution Multiplier (sometimes written ρ, or rho) is the value multiplied with the input image size to lower its resolution. For example, if MobileNet V1 normally takes 224x224 input and we set ρ = 0.5, the input image shrinks to 112x112 pixels. The computational cost with both the Width Multiplier and the Resolution Multiplier applied is shown in the figure.

Computational cost by applying width multiplier and resolution multiplier: Dk * Dk * αM * ρDf * ρDf + αM * αN * ρDf * ρDf

Using the Resolution Multiplier shrinks the feature maps throughout the model, which reduces the computation in every layer. The model runs faster because there are fewer pixels and smaller feature maps to process, and less computation also means less power. Because the image resolution is lower, however, the model may lose some detail, which lowers accuracy on tasks that need fine detail.
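As a rough check of the effect of ρ, the sketch below extends the cost formula with both multipliers. The function name and the layer sizes are assumptions for illustration only.

```python
def separable_cost(dk: int, m: int, n: int, df: int,
                   alpha: float = 1.0, rho: float = 1.0) -> int:
    """Cost of a depthwise separable layer with both multipliers applied."""
    m_a, n_a = int(alpha * m), int(alpha * n)  # width multiplier
    df_r = int(rho * df)                       # resolution multiplier
    return dk * dk * m_a * df_r * df_r + m_a * n_a * df_r * df_r


# Input image: 224 -> 112 when rho = 0.5, as in the text.
print(int(224 * 0.5))

# Hypothetical layer: 3x3 kernel, 64 -> 128 channels, 112x112 feature map.
full = separable_cost(3, 64, 128, 112)
half_res = separable_cost(3, 64, 128, 112, rho=0.5)
print(full // half_res)  # cost drops by a factor of 1 / rho^2
```

Both terms of the formula contain Df * Df, so halving the resolution divides the cost of every layer by exactly 4 when the sizes divide evenly.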

The following program defines the MobileNet v1 model:

class MobileNetV1(nn.Module):
    
    def __init__(self, num_classes=1000, resolution_multiplier=1.0, width_multiplier=1.0):
        super(MobileNetV1, self).__init__()

        self.resolution_multiplier = resolution_multiplier
        
        # Initial convolution layer
        self.features = nn.Sequential(
            nn.Conv2d(3, int(32 * width_multiplier), kernel_size=3, stride=2, padding=1, bias=False),
            # Conv -> BatchNorm -> ReLU6, the same order as in DepthWiseSeperable
            nn.BatchNorm2d(int(32 * width_multiplier)),
            nn.ReLU6(inplace=True),
        )
        
        # Depthwise separable convolutions
        self.features = nn.Sequential(
            self.features,
            DepthWiseSeperable(32, 64, 1, width_multiplier),
            DepthWiseSeperable(64, 128, 2, width_multiplier),
            DepthWiseSeperable(128, 128, 1, width_multiplier),
            DepthWiseSeperable(128, 256, 2, width_multiplier),
            DepthWiseSeperable(256, 256, 1, width_multiplier),
            DepthWiseSeperable(256, 512, 2, width_multiplier),
            
            DepthWiseSeperable(512, 512, 1, width_multiplier),
            DepthWiseSeperable(512, 512, 1, width_multiplier),
            DepthWiseSeperable(512, 512, 1, width_multiplier),
            DepthWiseSeperable(512, 512, 1, width_multiplier),
            DepthWiseSeperable(512, 512, 1, width_multiplier),

            DepthWiseSeperable(512, 1024, 2, width_multiplier),
            DepthWiseSeperable(1024, 1024, 1, width_multiplier)
        )
        
        # Average pooling and classifier
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Sequential(
            nn.Linear(int(1024 * width_multiplier), num_classes),
        )

    def forward(self, x):
        # Resize input based on resolution_multiplier
        height, width = x.size(2), x.size(3)
        new_height, new_width = int(height * self.resolution_multiplier), int(width * self.resolution_multiplier)
        
        # Resize the input tensor
        x = nn.functional.interpolate(x, size=(new_height, new_width), mode='bilinear', align_corners=False)
        
        # Forward pass through the network
        x = self.features(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

The MobileNetV1 class takes the parameters resolution_multiplier and width_multiplier, both defaulting to 1.0.

Comparing MobileNet with Other Models

The figure below compares MobileNet V1 with other popular models.

It is about 1.5 times smaller than GoogleNet and about 32 times smaller than VGG16, and it needs roughly 3 times fewer computations than GoogleNet and about 26 times fewer than VGG16, so it should run considerably faster.

The figure below compares the depthwise separable model with a standard convolution model. Accuracy drops by only 1.1% while the number of parameters shrinks by about 7 times, which shows how effective the depthwise separable model is.

Now let's test MobileNet v1 on CIFAR-10, which contains 60000 images of size 32x32. The data-loading code is as follows:

import numpy as np
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler
import matplotlib.pyplot as plt
import time
t0 = time.time()

# Define relevant variables for the ML task
batch_size = 16
num_classes = 10
learning_rate = 0.001
num_epochs = 30

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

class CIFAR10Dataset(Dataset):
    def __init__(self, data_dir, train=True, augment=False, valid_size=0.1, random_seed=42):
        self.data_dir = data_dir
        self.train = train
        self.augment = augment
        self.valid_size = valid_size
        self.random_seed = random_seed

        self.normalize = transforms.Normalize(
            mean=[0.4914, 0.4822, 0.4465],
            std=[0.2023, 0.1994, 0.2010],
        )

        if self.train:
            if self.augment:
                self.transform = transforms.Compose([
                    transforms.RandomCrop(32, padding=4),
                    transforms.RandomHorizontalFlip(),
                    transforms.Resize((224, 224)),
                    transforms.ToTensor(),
                    self.normalize,
                ])
            else:
                self.transform = transforms.Compose([
                    transforms.Resize((224, 224)),
                    transforms.ToTensor(),
                    self.normalize,
                ])
        else:
            self.transform = transforms.Compose([
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
                self.normalize,
            ])

        self.dataset = datasets.CIFAR10(
            root=self.data_dir, train=self.train,
            download=True, transform=self.transform,
        )

        if self.train:
            num_train = len(self.dataset)
            indices = list(range(num_train))
            split = int(np.floor(self.valid_size * num_train))
            
            np.random.seed(self.random_seed)
            np.random.shuffle(indices)
            
            self.train_idx, self.valid_idx = indices[split:], indices[:split]
        
    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        return self.dataset[idx]

    def get_train_sampler(self):
        return SubsetRandomSampler(self.train_idx)

    def get_valid_sampler(self):
        return SubsetRandomSampler(self.valid_idx)

def get_data_loaders(data_dir, batch_size, augment=False, valid_size=0.1, random_seed=42):
    train_dataset = CIFAR10Dataset(data_dir, train=True, augment=augment, valid_size=valid_size, random_seed=random_seed)
    test_dataset = CIFAR10Dataset(data_dir, train=False)

    train_loader = DataLoader(
        train_dataset, batch_size=batch_size, sampler=train_dataset.get_train_sampler()
    )

    valid_loader = DataLoader(
        train_dataset, batch_size=batch_size, sampler=train_dataset.get_valid_sampler()
    )

    test_loader = DataLoader(
        test_dataset, batch_size=batch_size, shuffle=False  # no need to shuffle for evaluation
    )

    return train_loader, valid_loader, test_loader

# Usage:
train_loader, valid_loader, test_loader = get_data_loaders(
    data_dir='./data',
    batch_size=batch_size,
    augment=True,
    random_seed=42
)

And the training code is as follows:

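# CIFAR-10 has 10 classes, so size the classifier with num_classes instead of the default 1000
model = MobileNetV1(num_classes=num_classes).to(device)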
# model = ResNet(model_parameters[architecture],in_channels=3, num_classes=10).to(device)

def train_model(model, train_loader, valid_loader, criterion, optimizer, epochs=20, device='cuda'):
    training_logs = {
        "train_loss": [], "train_acc": [], "validate_loss": [], "validate_acc": []
    }

    for epoch in range(epochs):
        # Training phase
        model.train()
        running_loss = 0.0
        correct = 0
        total = 0
        
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            
            running_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        
        train_loss = running_loss / len(train_loader)
        train_accuracy = 100 * correct / total
        training_logs["train_loss"].append(train_loss)
        training_logs["train_acc"].append(train_accuracy)
        
        # Validation phase
        model.eval()
        running_loss = 0.0
        correct = 0
        total = 0
        with torch.no_grad():
            for images, labels in valid_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                loss = criterion(outputs, labels)
                running_loss += loss.item()
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        
        valid_loss = running_loss / len(valid_loader)
        valid_accuracy = 100 * correct / total
        training_logs["validate_loss"].append(valid_loss)
        training_logs["validate_acc"].append(valid_accuracy)
        
        # if epoch % 5 == 0:
        print(f'Epoch [{epoch+1}/{epochs}] :: ',end='')
        print(f'Train Loss: {train_loss:.4f}, Train Accuracy: {train_accuracy:.2f}% ',end='')
        print(f'Valid Loss: {valid_loss:.4f}, Valid Accuracy: {valid_accuracy:.2f}%')
        print('-' * 80)
    
    return training_logs

def plot_graph(training_logs):
    epochs = len(training_logs["train_loss"])
    epochs_range = range(1, epochs + 1)

    plt.figure(figsize=(12, 5))
    
    # Plot loss
    plt.subplot(1, 2, 1)
    plt.plot(epochs_range, training_logs["train_loss"], label='Train Loss')
    plt.plot(epochs_range, training_logs["validate_loss"], label='Valid Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.title('Loss vs Epochs')

    # Plot accuracy
    plt.subplot(1, 2, 2)
    plt.plot(epochs_range, training_logs["train_acc"], label='Train Accuracy')
    plt.plot(epochs_range, training_logs["validate_acc"], label='Valid Accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy (%)')
    plt.legend()
    plt.title('Accuracy vs Epochs')

    plt.tight_layout()
    plt.show()


def test_model(model, test_loader, criterion, device='cuda'):
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            running_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    
    test_loss = running_loss / len(test_loader)
    test_accuracy = 100 * correct / total
    
    print(f'Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.2f}%')
    return test_loss, test_accuracy

# model = ResNet(ResidualBlock, [3, 4, 6, 3]).to(device)

# model = ResNet(Bottleneck, [3, 8, 36, 3]).to(device)
criterion = nn.CrossEntropyLoss()
# optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=0.005, momentum=0.9)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, betas=(0.9, 0.999), eps=1e-08, weight_decay=1e-5)


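# Pass the detected device explicitly so the code also runs on a CPU-only machine
training_logs = train_model(model, train_loader, valid_loader, criterion, optimizer, epochs=num_epochs, device=device)
print(time.time() - t0)

test_loss, test_accuracy = test_model(model, test_loader, criterion, device=device)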

plot_graph(training_logs)

The result is very good: training took only 62 minutes and reached a high test accuracy of 90.88% (Core i5 12400 + RTX 3060 12G).

MobileNet V2

MobileNet v2 revises MobileNet V1 with a focus on computation cost, so we first need to look back at MobileNet v1. Its structure applies depthwise separable convolution many times while steadily increasing the number of channels: 32 -> 64 -> 128 -> 256 -> 512 -> 1024. The paper's authors explain the reasoning as follows. As the model's convolutions extract features, they capture what the authors call a manifold of interest (MOI). An MOI that lives in a low-dimensional space (one with few channels) can be embedded in a higher-dimensional one. ReLU, which discards negative values, throws some of that information away. With many dimensions, the lost information can be rebuilt or compensated for from the other dimensions, because there is plenty of data; with few dimensions, it is simply lost.

In the figure, for example, they show that in low dimensions, such as 2, 3, or 5, part of the input is lost after ReLU, while with more dimensions the data changes shape after ReLU but the information is still preserved.
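A toy example gives a feel for this argument. It is not the experiment shown in the figure: a scalar t is mapped into d dimensions with hand-picked weights (an assumption for illustration), ReLU is applied, and we check whether t can still be recovered.

```python
def relu(x: float) -> float:
    return max(x, 0.0)


def embed_and_relu(t: float, weights):
    """Map the scalar t into len(weights) dimensions, then apply ReLU."""
    return [relu(w * t) for w in weights]


samples = [-2.0, -1.0, 0.5, 3.0]

# 1 dimension: every negative input collapses to 0, so it cannot be recovered.
low = [embed_and_relu(t, [1.0]) for t in samples]
print(low)

# 2 dimensions with weights (1, -1): one coordinate keeps the positive part,
# the other keeps the negative part, so t = a - b is fully recoverable.
high = [embed_and_relu(t, [1.0, -1.0]) for t in samples]
recovered = [a - b for a, b in high]
print(high)
print(recovered)
```

With more dimensions there are more chances that some coordinate survives ReLU for each input, which is the intuition behind expanding the channels before applying a non-linearity.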

This is why the channel count has to be large and keep growing, as in MobileNet v1: 32 -> 64 -> 128 -> 256 -> 512 -> 1024. One way to lower the computation cost is to reduce the number of channels in the model; the computation then drops accordingly. To do that safely, ReLU has to go, because ReLU cuts away MOI information.

So what can be done?

Linear Bottlenecks

For this reason the developers designed a structure called Linear Bottlenecks, derived from the depthwise separable block. Let's look at the depthwise separable structure first.

It starts with a depthwise convolution, followed by BN (Batch Normalization) and then ReLU. Next comes a 1x1 pointwise convolution with the same or a larger number of dimensions, for example from 64 up to 128, to preserve the MOI after ReLU.

The new structure uses fewer channels and drops the ReLU in the last layer, which is where the word Linear comes from (ReLU is what introduces non-linearity). To avoid losing too much non-linearity, and to capture the MOI fully, the designers added one more layer, an "Expansion" layer, placed before the depthwise convolution.

The "Expansion" layer uses a 1x1 convolution to increase the dimensionality substantially (the designers' experiments found a factor of 6 to work well); a 64-channel input, for example, becomes 384 channels. The data then goes through depthwise convolution + BN + ReLU as usual, and is finally "shrunk" back down by a "Projection" layer followed by BN. Note that there is no ReLU after it. Because this last layer reduces the size, the block is called a Linear Bottleneck.

The whole operation can be drawn as the block diagram below.

Overall, the block expands the data so that as much of the MOI as possible is retained, uses the depthwise convolution to extract features, and uses the projection to squeeze the data back down.

Inverted Residuals Block

The figure above shows a line running from the top that is added back in at the bottom. Anyone familiar with ResNet will recognize it as a Residual Block: the input is added to the output so that information is preserved.

In the figure below, (a) is the Residual Block, or Bottleneck Block, of ResNet. It first compresses the data to reduce parameters, then applies the convolution, and then expands back to the original size.

The Inverted Residuals Block in (b) works the other way around: it expands the data first, applies the convolution, and then compresses back to the original size, hence the name Inverted Residuals Block.

As a concrete example, the figure shows an input of size 56x56x24 that is expanded to 56x56x144, passed through the depthwise convolution, and then reduced by a pointwise convolution back to 56x56x24.
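For the 56x56x24 example, the channel counts and the approximate multiply-adds of each stage can be worked out with a short sketch. The function name is hypothetical, padding of k // 2 is assumed for the depthwise convolution, and batch normalization and activations are ignored in the count.

```python
def inverted_residual_cost(h, w, c_in, c_out, expansion, stride=1, k=3):
    """Return hidden channels and multiply-adds of the three stages."""
    hidden = expansion * c_in
    # Spatial size after the k x k depthwise conv (padding k // 2 assumed).
    h_out = (h + 2 * (k // 2) - k) // stride + 1
    w_out = (w + 2 * (k // 2) - k) // stride + 1
    # 1x1 expansion: skipped when expansion == 1, as in the article's code.
    expand = 0 if expansion == 1 else h * w * c_in * hidden
    # k x k depthwise convolution on the expanded channels.
    depthwise = h_out * w_out * k * k * hidden
    # 1x1 linear projection back down to c_out channels.
    project = h_out * w_out * hidden * c_out
    return hidden, expand, depthwise, project


hidden, expand, depthwise, project = inverted_residual_cost(56, 56, 24, 24, 6)
print("hidden channels:", hidden)
print("expansion:", expand)
print("depthwise:", depthwise)
print("projection:", project)
print("total:", expand + depthwise + project)
```

Even though the depthwise stage runs on six times as many channels, it is the cheapest of the three stages here, which is why expanding before the depthwise convolution is affordable.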

It can be written as a program as follows:

class DepthWise_Conv(nn.Module):
    def __init__(self, in_fts, stride=(1,1)) -> None:
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_fts, in_fts, kernel_size=(3,3), stride=stride, padding=(1,1), groups=in_fts, bias=False),
            nn.BatchNorm2d(in_fts),
            nn.ReLU6(inplace=True)
        )

    def forward(self, input_image):
        x = self.conv(input_image)
        return x

class Pointwise_Conv(nn.Module):
    def __init__(self, in_fts, out_fts) -> None:
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_fts, out_fts, kernel_size=(1,1), bias=False),
            nn.BatchNorm2d(out_fts)
        )
    def forward(self, input_image):
        x = self.conv(input_image)
        return x

This is the code for the depthwise convolution and pointwise convolution parts.

# Bottleneck layer when stride is 1
class NetForStrideOne(nn.Module):
    def __init__(self, in_fts, out_fts, expansion) -> None:
        super().__init__()
        # 1x1 expansion conv; replaced by an identity when expansion == 1
        # so that no unused parameters are created
        if expansion == 1:
            self.conv1 = nn.Identity()
        else:
            self.conv1 = nn.Sequential(
                nn.Conv2d(in_fts, expansion*in_fts, kernel_size=(1,1), bias=False),
                nn.BatchNorm2d(expansion*in_fts),
                nn.ReLU6(inplace=True)
            )
        self.dw = DepthWise_Conv(expansion*in_fts)
        self.pw = Pointwise_Conv(expansion*in_fts, out_fts)

        self.in_fts = in_fts
        self.out_fts = out_fts
        self.expansion = expansion

    def forward(self, input_image):
        x = self.conv1(input_image)  # expand (or pass through when expansion == 1)
        x = self.dw(x)               # 3x3 depthwise, stride 1
        x = self.pw(x)               # 1x1 linear projection

        # Residual part: the spatial size is unchanged at stride 1, so the
        # input can be added back whenever the channel counts also match
        if self.in_fts == self.out_fts:
            x = input_image + x

        return x

# Bottleneck layer when stride is 2
class NetForStrideTwo(nn.Module):
    def __init__(self, in_fts, out_fts, expansion) -> None:
        super().__init__()
        # Same expansion logic as in NetForStrideOne
        if expansion == 1:
            self.conv1 = nn.Identity()
        else:
            self.conv1 = nn.Sequential(
                nn.Conv2d(in_fts, expansion*in_fts, kernel_size=(1,1), bias=False),
                nn.BatchNorm2d(expansion*in_fts),
                nn.ReLU6(inplace=True)
            )
        self.dw = DepthWise_Conv(expansion*in_fts, stride=(2,2))
        self.pw = Pointwise_Conv(expansion*in_fts, out_fts)

        self.expansion = expansion

    def forward(self, input_image):
        x = self.conv1(input_image)  # expand (or pass through when expansion == 1)
        x = self.dw(x)               # 3x3 depthwise, stride 2 halves height and width
        x = self.pw(x)               # 1x1 linear projection

        # No residual here: the output is spatially smaller than the input
        return x

The code above builds the Bottleneck layer, which comes in two variants: stride=1 and stride=2. With stride=1 the spatial size of the input and output is the same, so the input can be added to the result, as long as the number of channels also matches (the code checks in_fts == out_fts). With stride=2 the input and output have different sizes and cannot be added together, so the block simply runs the normal steps without the skip connection.
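
To see why the skip connection only works at stride 1, the small sketch below applies the usual convolution output-size formula to the 3x3 depthwise step (kernel 3, padding 1, as in DepthWise_Conv). The helper names `conv_out_size` and `can_add_residual` are hypothetical and exist only for this illustration.

```python
def conv_out_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution without dilation."""
    return (size + 2 * padding - kernel) // stride + 1

def can_add_residual(in_shape, out_shape):
    """Element-wise addition needs identical (height, width, channels)."""
    return in_shape == out_shape

in_shape = (56, 56, 24)
for stride in (1, 2):
    side = conv_out_size(56, stride=stride)
    out_shape = (side, side, 24)
    print(f"stride={stride}: {in_shape} -> {out_shape}, "
          f"residual possible: {can_add_residual(in_shape, out_shape)}")
```

With stride 1 the block keeps 56x56 and the addition is possible; with stride 2 it produces 28x28, so the shapes no longer line up.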

MobileNetV2 architecture

All right! Now that we understand the key building blocks of MobileNetV2, let's look at its architecture.

From the table, the network consists of the following steps:

Apply a 3x3 convolution with 32 channels to the 224x224 input, giving 112x112x32
Apply a bottleneck layer with 16 channels, without expansion (t=1), giving 112x112x16
Apply a bottleneck layer with 24 channels, 2 times, with expansion (t=6), giving 56x56x24
Apply a bottleneck layer with 32 channels, 3 times, with expansion (t=6), giving 28x28x32
Apply a bottleneck layer with 64 channels, 4 times, with expansion (t=6), giving 14x14x64
Apply a bottleneck layer with 96 channels, 3 times, with expansion (t=6), giving 14x14x96
Apply a bottleneck layer with 160 channels, 3 times, with expansion (t=6), giving 7x7x160
Apply a bottleneck layer with 320 channels, 1 time, with expansion (t=6), giving 7x7x320
Apply a 1x1 convolution with 1280 channels to the 7x7x320 input, giving 7x7x1280
Apply average pooling, giving 1x1x1280, followed by a final 1x1 convolution that outputs the class scores
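
The shapes in these steps can be reproduced with a few lines of plain Python. This is only a sketch: `trace_shapes` is a made-up helper, the stage list mirrors the (t, c, n, s) rows of the table, and it assumes every strided layer is a 3x3 convolution with padding 1.

```python
# (expansion t, output channels c, repeats n, stride s), one tuple per stage
CONFIG = [
    (1, 16, 1, 1), (6, 24, 2, 2), (6, 32, 3, 2), (6, 64, 4, 2),
    (6, 96, 3, 1), (6, 160, 3, 2), (6, 320, 1, 1),
]

def halve(size):
    """Output size of a 3x3 convolution with stride 2 and padding 1."""
    return (size + 2 * 1 - 3) // 2 + 1

def trace_shapes(input_size=224):
    """Return the (height, width, channels) shape after each step."""
    size = halve(input_size)            # first 3x3 conv uses stride 2
    shapes = [(size, size, 32)]
    for t, c, n, s in CONFIG:
        if s == 2:                      # only the first block of a stage downsamples
            size = halve(size)
        shapes.append((size, size, c))
    shapes.append((size, size, 1280))   # 1x1 conv to 1280 channels
    shapes.append((1, 1, 1280))         # global average pooling
    return shapes

for shape in trace_shapes():
    print("x".join(str(v) for v in shape))
```

The printed list runs from 112x112x32 down to 1x1x1280, in the same order as the steps listed here.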

This can be drawn as the following flowchart:

and in more detail as follows:

It can be written in code as follows:

import torch
import torch.nn as nn
from collections import OrderedDict

class MobileNet_v2(nn.Module):
    def __init__(self, bottleneckLayerDetails, in_fts=3, numClasses=10, width_multiplier=1) -> None:
        super().__init__()
        self.bottleneckLayerDetails = bottleneckLayerDetails
        self.width_multiplier = width_multiplier

        self.conv1 = nn.Sequential(
            nn.Conv2d(in_fts, round(width_multiplier*32), kernel_size=(3,3), stride=(2,2), padding=(1,1), bias=False),
            nn.BatchNorm2d(round(width_multiplier*32)),
            nn.ReLU6(inplace=True)
        )
        self.in_fts = round(width_multiplier*32)
        
        # Defined bottleneck layer as per Table 2
        self.layerConstructed = self.constructLayer()

        # Top layers after bottleneck
        self.feature = nn.Sequential(
            nn.Conv2d(self.in_fts, round(width_multiplier*1280), kernel_size=(1,1), bias=False),
            nn.BatchNorm2d(round(width_multiplier*1280)),
            nn.ReLU6(inplace=True)
        )

        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))

        self.outputLayer = nn.Sequential(
            nn.Dropout2d(),
            nn.Conv2d(round(width_multiplier*1280), numClasses, kernel_size=(1,1)),
        )

    def forward(self, input_image):
        x = self.conv1(input_image)
        x = self.layerConstructed(x)
        x = self.feature(x)
        x = self.avgpool(x)
        x = self.outputLayer(x)
        x = torch.flatten(x, 1)  # Flatten the output to [batch size, num_classes]
        return x

    # Build the bottleneck stages from the layer details (Table 2)
    def constructLayer(self):
        itemIndex = 0
        block = OrderedDict()
        # Iterate over the stage definitions
        for lItem in self.bottleneckLayerDetails:
            # Unpack expansion factor, output channels, repeat count and stride
            t, out_fts, n, stride = lItem
            # Scale the output channels by the width multiplier
            out_fts = round(self.width_multiplier*out_fts)
            # Stride 1: every block in the stage is a NetForStrideOne
            if stride == 1:
                for nItem in range(n):
                    block[str(itemIndex)+"_"+str(nItem)] = NetForStrideOne(self.in_fts, out_fts, t)
                    self.in_fts = out_fts
            # Stride 2: only the first block of the stage downsamples
            elif stride == 2:
                block[str(itemIndex)+"_"+str(0)] = NetForStrideTwo(self.in_fts, out_fts, t)
                self.in_fts = out_fts
                # The remaining (n-1) blocks are NetForStrideOne modules
                for nItem in range(1,n):
                    block[str(itemIndex)+"_"+str(nItem)] = NetForStrideOne(self.in_fts, out_fts, t)
            itemIndex += 1

        return nn.Sequential(block)

bottleneckLayerDetails = [
        # (expansion, out_dimension, number_of_times, stride)
            (1,16,1,1),
            (6,24,2,2),
            (6,32,3,2),
            (6,64,4,2),
            (6,96,3,1),
            (6,160,3,2),
            (6,320,1,1)
        ]

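# Pick the GPU when available (safe to repeat if `device` is already defined)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MobileNet_v2(bottleneckLayerDetails, width_multiplier=1).to(device)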
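
The `width_multiplier` argument scales every layer width with `round(width_multiplier * channels)`. The sketch below shows what that does to the channel counts for a few values; `BASE_CHANNELS` and `scaled_channels` are names invented for this illustration. Note that some reference implementations, torchvision for example, additionally round the result to a multiple of 8, which this article's code does not do.

```python
# Layer widths at width_multiplier = 1: first conv, seven stages, last 1x1 conv
BASE_CHANNELS = [32, 16, 24, 32, 64, 96, 160, 320, 1280]

def scaled_channels(width_multiplier):
    """Apply round(width_multiplier * c) to each layer width."""
    return [round(width_multiplier * c) for c in BASE_CHANNELS]

for alpha in (1.0, 0.75, 0.5):
    print(alpha, scaled_channels(alpha))
```

A smaller multiplier shrinks every layer, so both the parameter count and the compute drop, at the cost of some accuracy.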

Compared with MobileNetV1, you can see that the number of channels and the memory MobileNetV2 uses are roughly 4 times smaller.

The number of parameters drops to about 3.4 million, the number of operations (multiply-adds) drops from 575 million to 300 million, and the running time drops to 66 percent of MobileNetV1's.
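
The parameter figure can be sanity-checked by hand. The sketch below counts the trainable parameters implied by the architecture table, under a few assumptions that are mine rather than the article's: 1000 output classes (ImageNet), bias-free convolutions, two BatchNorm parameters (scale and shift) per channel, and no expansion layer in the t=1 block. `conv_bn` and `count_mobilenet_v2_params` are hypothetical helper names.

```python
# (expansion t, output channels c, repeats n, stride s), one tuple per stage
CONFIG = [
    (1, 16, 1, 1), (6, 24, 2, 2), (6, 32, 3, 2), (6, 64, 4, 2),
    (6, 96, 3, 1), (6, 160, 3, 2), (6, 320, 1, 1),
]

def conv_bn(in_c, out_c, k, groups=1):
    """Weights of a bias-free k x k convolution plus BatchNorm scale and shift."""
    return (in_c // groups) * out_c * k * k + 2 * out_c

def count_mobilenet_v2_params(num_classes=1000):
    total = conv_bn(3, 32, 3)                 # first 3x3 convolution
    in_c = 32
    for t, c, n, s in CONFIG:                 # stride does not affect the count
        for _ in range(n):
            hidden = t * in_c
            if t != 1:                        # 1x1 expansion (skipped when t == 1)
                total += conv_bn(in_c, hidden, 1)
            total += conv_bn(hidden, hidden, 3, groups=hidden)  # 3x3 depthwise
            total += conv_bn(hidden, c, 1)    # 1x1 linear projection
            in_c = c
    total += conv_bn(in_c, 1280, 1)           # last 1x1 convolution
    total += 1280 * num_classes + num_classes # classifier weights and biases
    return total

print(count_mobilenet_v2_params())  # 3504872
```

The result, roughly 3.5 million, is in the same range as the figure quoted for MobileNetV2. With the 10-class default used in this article's code, the classifier shrinks and the total is considerably lower.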

Accuracy also improves slightly, as shown in the figure below.

Summary

All right! At this point we have covered the architecture and inner workings of MobileNet v1, which uses Depth-wise Separable Convolution to make the computation more efficient, and MobileNet v2, which adds the Inverted Residual Block and Linear Bottlenecks. Both show a clear effort to shrink the model and reduce computational cost. That is not the end of the story, because Google has kept searching for better models, but I will leave that for the next article since this one is already very long.
