Anonymous 17:46

How to find duplicate images, possibly with different resolutions?

I have a set of images (photos) that contains duplicates, but no matter how I sort them the duplicates end up scattered, because they differ in resolution and are named irregularly.

I tried gm compare but can't figure out which metric to use or which values would indicate a match.

Here's an example with two images that look exactly the same, except that the second one is at 2x resolution (better quality):

gm compare -metric MAE "7920068.jpg" "7920034.jpg"
gm compare -metric MSE "7920068.jpg" "7920034.jpg"
gm compare -metric PAE "7920068.jpg" "7920034.jpg"
gm compare -metric PSNR "7920068.jpg" "7920034.jpg"
gm compare -metric RMSE "7920068.jpg" "7920034.jpg"

Image Difference (MeanAbsoluteError):
           Normalized    Absolute
          ============  ==========
     Red: 0.1751787015    11480.3
   Green: 0.1168407563     7657.2
    Blue: 0.0029600541      194.0
   Total: 0.0983265040     6443.8

Image Difference (MeanSquaredError):
           Normalized    Absolute
          ============  ==========
     Red: 0.0910979679     5970.1
   Green: 0.0274231091     1797.2
    Blue: 0.0000203617        1.3
   Total: 0.0395138129     2589.5

Image Difference (PeakAbsoluteError):
           Normalized    Absolute
          ============  ==========
     Red: 1.0000000000    65535.0
   Green: 0.7803921569    51143.0
    Blue: 0.0784313725     5140.0
   Total: 1.0000000000    65535.0

Image Difference (PeakSignalToNoiseRatio):
           PSNR
          ======
     Red: 10.40
   Green: 15.62
    Blue: 46.91
   Total: 14.03

Image Difference (RootMeanSquaredError):
           Normalized    Absolute
          ============  ==========
     Red: 0.3018243991    19780.1
   Green: 0.1655992426    10852.5
    Blue: 0.0045123979      295.7
   Total: 0.1987808163    13027.1

With GraphicsMagick's identify I found these values:

          |image a        |image a @2x    |image b
Red:
  Minimum:|  0.00 (0.0000)|  0.00 (0.0000)|  0.00 (0.0000)
  Maximum:|255.00 (1.0000)|255.00 (1.0000)|255.00 (1.0000)
  Mean:   |175.81 (0.6894)|176.00 (0.6902)|117.79 (0.4619)
  Std Dev:| 65.59 (0.2572)| 65.73 (0.2577)| 61.55 (0.2414)
Green:
  Minimum:|  0.00 (0.0000)|  0.00 (0.0000)|  0.00 (0.0000)
  Maximum:|255.00 (1.0000)|255.00 (1.0000)|255.00 (1.0000)
  Mean:   |161.58 (0.6336)|162.47 (0.6371)| 99.07 (0.3885)
  Std Dev:| 71.14 (0.2790)| 71.26 (0.2794)| 64.94 (0.2547)
Blue:
  Minimum:|  0.00 (0.0000)|  0.00 (0.0000)|  0.00 (0.0000)
  Maximum:|255.00 (1.0000)|255.00 (1.0000)|255.00 (1.0000)
  Mean:   |153.59 (0.6023)|153.27 (0.6010)|104.50 (0.4098)
  Std Dev:| 71.65 (0.2810)| 71.67 (0.2811)| 60.09 (0.2357)

It looks like I can use these values for comparison: the two image a files have very similar values to each other, while image b differs clearly. I just need a good threshold to indicate what might be a match.
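
One way to turn that observation into a check is to compare the per-channel mean and standard deviation of two images and treat them as a candidate match when the largest gap is small. The sketch below uses the numbers from the table above. The helper names and the threshold of 2.0 are assumptions for illustration, not tuned values, and similar statistics can only suggest a duplicate, since unrelated images may share similar averages.

```python
# Per-channel (red, green, blue) statistics on the 0-255 scale,
# copied from the gm identify table above.
STATS = {
    "image a": {"mean": (175.81, 161.58, 153.59), "std": (65.59, 71.14, 71.65)},
    "image a @2x": {"mean": (176.00, 162.47, 153.27), "std": (65.73, 71.26, 71.67)},
    "image b": {"mean": (117.79, 99.07, 104.50), "std": (61.55, 64.94, 60.09)},
}

# Hypothetical cut-off (an assumption); it would need tuning on a real collection.
THRESHOLD = 2.0


def max_stat_gap(first, second):
    """Return the largest absolute difference over all channel means and std devs."""
    return max(
        abs(x - y)
        for key in ("mean", "std")
        for x, y in zip(first[key], second[key])
    )


def is_candidate_match(first, second, threshold=THRESHOLD):
    """True when every channel statistic differs by at most `threshold`."""
    return max_stat_gap(first, second) <= threshold


reference = STATS["image a"]
for name in ("image a @2x", "image b"):
    gap = max_stat_gap(reference, STATS[name])
    verdict = is_candidate_match(reference, STATS[name])
    print(f"image a vs {name}: max gap {gap:.2f}, match={verdict}")
# image a vs image a @2x: max gap 0.89, match=True
# image a vs image b: max gap 62.51, match=False
```

With these numbers the resized copy stays within one unit on every statistic, while the unrelated image is off by more than 60 on the green mean, so any threshold between those extremes separates them in this example.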

I'll use these images as an example:

  1. different image BOSS8
  2. subject image BOSS1
  3. subject image at half size BOSS12

And here is their gm identify -verbose output:

$ gm identify -verbose BOSS-1.jpg
Image: BOSS-1.jpg
  Format: JPEG (Joint Photographic Experts Group JFIF format)
  Geometry: 591x1049
  Class: DirectClass
  Type: true color
  Depth: 8 bits-per-pixel component
  Channel Depths:
    Red:      8 bits
    Green:    8 bits
    Blue:     8 bits
  Channel Statistics:
    Red:
      Minimum:                     7.00 (0.0275)
      Maximum:                   255.00 (1.0000)
      Mean:                       89.97 (0.3528)
      Standard Deviation:         79.68 (0.3125)
    Green:
      Minimum:                    11.00 (0.0431)
      Maximum:                   255.00 (1.0000)
      Mean:                      108.55 (0.4257)
      Standard Deviation:         70.34 (0.2758)
    Blue:
      Minimum:                     8.00 (0.0314)
      Maximum:                   255.00 (1.0000)
      Mean:                      126.50 (0.4961)
      Standard Deviation:         68.28 (0.2678)
  Resolution: 72x72 pixels
  Filesize: 129.6Ki
  Interlace: No
  Orientation: Unknown
  Background Color: white
  Border Color: #DFDFDF
  Matte Color: #BDBDBD
  Page geometry: 591x1049+0+0
  Compose: Over
  Dispose: Undefined
  Iterations: 0
  Compression: JPEG
  JPEG-Quality: 93
  JPEG-Colorspace: 2
  JPEG-Colorspace-Name: RGB
  JPEG-Sampling-factors: 2x2,1x1,1x1
  Signature: 06a764225a290be783b0b3b90c72356f71b0032af8f58e88857c33d6e59b8ccc
  Profile-EXIF: 74 bytes
    Exif Offset: 26
    Color Space: 1
    Exif Image Width: 591
    Exif Image Length: 1049
  Tainted: False
  Elapsed Time: 0m:0.011805s
  Pixels Per Second: 50.1Mi

$ gm identify -verbose BOSS-1-50.jpg
Image: BOSS-1-50.jpg
  Format: JPEG (Joint Photographic Experts Group JFIF format)
  Geometry: 296x525
  Class: DirectClass
  Type: true color
  Depth: 8 bits-per-pixel component
  Channel Depths:
    Red:      8 bits
    Green:    8 bits
    Blue:     8 bits
  Channel Statistics:
    Red:
      Minimum:                     7.00 (0.0275)
      Maximum:                   255.00 (1.0000)
      Mean:                       89.34 (0.3504)
      Standard Deviation:         78.83 (0.3091)
    Green:
      Minimum:                    12.00 (0.0471)
      Maximum:                   255.00 (1.0000)
      Mean:                      107.87 (0.4230)
      Standard Deviation:         70.29 (0.2756)
    Blue:
      Minimum:                    14.00 (0.0549)
      Maximum:                   255.00 (1.0000)
      Mean:                      125.77 (0.4932)
      Standard Deviation:         68.19 (0.2674)
  Resolution: 72x72 pixels
  Filesize: 44.2Ki
  Interlace: No
  Orientation: Unknown
  Background Color: white
  Border Color: #DFDFDF
  Matte Color: #BDBDBD
  Page geometry: 296x525+0+0
  Compose: Over
  Dispose: Undefined
  Iterations: 0
  Compression: JPEG
  JPEG-Quality: 93
  JPEG-Colorspace: 2
  JPEG-Colorspace-Name: RGB
  JPEG-Sampling-factors: 2x2,1x1,1x1
  Signature: 2c12437d162d8bf92ad49497e2644ca3a5edd9d3c8947d44445a5923565123cc
  Profile-EXIF: 74 bytes
    Exif Offset: 26
    Color Space: 1
    Exif Image Width: 296
    Exif Image Length: 525
  Tainted: False
  Elapsed Time: 0m:0.002051s
  Pixels Per Second: 72.3Mi

$ gm identify -verbose BOSS-8.jpg   
Image: BOSS-8.jpg
  Format: JPEG (Joint Photographic Experts Group JFIF format)
  Geometry: 584x1050
  Class: DirectClass
  Type: true color
  Depth: 8 bits-per-pixel component
  Channel Depths:
    Red:      8 bits
    Green:    8 bits
    Blue:     8 bits
  Channel Statistics:
    Red:
      Minimum:                     0.00 (0.0000)
      Maximum:                   255.00 (1.0000)
      Mean:                       91.51 (0.3589)
      Standard Deviation:         85.21 (0.3341)
    Green:
      Minimum:                     0.00 (0.0000)
      Maximum:                   255.00 (1.0000)
      Mean:                      110.18 (0.4321)
      Standard Deviation:         83.58 (0.3278)
    Blue:
      Minimum:                     0.00 (0.0000)
      Maximum:                   255.00 (1.0000)
      Mean:                      132.97 (0.5214)
      Standard Deviation:         87.69 (0.3439)
  Resolution: 72x72 pixels
  Filesize: 180.5Ki
  Interlace: No
  Orientation: Unknown
  Background Color: white
  Border Color: #DFDFDF
  Matte Color: #BDBDBD
  Page geometry: 584x1050+0+0
  Compose: Over
  Dispose: Undefined
  Iterations: 0
  Compression: JPEG
  JPEG-Quality: 93
  JPEG-Colorspace: 2
  JPEG-Colorspace-Name: RGB
  JPEG-Sampling-factors: 2x2,1x1,1x1
  Signature: 9d12ad4d93d1c8d219d41ef9755984bcb151a8de502c70279aea4b69202c99d1
  Profile-EXIF: 74 bytes
    Exif Offset: 26
    Color Space: 1
    Exif Image Width: 584
    Exif Image Length: 1050
  Tainted: False
  Elapsed Time: 0m:0.016498s
  Pixels Per Second: 35.4Mi
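
In the three outputs above, the half-size copy differs from BOSS-1.jpg by less than 1 on every channel mean and standard deviation, while BOSS-8.jpg has fairly close means (gaps of 1.54, 1.63 and 6.47) but clearly different standard deviations (gaps of 5.53, 13.24 and 19.41). So the standard deviation seems worth checking alongside the mean. Below is a minimal sketch of pulling those two values out of the verbose output. The function names are hypothetical, the parser assumes the RGB layout shown above, and it has not been checked against other formats such as grayscale images.

```python
import re
import subprocess

# Matches lines such as "      Mean:                       89.97 (0.3528)".
STAT_LINE = re.compile(r"^\s*(Mean|Standard Deviation):\s+([0-9.]+)")


def parse_channel_stats(identify_output):
    """Extract {channel: {"Mean": x, "Standard Deviation": y}} from identify text."""
    stats = {}
    channel = None
    for line in identify_output.splitlines():
        stripped = line.strip()
        # A bare "Red:" / "Green:" / "Blue:" line opens a channel statistics block;
        # the "Channel Depths" lines ("Red:      8 bits") do not match.
        if stripped in ("Red:", "Green:", "Blue:"):
            channel = stripped[:-1]
            stats[channel] = {}
            continue
        match = STAT_LINE.match(line)
        if match and channel is not None:
            stats[channel][match.group(1)] = float(match.group(2))
    return stats


def identify_stats(path):
    """Run gm identify -verbose on a file (requires GraphicsMagick on PATH)."""
    result = subprocess.run(
        ["gm", "identify", "-verbose", path],
        capture_output=True, text=True, check=True,
    )
    return parse_channel_stats(result.stdout)


# Excerpt of the BOSS-1.jpg output above, used here instead of a live gm call.
SAMPLE = """\
  Channel Depths:
    Red:      8 bits
  Channel Statistics:
    Red:
      Minimum:                     7.00 (0.0275)
      Maximum:                   255.00 (1.0000)
      Mean:                       89.97 (0.3528)
      Standard Deviation:         79.68 (0.3125)
    Green:
      Minimum:                    11.00 (0.0431)
      Maximum:                   255.00 (1.0000)
      Mean:                      108.55 (0.4257)
      Standard Deviation:         70.34 (0.2758)
"""

print(parse_channel_stats(SAMPLE))
```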


Top Answer/Comment:

You can try normalizing the images by resizing them to a square aspect ratio at a known resolution. Comparing the normalized images gives quite low absolute values (~100) for the MSE metric when they show the same picture, and values in the thousands when they do not:

$ gm convert -geometry 1000x1000! same-big.jpg norm-same-big.jpg
$ gm convert -geometry 1000x1000! same-small.jpg norm-same-small.jpg
$ gm convert -geometry 1000x1000! different.jpg norm-different.jpg

$ gm compare -metric mse norm-same-big.jpg norm-same-small.jpg
Image Difference (MeanSquaredError):
           Normalized    Absolute
          ============  ==========
     Red: 0.0015487693      101.5
   Green: 0.0009830381       64.4
    Blue: 0.0015041910       98.6
   Total: 0.0013453328       88.2

$ gm compare -metric mse norm-same-big.jpg norm-different.jpg
Image Difference (MeanSquaredError):
           Normalized    Absolute
          ============  ==========
     Red: 0.0829284628     5434.7
   Green: 0.0682458298     4472.5
    Blue: 0.0753763994     4939.8
   Total: 0.0755168974     4949.0

You could easily turn this into a script that takes two filenames, normalizes them, compares the normalized images, and then reports back the original filenames if the difference is close enough.
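
A minimal sketch of such a script, in Python, might look like the following. It shells out to the same gm convert and gm compare commands shown above. The function names and the threshold of 500 are assumptions for illustration: the threshold simply sits between the 88.2 and 4949.0 totals from the two runs above and would need tuning on a real collection.

```python
import os
import re
import subprocess
import sys
import tempfile

# Hypothetical cut-off on the "Total" absolute MSE (an assumption); the runs
# above gave 88.2 for the same picture and 4949.0 for a different one.
MSE_THRESHOLD = 500.0

TOTAL_LINE = re.compile(r"^\s*Total:\s+([0-9.]+)\s+([0-9.]+)", re.MULTILINE)


def parse_total_mse(compare_output):
    """Return (normalized, absolute) from the Total row of gm compare output."""
    match = TOTAL_LINE.search(compare_output)
    if match is None:
        raise ValueError("no Total row found in gm compare output")
    return float(match.group(1)), float(match.group(2))


def normalize(src, dst, size=1000):
    """Resize src to size x size, ignoring aspect ratio, as in the commands above."""
    subprocess.run(
        ["gm", "convert", "-geometry", f"{size}x{size}!", src, dst], check=True
    )


def total_mse(first, second):
    """Normalize both files into a temporary directory and compare them."""
    with tempfile.TemporaryDirectory() as tmp:
        norm_first = os.path.join(tmp, "norm-first.jpg")
        norm_second = os.path.join(tmp, "norm-second.jpg")
        normalize(first, norm_first)
        normalize(second, norm_second)
        result = subprocess.run(
            ["gm", "compare", "-metric", "mse", norm_first, norm_second],
            capture_output=True, text=True,
        )
    # Search both streams rather than assuming which one gm writes to.
    return parse_total_mse(result.stdout + result.stderr)[1]


def main(argv):
    if len(argv) != 3:
        print(f"usage: {argv[0]} IMAGE1 IMAGE2")
        return
    # Report the original filenames only when the difference is small enough.
    if total_mse(argv[1], argv[2]) <= MSE_THRESHOLD:
        print(argv[1], argv[2])


if __name__ == "__main__":
    main(sys.argv)
```

Saved under a name of your choice (say finddup.py, a made-up name), it would be run as python finddup.py same-big.jpg same-small.jpg and print both filenames when the pair falls under the threshold. Checking every pair in a large folder this way grows quadratically, so for big collections it may help to normalize each file once and reuse the results.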
