Discover duplicated files

2026-03-29 593 words 3 minutes

/en/posts/2026-03-29_duplicate_files_detection/image.png

Contents

Problem

I have a directory where I store files: articles, images, notes - in a bit unordered and random fashion: I copy them from other devices, generate some files requried for teaching sessions or create subdirectrories for experimentation.

As a result, this directory is messy and contains a lot of duplicated files. Some of them are quite big (think: .epub or .pdf, not linux distro .iso).

So, I thought, this is a good candiate for a practical problem solving exercise joined with language learning practice for Rust and Scala.

The idea is simple: I will scan the directory recursively and for each file I will calculate the SHA-256 sum and put the sum and path to a hash map, printing those sums for which there is more than one path entry.

Experiment

I decided to write this in scala and in rust.

Results

In Scala I use following case class representing a path with its sha sum:

1

case class ShaRes(val sha: String, p: os.Path)

Scala with external process called per file

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


private def usingExternal(p: os.Path): Option[ShaRes] = { 
    val res: os.CommandResult = os.proc("shasum", "-a", "256", p.toString).call()
    List(res.out.text().split("\\s+") *) match {
      case md :: rest => Some(ShaRes(md, p))
      case v => {
        println(Cons.err(s"Could not parse $p"))
        None
      }
    }
  }

I build scala with: scala-cli --power package Main.scala -o books and run it with time ./books /home/karma/Dokumenty/books/

Scala with external process call for each file:

1
2
3


real    0m36,687s
user    0m35,232s
sys     0m4,635s

Scala using `MessageDigest`:

Here the path is passed to MessageDigets’ update method:

1
2
3
4
5
6
7
8
9


private def usingDigest(p: os.Path): Option[ShaRes] = {
    import java.nio.file.Files
    import java.security.MessageDigest
    
	val md = MessageDigest.getInstance("SHA-256")
    md.update(Files.readAllBytes(p.wrapped))
    val sha = md.digest().map(b => f"$b%02x").mkString
    Some(ShaRes(sha, p))
  }

Result:

1
2
3


real    0m5,642s
user    0m5,899s
sys     0m1,136s

Rust

Sha calculation uses sha2 crate:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


fn sha_read<T: Read>(mut r: T) -> io::Result<String> {
    let mut sha256 = Sha256::new();
    io::copy(&mut r, &mut sha256)?;
    let hash = sha256.finalize();
    let result = format!("{:x}", hash);
    Ok(result.to_string())
}
pub fn sha(fname: &str) -> io::Result<String> {
    let file = File::open(fname)?;
    sha_read(file)
}

and iterates over directory entries using walkdir crate. ShaStore datastructure aims to mimic hashmap approach from scala:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31


#[derive(Debug)]
struct ShaStore<'a> {
    shas: HashMap<String, Vec<&'a Path>>,
    errors: Vec<String>,
}
impl<'a> ShaStore<'a> {
    fn new() -> Self {
        ShaStore {
            shas: HashMap::new(),
            errors: Vec::new(),
        }
    }
    fn update(&mut self, pref: &'a Path) -> io::Result<()> {
        let shares = pref.to_str().map(|p| sha(p));
        match shares {
            Some(Ok(shaval)) => {
                if let Some(v) = self.shas.get_mut(&shaval) {
                    v.push(pref);
                } else {
                    self.shas.insert(shaval.clone(), vec![pref]);
                }
            }
            Some(Err(e)) => {
                self.errors.push(e.to_string());
            }
            None => self.errors.push("No sha".to_string()),
        };

        Ok(())
    }
}

Rust with calculation:

1
2
3


real    0m4,796s
user    0m3,846s
sys     0m0,704s

First conclusion

Calling a subprocess and wait synchronously is a huge overkill. It seems that digest calculation in-process is really very fast - the difference between scala and rust is not very significant. Well, most of the time we’re just accessing filesystem and calculating sha.

individual sha calculations are independent and can be done in parallel
what datastructure in Scala is safe for concurrent acces?
in scala I can try to spawn a Future; in Rust I could Well, perhaps it is a good moment to try and introuduce some concurrency?