Discover duplicated files

Problem
I have a directory where I store files: articles, images, notes, all in a somewhat unordered and random fashion: I copy them from other devices, generate files required for teaching sessions, or create subdirectories for experimentation.
As a result, this directory is messy and contains a lot of duplicated files. Some of them are quite big (think: .epub or .pdf, not Linux distro .iso).
So, I thought, this is a good candidate for a practical problem-solving exercise combined with language learning practice for Rust and Scala.
The idea is simple: scan the directory recursively, calculate the SHA-256 sum of each file, put the sum and path into a hash map, and print those sums for which there is more than one path entry.
Experiment
I decided to write this in both Scala and Rust.
Results
In Scala I use a case class representing a path with its SHA sum.
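A minimal sketch, with FileSha as my placeholder name:

```scala
import java.nio.file.Path

// A file path paired with the hex-encoded SHA-256 of its content.
case class FileSha(path: Path, sha: String)
```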
Scala with external process called per file
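A minimal sketch of this variant, assuming it shells out to the standard sha256sum binary once per file and groups results with the FileSha case class from above (shaViaProcess and run are names I made up):

```scala
//> using scala 3
import java.nio.file.{Files, Path, Paths}
import scala.jdk.CollectionConverters.*
import scala.sys.process.*

// Shell out to sha256sum and block until it finishes.
// Its stdout is "<hex digest>  <path>", so keep the first token.
def shaViaProcess(path: Path): String =
  Seq("sha256sum", path.toString).!!.trim.split("\\s+").head

@main def run(dir: String): Unit =
  val files = Files.walk(Paths.get(dir)).iterator().asScala
    .filter(Files.isRegularFile(_))
    .map(p => FileSha(p, shaViaProcess(p)))
    .toList
  // Group by digest and print the sums shared by more than one path.
  for (sha, group) <- files.groupBy(_.sha) if group.size > 1 do
    println(sha)
    group.foreach(f => println(s"  ${f.path}"))
```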
I build the Scala version with:

```
scala-cli --power package Main.scala -o books
```

and run it with:

```
time ./books /home/karma/Dokumenty/books/
```
Timing, Scala with an external process call for each file:
Scala using MessageDigest:
Here the content read from the path is fed to MessageDigest's update method.
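One way this helper could look, streaming the file through the digest in chunks (the name shaViaMessageDigest is mine):

```scala
import java.nio.file.{Files, Path}
import java.security.MessageDigest

// Read the file in chunks and feed each chunk to update(),
// so large files never have to fit in memory at once.
def shaViaMessageDigest(path: Path): String =
  val md = MessageDigest.getInstance("SHA-256")
  val in = Files.newInputStream(path)
  try
    val buf = new Array[Byte](64 * 1024)
    var n = in.read(buf)
    while n != -1 do
      md.update(buf, 0, n)
      n = in.read(buf)
  finally in.close()
  md.digest().map("%02x".format(_)).mkString
```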
Result:
Rust
The SHA calculation uses the sha2 crate.
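A plausible shape for this function, following the common sha2-plus-io::copy streaming pattern (the name sha256_of is mine):

```rust
use sha2::{Digest, Sha256};
use std::{fs::File, io, path::Path};

// Sha256 implements io::Write, so io::copy can stream the whole
// file through the hasher without reading it into memory first.
fn sha256_of(path: &Path) -> io::Result<String> {
    let mut file = File::open(path)?;
    let mut hasher = Sha256::new();
    io::copy(&mut file, &mut hasher)?;
    Ok(format!("{:x}", hasher.finalize()))
}
```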
The program iterates over directory entries using the walkdir crate. The ShaStore data structure mimics the hash-map approach from Scala.
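A sketch of how ShaStore and the walkdir loop could fit together, reusing sha256_of from the previous snippet; the field and method names here are my assumptions:

```rust
use std::collections::HashMap;
use std::path::PathBuf;
use walkdir::WalkDir;

// Digest -> every path whose content hashes to it,
// mirroring the Scala groupBy on the SHA sum.
struct ShaStore {
    by_sha: HashMap<String, Vec<PathBuf>>,
}

impl ShaStore {
    fn new() -> Self {
        ShaStore { by_sha: HashMap::new() }
    }

    fn insert(&mut self, sha: String, path: PathBuf) {
        self.by_sha.entry(sha).or_default().push(path);
    }

    // Print only the digests shared by more than one path.
    fn print_duplicates(&self) {
        for (sha, paths) in &self.by_sha {
            if paths.len() > 1 {
                println!("{sha}");
                for p in paths {
                    println!("  {}", p.display());
                }
            }
        }
    }
}

fn main() {
    let dir = std::env::args().nth(1).expect("usage: books <dir>");
    let mut store = ShaStore::new();
    for entry in WalkDir::new(&dir).into_iter().filter_map(Result::ok) {
        if entry.file_type().is_file() {
            if let Ok(sha) = sha256_of(entry.path()) {
                store.insert(sha, entry.path().to_path_buf());
            }
        }
    }
    store.print_duplicates();
}
```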
Timing, Rust with SHA calculation:
First conclusion
Calling a subprocess and waiting for it synchronously is huge overkill. In-process digest calculation turns out to be very fast, and the difference between Scala and Rust is not very significant. Well, most of the time we're just accessing the filesystem and calculating the SHA.
A few observations:
- individual SHA calculations are independent and can be done in parallel
- what data structure in Scala is safe for concurrent access?
- in Scala I can try to spawn a Future; in Rust I could spawn threads

Well, perhaps it is a good moment to try and introduce some concurrency? A first sketch of the Scala side is below.
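Here is a minimal sketch of the Future-per-file idea, with java.util.concurrent.ConcurrentHashMap as one possible answer to the concurrent-access question; it reuses shaViaMessageDigest from above, and all other names are mine:

```scala
import java.nio.file.Path
import java.util.concurrent.ConcurrentHashMap
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// One Future per file; merge() appends to the per-digest list
// atomically, so concurrent writers don't step on each other.
def shaAllParallel(paths: List[Path]): ConcurrentHashMap[String, List[Path]] =
  val store = new ConcurrentHashMap[String, List[Path]]()
  val futures = paths.map { p =>
    Future {
      store.merge(shaViaMessageDigest(p), List(p), (a, b) => a ++ b)
    }
  }
  Await.result(Future.sequence(futures), Duration.Inf)
  store
```

Whether this actually helps will depend on how much of the wall time is spent on filesystem access rather than hashing.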