Contents

Discover duplicated files

Problem

I have a directory where I store files: articles, images, notes - in a bit unordered and random fashion: I copy them from other devices, generate some files requried for teaching sessions or create subdirectrories for experimentation.

As a result, this directory is messy and contains a lot of duplicated files. Some of them are quite big (think: .epub or .pdf, not linux distro .iso).

So, I thought, this is a good candiate for a practical problem solving exercise joined with language learning practice for Rust and Scala.

The idea is simple: I will scan the directory recursively and for each file I will calculate the SHA-256 sum and put the sum and path to a hash map, printing those sums for which there is more than one path entry.

Experiment

I decided to write this in scala and in rust.

Results

In Scala I use following case class representing a path with its sha sum:

1
case class ShaRes(val sha: String, p: os.Path)

Scala with external process called per file

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
private def usingExternal(p: os.Path): Option[ShaRes] = { 
    val res: os.CommandResult = os.proc("shasum", "-a", "256", p.toString).call()
    List(res.out.text().split("\\s+") *) match {
      case md :: rest => Some(ShaRes(md, p))
      case v => {
        println(Cons.err(s"Could not parse $p"))
        None
      }
    }
  }

I build scala with: scala-cli --power package Main.scala -o books and run it with time ./books /home/karma/Dokumenty/books/

Scala with external process call for each file:

1
2
3
real    0m36,687s
user    0m35,232s
sys     0m4,635s

Scala using MessageDigest:

Here the path is passed to MessageDigets’ update method:

1
2
3
4
5
6
7
8
9
private def usingDigest(p: os.Path): Option[ShaRes] = {
    import java.nio.file.Files
    import java.security.MessageDigest
    
	val md = MessageDigest.getInstance("SHA-256")
    md.update(Files.readAllBytes(p.wrapped))
    val sha = md.digest().map(b => f"$b%02x").mkString
    Some(ShaRes(sha, p))
  }

Result:

1
2
3
real    0m5,642s
user    0m5,899s
sys     0m1,136s

Rust

Sha calculation uses sha2 crate:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
fn sha_read<T: Read>(mut r: T) -> io::Result<String> {
    let mut sha256 = Sha256::new();
    io::copy(&mut r, &mut sha256)?;
    let hash = sha256.finalize();
    let result = format!("{:x}", hash);
    Ok(result.to_string())
}
pub fn sha(fname: &str) -> io::Result<String> {
    let file = File::open(fname)?;
    sha_read(file)
}

and iterates over directory entries using walkdir crate. ShaStore datastructure aims to mimic hashmap approach from scala:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
#[derive(Debug)]
struct ShaStore<'a> {
    shas: HashMap<String, Vec<&'a Path>>,
    errors: Vec<String>,
}
impl<'a> ShaStore<'a> {
    fn new() -> Self {
        ShaStore {
            shas: HashMap::new(),
            errors: Vec::new(),
        }
    }
    fn update(&mut self, pref: &'a Path) -> io::Result<()> {
        let shares = pref.to_str().map(|p| sha(p));
        match shares {
            Some(Ok(shaval)) => {
                if let Some(v) = self.shas.get_mut(&shaval) {
                    v.push(pref);
                } else {
                    self.shas.insert(shaval.clone(), vec![pref]);
                }
            }
            Some(Err(e)) => {
                self.errors.push(e.to_string());
            }
            None => self.errors.push("No sha".to_string()),
        };

        Ok(())
    }
}

Rust with calculation:

1
2
3
real    0m4,796s
user    0m3,846s
sys     0m0,704s

First conclusion

Calling a subprocess and wait synchronously is a huge overkill. It seems that digest calculation in-process is really very fast - the difference between scala and rust is not very significant. Well, most of the time we’re just accessing filesystem and calculating sha.

  • individual sha calculations are independent and can be done in parallel
  • what datastructure in Scala is safe for concurrent acces?
  • in scala I can try to spawn a Future; in Rust I could Well, perhaps it is a good moment to try and introuduce some concurrency?