9

I've got a couple of disks which have a large set of files which are mostly the same. However, in a few cases there are files on one disk which differ from files on the other disk. There are also a lot of files where the files are identical on both drives, but the timestamps differ.

For my purposes I need to find just the files which are actually different. If I run:

rsync --dry-run -HPrlt

it finds not only the files that are different, but also the files that differ only by timestamp, leaving me with extra work to determine whether these are false positives.

I also thought to try:

rsync -c --dry-run -HPrlt

But this command takes a lot longer. In fact, the former command ran in a few seconds (presumably because the directory structure was already in cache from a previous rsync) whereas the latter command is still running. I suspect that this is because rsync is relying entirely on the checksum to determine what files need transferring, instead of something slightly more intelligent like only using a checksum if the timestamps differ.

How can I quickly see only the files that actually differ?

Note: This is not a duplicate of How to print files that would have been changed using rsync? because as the highest rated comment to the highest rated answer points out, --dry-run will show files that are identical if their timestamps differ.

Michael
  • 2,824

3 Answers

9

rsync -HPrl --itemize-changes --dry-run source/ destination/ | grep -Fv "f..T......"| grep -Fv "d..T......"| cut -d " " -f 2-

Note: do not omit the trailing slash on the source directory.

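--itemize-changes prints a change summary for every update. Combining it with --dry-run and grepping out the files and directories that need only a timestamp update gives the required output quickly. Note that without -c rsync does not read file contents, so a file whose content changed while its size stayed the same is filtered out as well.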
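As a hedged sketch, the same filter can be wrapped in a small shell function. The name filter_timestamp_only is made up for illustration. The pattern assumes the 11-character flag string that rsync 3.x prints for --itemize-changes, and it treats a lowercase t the same as T, which goes slightly beyond the command above.

```shell
# Hypothetical helper (not part of rsync): read --itemize-changes output on
# stdin, drop entries whose only change flag is the timestamp, and print the
# remaining paths.
# Assumption: flags use the 11-character layout YXcstpoguax, so a
# timestamp-only change looks like ">f..T......" or ".d..t......".
filter_timestamp_only() {
    grep -Ev '^.[fd]\.\.[tT]\.{6} ' | cut -d ' ' -f 2-
}
```

It would be used as: rsync -HPrl --itemize-changes --dry-run source/ destination/ | filter_timestamp_only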

sxc731
  • 475
mittal
  • 191
1

It's possible to run rsync in two stages:

  1. Generate a list of all files differing in size or timestamp (which might wrongly include some files that are identical)
  2. rsync using this list and with the checksum-compare option to find the real differences.

This answer is based on the post
Reuse rsync --dry-run output to speed up the actual transfer later on.

Using the file list generated during a dry run as an include file requires removing the extra lines at the top and bottom of the dry-run output.

Example output:

sending incremental file list
[LIST OF FILES]

sent 226 bytes  received 34 bytes  520.00 bytes/sec
total size is 648,373,274  speedup is 2,493,743.36 (DRY RUN)

To remove the superfluous lines and leave only the files-list:

rsync --dry-run -avz source/ destination/ | head --lines=-3 | tail --lines=+3 > include.txt
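Counting lines with head and tail depends on the exact shape of the output. A minimal alternative sketch is to drop the header and summary lines by their content instead. The function name strip_rsync_summary is hypothetical, and the patterns assume the English-language summary lines shown in the example output; a file whose name happens to start with "total size is " would be dropped too.

```shell
# Hypothetical helper: keep only the file list from `rsync --dry-run -v` output.
# Assumption: the header and the two summary lines look like the example above.
strip_rsync_summary() {
    sed -e '/^sending incremental file list$/d' \
        -e '/^$/d' \
        -e '/^sent .* bytes  *received .* bytes /d' \
        -e '/^total size is /d'
}
```

It would be used as: rsync --dry-run -avz source/ destination/ | strip_rsync_summary > include.txt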

For rsync to use this file (add additional options as desired):

rsync -c --include-from=include.txt --exclude='*' source/ destination/
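The second stage does not have to be rsync. As a sketch of the same two-stage idea with a different tool, the candidate list can be fed to cmp, which reads only the listed files. The function name content_differs is made up for illustration, and it assumes one relative path per line with no newlines inside file names.

```shell
# Hypothetical helper: read relative paths on stdin and print those whose
# content differs between the two trees, or that are missing from the second.
# $1 = source tree, $2 = destination tree.
content_differs() {
    while IFS= read -r path; do
        # Skip directories and anything else that is not a regular file.
        [ -f "$1/$path" ] || continue
        # cmp -s is silent; it exits non-zero on a difference or a missing file.
        cmp -s -- "$1/$path" "$2/$path" || printf '%s\n' "$path"
    done
}
```

It would be used as: content_differs source destination < include.txt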

EDIT: I have reproduced the problem using the poster's gist and can add that whenever --dry-run is specified, all the files are marked for sync, no matter which combination of parameters is used.

I think that the problem is actually with --dry-run, perhaps because it's checking too many metadata attributes. Seems perhaps like a bug.

harrymc
  • 498,455
1

This could be an XY problem: you have an issue to solve, but are asking how to solve it with rsync.

The question asks about rsync, but it's possible that a timestamp difference will always show as "different". I'm not sure there's a "-c but ignore timestamp" option. Whatever tool you use, it must read the entire file to verify its content.

Here is a possible alternative (non-rsync) solution:

Hash the trees and find the differences. This will produce a list of files that differ. By "differ" I mean any of:

  • the content changed
  • the file exists on one side but not the other
cd /tree1
find -type f -print0 | sort -z | xargs -0 md5sum > /tmp/tree1.log

cd /tree2
find -type f -print0 | sort -z | xargs -0 md5sum > /tmp/tree2.log

diff -uw /tmp/tree1.log /tmp/tree2.log | grep '^[+-][0-9a-f]' | awk '{print $2}' | sort -u
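The awk step prints only the second field, so a path containing spaces gets cut short. A hedged sketch of a variant that keeps whole paths: merge the two logs, keep the lines that occur only once, and strip the checksum column. The function name changed_paths is hypothetical, and it assumes plain md5sum lines (32 hex digits, two separator characters, then the path) with each path listed once per log.

```shell
# Hypothetical helper: print every path whose checksum line appears in only
# one of the two md5sum logs (content changed, or file present on one side).
# $1 and $2 are the log files, e.g. /tmp/tree1.log and /tmp/tree2.log.
changed_paths() {
    # Lines present in both logs occur twice after the merge; uniq -u drops them.
    # cut -c 35- removes the 32-digit hash plus the two separator characters.
    sort "$1" "$2" | uniq -u | cut -c 35- | sort -u
}
```

It would be used as: changed_paths /tmp/tree1.log /tmp/tree2.log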

(For the md5sum naysayers: I know md5 is cryptographically broken, since practical collision attacks exist, but the OP (probably) isn't looking for something cryptographically critical, and md5 is faster than sha256.)

KJ7LNW
  • 548