
(This is a follow-up to GNU "parallel --pipe" doesn't process stdin by lines as I cannot get the behavior I want even after reading that question and applying the answer that was given.)

I'm trying to use GNU parallel to compare two directory trees, using -N 1 as advised above to process checksums line by line:

(cd /path/to/dirA; find -type f | parallel --bar 'sha256sum -b') \
| (cd /path/to/dirB; parallel --pipe -N1 'sha256sum -c -')

The desired behavior is that the second parallel should launch exactly one process for each line of input, without unnecessary delays or buffering. (The files are large enough that the fork/exec overhead does not matter.)

However, all I see is a bunch of perl processes endlessly waiting for data.


There is a --block option whose documentation suggests that the processes might be waiting for 1M worth of data:

--block size
--block-size size

Size of block in bytes to read at a time.

The size can be postfixed with K, M, G, T, P, k, m, g, t, or p.

GNU parallel tries to meet the block size but can be off by the length of one record. For performance reasons size should be bigger than two records. GNU parallel will warn you and automatically increase the size if you choose a size that is too small.

If you use -N, --block should be bigger than N+1 records.

size defaults to 1M.

However, the help text for the -N option suggests that it is a (slower) alternative to --block, i.e. by specifying it I expect not to have to care about the block size:

--max-replace-args max-args
-N max-args

<...>

When used with --pipe -N is the number of records to read. This is somewhat slower than --block.


Now this might mean that I need to try to fiddle with the --block size, but this feels unnecessary. I do not care about any blocks of predetermined size, I only care about whole lines. Thus, the question:

How do I make parallel --pipe process input in lines without any sort of arbitrary delay or buffering, using the equivalent of fgets()?
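(For reference, the fgets()-style behavior I am after can be approximated without parallel by a plain shell loop: one process per line, no block buffering. This is only a sketch; the path is a placeholder and there is no parallelism here:)

```shell
# Line-at-a-time processing without any block buffering: the shell's
# `read` builtin consumes exactly one line per iteration.
cd /path/to/dirB
while IFS= read -r line; do
    # Feed the single checksum line to sha256sum, which then checks
    # the file named in that line.
    printf '%s\n' "$line" | sha256sum -c -
done
```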

intelfx

1 Answer

You are indeed right: you need to fiddle with --block. GNU Parallel reads a pipe in chunks of size --block, and only within each chunk does it search for line boundaries. This is also why --block without -N is faster.

So what you see is the second GNU Parallel reading a full 1 MB before it starts its first job. And the full list of checksum lines is probably not even 1 MB, so the checking only starts after the first parallel completes and closes the pipe.
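The buffering is easy to observe (a sketch; it assumes GNU parallel is installed, and the 3-second sleep is only there to make the stall visible):

```shell
# Default --block 1M: "one" is not printed until the writer exits,
# because parallel keeps reading until the block is full or EOF.
(echo one; sleep 3; echo two) | parallel --pipe -N1 cat

# Tiny --block: "one" is dispatched to a job as soon as it arrives.
(echo one; sleep 3; echo two) | parallel --block 30 --pipe -N1 cat
```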

This works for me:

(cd dirA; find -type f | parallel --bar sha256sum) |
  (cd dirB; parallel --block 30 --pipe -N1 sha256sum -c)

If you get:

parallel: Warning: A record was longer than 30. Increasing to --blocksize 40.

it simply means that GNU Parallel adjusted --block 30 to --block 40 because a full record did not fit in a single block. It is no cause for alarm, but feel free to change it to --block 40.

This is of course dead slow if the pipe contains TB of data, but in your case it only contains file names, so you can spare the extra clock cycles:

# 100-1000 MB/s
yes `seq 10000` | pv | parallel --block 30M --pipe 'cat >/dev/null'
# 1-10 KB/s
yes `seq 10` | pv | parallel --block 30 -N1 --pipe 'cat >/dev/null'
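If fiddling with --block still feels wrong, the same one-process-per-line dispatch can also be had without GNU parallel. A sketch using xargs (the -d and -P flags assume GNU xargs; the paths are the placeholders from the question):

```shell
# Each checksum line becomes one argument (-d '\n' -n1); up to 4 checks
# run at a time (-P4). sh receives the line as $0 and pipes it to
# sha256sum -c as a one-line checksum list.
(cd /path/to/dirA; find -type f -exec sha256sum -b {} +) |
  (cd /path/to/dirB; xargs -d '\n' -n1 -P4 sh -c 'printf "%s\n" "$0" | sha256sum -c -')
```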
Ole Tange