Many thanks to Dr. Luciana Buriol, I recently updated the code to fix a few bugs. Please find the recent version of code here. The result from a test run:
Min-wise Independent Permutation Families are newly defined on [1]. One application of such a family (under some relaxations) is essential to the algorithm used in practice by the AltaVista web index software to detect and filter nearduplicate documents. There are also potential applications of such families in de-randomization. A permutation family F (subset of all permutations over [1..n]) is exactly min-wise independent if for any subset X of [1..n], and any x in X, when p is chosen at random from F we have
Pr{min{p(X)}=
p(x)}=1/|X|.
In practice, we can and sometimes have to allow certain relaxations. One of the relaxed definition is approximate restricted min-wise permutation family. We say that F is approximately minwise independent with relative error e, restricted for sets up to size k, if for any set subset X of [1..n] with |X|<k and any x in X, when p is chosen at random in F we have
|Pr{min{p(X)}= p(x)}-1/|X||<e/|X|.
This project implements an permutation generator to construct an approximate min-wise (1/2^err away), restricted (by 2^k) permutation family for [0..2^n-1], given parameters: n, k, err. This C++ implementation consists of several classes:
For given parameters n, k, and err, we construct hi to be almost ki-wise independent hashing function from [0..2^n-1] to [0..2^ri-1], where
k1=k
k(i+1)=ki-1
We choose ri so that hi maps any set of size 2^ki into [2^ri] in such a way so that no bucket has size greater than 2^k(i+1) with probability at least 1-1/(2*2^e*k). hi will be close enough to ki-wise independent so that the difference adds a error probability 1/(2*k*2^e).
ri=2+2/ki*(e+1+log2(k))
hi can decide the first significant ri bits for a input x over the permutation. Each hashing function can be implemented by get a string from an almost 2^ki*ri-wise independent distribution on 2^n*ri bits.
The space cost for this construction will be O(loglogN + logK + log(1/E)), where N=2^n, K=2^k, E=1/2^err.
We use a LFSR implementation to generate the distribution. A LFSR is defined as:
Given two m-bit sequence s and f. the shift register sequence generated by feedback rule f and start sequence s is
ri=si, if i<m
or sum{fj*r(I-m+j)} for j=0...m-1, if i>=m
In [2] the authors states that if f is a non-degenerate, i.e., f0=1 and f(t)=t^m+sum{fj*t^j} 0<j<m is a reducible polynomial, such m-LFSR family can generate approx (e away in L1 form) k-wise Independent distribution over n bits, if m>=k*2log(k*log(n)/2e).
The space cost for this construction will be O(k+2log(k*log(N)/(2*e)).
Since irreducible polynomials are quite
common (somewhat like prime numbers), it is quite easy to find them using the
following efficient algorithm for testing for irreducibility over GF(2):
int poly_irreducible( const vlong & x )
{ unsigned d = x.bits()-1; vlong u = 2; for (unsigned i=1;i<d;i+=1) { u = poly_rem( poly_mul(u,u), x ); vlong g = poly_gcd( u^2, x ); if ( g != 1 ) return 0; } return 1;}
“Here poly_mul, poly_rem denote
polynomial multiplication and remainder, and poly_gcd finds the greatest common
factor of two polynomials (essentially Euclid's algorithm). Note that the function
is using a mixture of normal 2's complement arithmetic and polynomial routines
in a slightly obscure way; x.bits() is 1+log2(x), and ^ is the 'C' exclusive-or
operation not exponentiation.”
The Time complexity is O(d^3) for
testing a polynomial with degree d.
In this implementation, we use an implementation of the "minimal standard" multiplicative linear congruential generator from [4]. Also there is a piece of ugly written code to test the permutation generator.
To store an approx restricted min-wise permutation, the space cost is O(loglogN + logK + log(1/E)), where N=2^n, K=2^k, E=1/2^err.
To construct an approx restricted min-wise permutation, the time cost is in the same order as O(loglogN + logK + log(1/E)), where N=2^n, K=2^k, E=1/2^err. To compute the permuted position map(x) for input x, the cost is O(N*(loglogN+logK+log(1/E)).
Please note this implementation serves mainly as a class project to understand the concept of min-wise independent permutation. Currently, it can only handle n<32, also the space cost is not well managed. A more practical version can be implemented upon well-done mathematical libraries, such as NTL[5] or matlab. The following figure shows how well this implementation approximates min-wise independent permutation. “eps-n-k-err-t” is defined as
Eps-n-k-err=|Pr{min{p(X)}=
p(0)}*|X|-1|.
Where X={0,1,…,2^k-1}
Pr{min{p(X)}= p(0)} is the observed probability for t permutations constructed using mwPermutation class.
Ideally, eps-n-k-err-t will be bounded by 1/2^err after a large number of trials. This characteristic can be observed clearly by “eps-8-3-5-4000” and “eps-10-4-5-4000”. However for larger n, k, e.g. “eps-10-5-5-4000”, this is not obvious, partially because our 4000 sampling trials only cover a very small faction of permutation space.

1. Min-wise independent Permutations, Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher, 1999
2. Simple Constructions of Almost k-wise Independent Random Variables, Noga Alon, Oded Goldreich, Johan, Hastad, Rene Peralta, 1992
3. Elliptic curve cryptography FAQ v1.12 22nd December 1997
4. "Random Number Generators: Good Ones are Hard to Find", Park, S.K. and Miller, K.W., CACM 31:10, Oct. 88, pp. 1192-1201.
5. NTL: A Library for doing Number Theory, Victor Shoup, http://www.shoup.net/ntl/
Comments
and Suggestions are welcome. Jerry Zhao.
2000/05/04