How do you explain divisibility of numbers to a six-year-old? Cuisenaire rods provide a haptic, pre-numeracy way of conceptualizing division (rods of identical length \(m\) are added \(k\) times to equal a rod of given length \(n\)) and divisibility (for which \(n\) and \(m\) can this task be solved?).

So, put a rod of length \(n\) on the table, then begin adding rods of length \(m=2\). If these do not fit exactly, continue with \(m=3\), and so on.

The number \(9\) is not divisible by \(2\), but by \(3\).

The number \(7\) can be divided by no number \(1<m<7\), thus is prime.
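The tabletop procedure translates directly into a few lines of Python; this is just an illustrative sketch, and the names rods_fit and smallest_rod are made up for it:

```python
def rods_fit(n, m):
    """Add rods of length m until they reach or pass a rod of length n."""
    total = 0
    while total < n:
        total += m        # put down another rod of length m
    return total == n     # the rods fit exactly iff m divides n

def smallest_rod(n):
    """Try m = 2, 3, ... as in the tabletop procedure."""
    for m in range(2, n):
        if rods_fit(n, m):
            return m      # first rod length that fits: smallest factor
    return n              # no rod of length 1 < m < n fits: n is prime

print(smallest_rod(9), smallest_rod(7))
# → 3 7
```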

Can we automate this activity with an equivalent process on a computer, say with regular expressions? Consider the following Python function:

from collections import Counter
import re
def f_re(n):
  f = Counter()
  while (m := re.fullmatch(r'(11+?)\1*', '1'*n)):
    l = len(m.group(1))
    f[l] += 1
    n //= l
  return dict(f)

Guess what it is doing…

How it began: the following curiosity showed up in my feed: the regular expression
^1?$|^(11+?)\1+$
can be used to test if a number \(n\) is prime. The explanation is simple and equivalent to the Cuisenaire rods primality test presented above: after converting a number \(n\) into its unary string representation \(s_n={\tt 111}\ldots {\tt 1}\) (\(n\) times), you check if a substring \(s_m\) of length \(1<m<n\) exists that evenly divides into \(s_n\). If so, \(n\) is composite, else it is prime. This is what the right-hand alternative of this regex tests. The left-hand alternative treats the special cases \(n=0,1\), which are non-prime by definition.
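The right-hand alternative can be watched at work in isolation; a small sketch for \(n=9\) (composite) and \(n=7\) (prime):

```python
import re

p = re.compile(r'(11+?)\1+')
m = p.fullmatch('1' * 9)       # 9 ones: composite
print(len(m.group(1)))         # length of the dividing substring s_m
print(p.fullmatch('1' * 7))    # no substring evenly divides: 7 is prime
# → 3
# → None
```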

The labor of searching for \(m\) is left to the regex engine. The ? in (11+?) makes this pattern non-greedy, so the search starts at \(m=2,3\ldots\). The backreference \1+ requires that whatever (11+?) matched must repeat \(k\) times for some \(k\ge 1\) (this corresponds to using rods of equal length), and the anchors ^ and $ ensure that the pattern matches only the full string \(s_n\). In Python, re.fullmatch() implies these anchors, so this can be further simplified into

import re
ip = re.compile(r'1?|(11+?)\1+')
def isprime_re(n):
  return not ip.fullmatch(n*'1')
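As a quick sanity check (the definition is repeated here so the snippet runs standalone):

```python
import re

ip = re.compile(r'1?|(11+?)\1+')

def isprime_re(n):
    # prime iff the unary string matches neither alternative
    return not ip.fullmatch(n * '1')

print([n for n in range(20) if isprime_re(n)])
# → [2, 3, 5, 7, 11, 13, 17, 19]
```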

Have you ever seen a more minimalist primality test? Obviously, we can expect this to be inefficient to the point of silliness. In particular for primes, the regex engine will supposedly¹ iterate over all integers \(m=2,3,4\ldots n\), where a search \(m=2,3,5,7\ldots\lfloor\sqrt{n}\rfloor,\; m\in\mathbb{P}\) would suffice.

How does this compare in practice? In what follows, we compare the execution times of the non-greedy r'1?|(11+?)\1+' and greedy r'1?|(11+)\1+' regex variants in Python. Given that Perl’s regex engine has become a model for many others and is reputedly quite efficient, we also look at the Perl version

sub isprime_re {
  my ( $n ) = @_ ;
  ( "1" x $n ) !~ /^1?$|^(11+?)\1+$/ + 0 ;
}

(and analogously for the greedy variant). We further include a moderately simple primality test in Python that skips multiples of 2, 3, and 5, based on an algorithm I originally found in a collection of HP calculator programs (HP 67/97 Math Pac, page L01-02, “Factors and Primes”); the technique is also known as wheel factorization (here simplified to a primality test),

from itertools import chain, cycle
def isprime(n):
  if n < 2:
    return False
  # the increments 1, 2, 2 take i through 2, 3, 5; the cycle then visits
  # only numbers coprime to 30, skipping all multiples of 2, 3, and 5
  ni = chain([1,2,2],cycle([4,2,4,2,4,6,2,6]))
  i = 2
  while i**2 <= n:
    if n % i == 0:
      return False
    i += next(ni)
  return True

and finally Sympy’s built-in primality test sympy.ntheory.primetest.isprime(n). The following double-logarithmic graph shows execution times on an old Intel Xeon server running at 2.6 GHz for all these tests, along with auxiliary black lines to indicate \(O(n)\) and \(O(n^2)\) complexity.

As expected, the times for the wheel and sympy algorithms remain at least an order of magnitude below the regex variants; we will ignore them for the rest of this article. For each of the four regex implementations (Perl and Python, each with the greedy and the non-greedy pattern) we see distinct clusters of points, showing clear differences in complexity. For Perl, the greedy (orange) and non-greedy (blue) times are similar, and the non-greedy Python variant (green) even outperforms Perl. All of these appear to be of roughly \(O(n)\) complexity, as indicated by the black lines (perhaps with an additional factor \(\log(n)\), which would only become apparent at much larger values of \(n\)). Python’s greedy execution times (red) are much higher and appear to have quadratic complexity \(O(n^2)\).

The discontinuities in some of the quasi-curves are due to transient fluctuations during execution and are not reproducible. They could be eliminated by repeating each measurement several times and taking the minimum execution time for each case.
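Such a minimum-of-repeats measurement can be sketched with timeit; the helper best_time is hypothetical and not part of the original benchmark:

```python
import re
import timeit

ip = re.compile(r'1?|(11+?)\1+')

def isprime_re(n):
    return not ip.fullmatch(n * '1')

def best_time(n, repeats=5, number=10):
    # the minimum over several repeats suppresses one-off system-load spikes
    return min(timeit.repeat(lambda: isprime_re(n),
                             repeat=repeats, number=number)) / number

print(f"n=9973: {best_time(9973):.2e} s per call")
```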

The fact that for each method the data points do not form smooth curves but are aggregated in families of curve-like clusters reflects how the prime factor composition of these numbers affects the matching algorithm, as illustrated by the graph below. It animates² run times for composite numbers \(n\) with increasing smallest prime factors, i.e. \(n\in k\times\mathbb{P},\,k=2, 3, 5, 7, 11, 13, 17\), and prime numbers \(n\in\mathbb{P}\).

This shows that the regex runtime depends on the smallest prime factor of the composite numbers, and that prime numbers take the longest to test, with – unsurprisingly – no difference between greedy and non-greedy patterns. The fact that Perl’s run times for composite numbers differ little between greedy and non-greedy, whereas the greedy Python regex starts off at \(O(n^2)\), shows that Python’s regex engine, while perhaps modeled after Perl’s, is clearly an independent implementation.
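The difference between the two patterns is also easy to observe directly: the non-greedy pattern captures the smallest prime factor, whereas the greedy one captures the largest proper divisor. A small sketch:

```python
import re

n = 15  # 15 = 3 * 5
lazy   = re.fullmatch(r'(11+?)\1+', '1' * n)
greedy = re.fullmatch(r'(11+)\1+',  '1' * n)
print(len(lazy.group(1)), len(greedy.group(1)))
# → 3 5
```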

  1. … or will it? Let’s keep this question open for another exploration… ↩︎
  2. The animation does not work on all browsers. ↩︎