hyperscan介绍

首席工程师: How we match regular expressions

Hyperscan is an automata-based (e.g. NFA/DFA) style approach rather than a back-tracking approach. The automata-based approach yields advantages and disadvantages: on the plus side, an automata-based approach is amenable to streaming and handling of multiple regular expressions. On the minus side, automata-based regular expression matching cannot handle some regular expression constructs easily, or at all – backreferences and arbitrary lookaround asserts are notable regular expression features that we do not support.

Hyperscan makes use of many different techniques to try to make the regular expression matching task tractable for large numbers of regular expressions. We have not found a single, elegant automata approach that handles arbitrary regular expressions in arbitrary number – although we are still looking! Instead, we have a lot of optimizations in order to try to extract the best possible performance whether we have one simple regular expression or tens of thousands.

Some of these techniques include:

* Discovery of literal (fixed string) factors and decomposition of regular
  expressions into smaller chunks (which we call "engines") separated by these
  literal factors.
* These engines can be of many different types:
  * Deterministic Finite Automata (DFA)
  * Bit-parallel Glushkov Non-deterministic Finite Automata (NFA) engines
  * Custom engines for special cases (such as large bounded repeats).
    * These engines can take many different roles:
      * "Prefix" engines that precede our literal factors
      * "Suffix" engines that follow our literal factors
      * "Infix" engines that lie between two literal factors
      * "Outfix" engines that aren’t connected at all with literal factors
        (when no satisfactory factors can be found in a regular expression)
    * These engines can often run lazily or not at all to reduce overhead.
* We merge smaller DFA/NFA engines into larger ones, where this can be done
  without performance loss.
* SIMD "acceleration" of automata based scanning: where we can substitute
  relatively simple SIMD tests of our input for complex automata execution,
  we do it.
* We use Intel SIMD instructions to handle larger-scale NFA and literal matching
  tasks: having 128 or 256 (or more) bits for a bit-parallel automaton is often helpful.
* ... and many more short-cuts to attempt to avoid doing expensive automata
  calculations that we ultimately won’t need.

Hyperscan is very much a work in progress – from 2008, we have been rapidly iterating the product and supporting customers. There are many inelegant bits to our code – sometimes because the real world seems to dictate inelegant solutions (e.g. corner cases), sometimes because we haven’t had a chance to clean things up.

Many parts of our codebase are earmarked for cleanup and streamlining, and if something looks peculiar to you, please call it out – it might look peculiar because it is peculiar and be amenable to a simple fix. Bear in mind the following, though:

Hyperscan is trying to accomplish a lot of goals (some of which are contradictory):

  • We’d like to be fast – providing high levels of performance under normal use cases and in boundary conditions (e.g. short writes, floods of single characters, worst case inputs)
  • We’d like to have a small bytecode (our read-only pattern database)
  • We’d like to have a small stream state if we’re running in streaming mode (our ‘per stream’ cost – if we use 192 bytes for a pattern set then it’s going to take 1920Mbytes to support 10M streams, of course)

We also have some constraints:

  • Our run-time must be written in C, as some data-plane environments don’t support C++
  • We cannot make arbitrary requests for memory in our run-time; we can only work within our bytecode (read-only and fixed at compile time), scratch space and stream state space.
  • Our bytecode must be serializable to a flat representation and needs to be able to be moved from memory location to memory location (i.e. no pointers internally)
  • Our data structures – especially stream states and inputs – may be allocated ‘next door’ to parts of memory that cannot be read, so we must be careful to not use large data reads outside our input or stream state areas.
  • As a rule, we prefer to support constructs only if they can be implemented efficiently in streaming mode as well as block mode, and prefer to take an all-or-nothing approach to supporting whole classes of constructs.

Parts of the regular expression task are unsolved in Hyperscan due to a lack of demand thus far in our target markets – but are not necessarily unsolvable in the Hyperscan framework. We think that there are a lot of possibilities to improve Hyperscan’s support of some regular expression constructs (lookaround asserts, particularly) and regular expression functionality (capturing subexpressions and accurate simulation of the behavior of quantifiers in backtracking engines).

源码结构

这里的源码指hyperscan软件包解压后的src目录中的内容.

目录树:

src
 ├─compiler    #
 ├─fdr         #
 ├─hwlm        #
 ├─nfa         #
 ├─nfagraph    #
 ├─parser      #
 ├─rose        #
 ├─smallwrite  #
 ├─som         #
 └─util        #

src目录下还有一些文件, 包括基本结构体的定义, 对外接口等.