Struct regex_automata::dfa::onepass::Config

source ·

pub struct Config { /* private fields */ }

Expand description

The configuration used for building a one-pass DFA.

A one-pass DFA configuration is a simple data object that is typically used with Builder::configure. It can be cheaply cloned.

A default configuration can be created either with Config::new, or perhaps more conveniently, with DFA::config.

Implementations§

source §

impl Config

source

pub fn new() -> Config

Return a new default one-pass DFA configuration.

source

pub fn match_kind(self, kind: MatchKind) -> Config

Set the desired match semantics.

The default is MatchKind::LeftmostFirst, which corresponds to the match semantics of Perl-like regex engines. That is, when multiple patterns would match at the same leftmost position, the pattern that appears first in the concrete syntax is chosen.

Currently, the only other kind of match semantics supported is MatchKind::All. This corresponds to “classical DFA” construction where all possible matches are visited.

When it comes to the one-pass DFA, it is rarer for preference order and “longest match” to actually disagree. Since if they did disagree, then the regex typically isn’t one-pass. For example, searching Samwise for Sam|Samwise will report Sam for leftmost-first matching and Samwise for “longest match” or “all” matching. However, this regex is not one-pass if taken literally. The equivalent regex, Sam(?:|wise) is one-pass and Sam|Samwise may be optimized to it.

The other main difference is that “all” match semantics don’t support non-greedy matches. “All” match semantics always try to match as much as possible.

source

pub fn starts_for_each_pattern(self, yes: bool) -> Config

Whether to compile a separate start state for each pattern in the one-pass DFA.

When enabled, a separate anchored start state is added for each pattern in the DFA. When this start state is used, then the DFA will only search for matches for the pattern specified, even if there are other patterns in the DFA.

The main downside of this option is that it can potentially increase the size of the DFA and/or increase the time it takes to build the DFA.

You might want to enable this option when you want to both search for anchored matches of any pattern or to search for anchored matches of one particular pattern while using the same DFA. (Otherwise, you would need to compile a new DFA for each pattern.)

By default this is disabled.

§Example

This example shows how to build a multi-regex and then search for matches for a any of the patterns or matches for a specific pattern.

use regex_automata::{
    dfa::onepass::DFA, Anchored, Input, Match, PatternID,
};

let re = DFA::builder()
    .configure(DFA::config().starts_for_each_pattern(true))
    .build_many(&["[a-z]+", "[0-9]+"])?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let haystack = "123abc";
let input = Input::new(haystack).anchored(Anchored::Yes);

// A normal multi-pattern search will show pattern 1 matches.
re.try_search(&mut cache, &input, &mut caps)?;
assert_eq!(Some(Match::must(1, 0..3)), caps.get_match());

// If we only want to report pattern 0 matches, then we'll get no
// match here.
let input = input.anchored(Anchored::Pattern(PatternID::must(0)));
re.try_search(&mut cache, &input, &mut caps)?;
assert_eq!(None, caps.get_match());

source

pub fn byte_classes(self, yes: bool) -> Config

Whether to attempt to shrink the size of the DFA’s alphabet or not.

This option is enabled by default and should never be disabled unless one is debugging a one-pass DFA.

When enabled, the DFA will use a map from all possible bytes to their corresponding equivalence class. Each equivalence class represents a set of bytes that does not discriminate between a match and a non-match in the DFA. For example, the pattern [ab]+ has at least two equivalence classes: a set containing a and b and a set containing every byte except for a and b. a and b are in the same equivalence class because they never discriminate between a match and a non-match.

The advantage of this map is that the size of the transition table can be reduced drastically from (approximately) #states * 256 * sizeof(StateID) to #states * k * sizeof(StateID) where k is the number of equivalence classes (rounded up to the nearest power of 2). As a result, total space usage can decrease substantially. Moreover, since a smaller alphabet is used, DFA compilation becomes faster as well.

WARNING: This is only useful for debugging DFAs. Disabling this does not yield any speed advantages. Namely, even when this is disabled, a byte class map is still used while searching. The only difference is that every byte will be forced into its own distinct equivalence class. This is useful for debugging the actual generated transitions because it lets one see the transitions defined on actual bytes instead of the equivalence classes.

source

pub fn size_limit(self, limit: Option<usize>) -> Config

Set a size limit on the total heap used by a one-pass DFA.

This size limit is expressed in bytes and is applied during construction of a one-pass DFA. If the DFA’s heap usage exceeds this configured limit, then construction is stopped and an error is returned.

The default is no limit.

§Example

This example shows a one-pass DFA that fails to build because of a configured size limit. This particular example also serves as a cautionary tale demonstrating just how big DFAs with large Unicode character classes can get.

use regex_automata::{dfa::onepass::DFA, Match};

// 6MB isn't enough!
DFA::builder()
    .configure(DFA::config().size_limit(Some(6_000_000)))
    .build(r"\w{20}")
    .unwrap_err();

// ... but 7MB probably is!
// (Note that DFA sizes aren't necessarily stable between releases.)
let re = DFA::builder()
    .configure(DFA::config().size_limit(Some(7_000_000)))
    .build(r"\w{20}")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let haystack = "A".repeat(20);
re.captures(&mut cache, &haystack, &mut caps);
assert_eq!(Some(Match::must(0, 0..20)), caps.get_match());

While one needs a little more than 3MB to represent \w{20}, it turns out that you only need a little more than 4KB to represent (?-u:\w{20}). So only use Unicode if you need it!

source

pub fn get_match_kind(&self) -> MatchKind

Returns the match semantics set in this configuration.

source

pub fn get_starts_for_each_pattern(&self) -> bool

Returns whether this configuration has enabled anchored starting states for every pattern in the DFA.

source

pub fn get_byte_classes(&self) -> bool

Returns whether this configuration has enabled byte classes or not. This is typically a debugging oriented option, as disabling it confers no speed benefit.

source

pub fn get_size_limit(&self) -> Option<usize>

Returns the DFA size limit of this configuration if one was set. The size limit is total number of bytes on the heap that a DFA is permitted to use. If the DFA exceeds this limit during construction, then construction is stopped and an error is returned.