hachoir.regex module

hachoir.regex is a Python library for regular expression (regex or regexp) manipulation. You can combine expressions with the a|b (or) and a+b (and) operators. Expressions are optimized during construction: ranges are merged, repetitions are simplified, etc. It also contains a pattern matching class that searches for multiple strings and regexes at the same time.

Regex examples

Regexes are optimized when they are created:

>>> from hachoir.regex import parse, createRange, createString
>>> createString("bike") + createString("motor")
<RegexString 'bikemotor'>
>>> parse('(foo|fooo|foot|football)')
<RegexAnd 'foo(|[ot]|tball)'>

Create a character range:

>>> regex = createString("1") | createString("3")
>>> regex
<RegexRange '[13]'>
>>> regex |= createRange("2", "4")
>>> regex
<RegexRange '[1-4]'>

As you can see, you can use the classic Python operators “a|b” (or) and “a+b” (and). Examples of regular expressions using repetition:

>>> parse("(a{2,}){3,4}")
<RegexRepeat 'a{6,}'>
>>> parse("(a*|b)*")
<RegexRepeat '[ab]*'>
>>> parse("(a*|b|){4,5}")
<RegexRepeat '(a+|b){0,5}'>

Compute the minimum/maximum length of a match:

>>> r=parse('(cat|horse)')
>>> r.minLength(), r.maxLength()
(3, 5)
>>> r=parse('(a{2,}|b+)')
>>> r.minLength(), r.maxLength()
(1, None)
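
The length bounds above follow simple rules that can be sketched without the library. The helpers below (literal, concat, alternate, repeat) are hypothetical names chosen for illustration and are not part of hachoir.regex; the sketch only assumes that None stands for "unbounded", as in the output of maxLength():

```python
# Minimal sketch of length-bound arithmetic; NOT hachoir's implementation.
# A bound is a (minimum, maximum) tuple; maximum None means "unbounded".

def literal(text):
    """A fixed string always matches exactly len(text) characters."""
    return (len(text), len(text))

def concat(*parts):
    """Sequence: minimums add up, maximums add up unless one is unbounded."""
    low = sum(part[0] for part in parts)
    if any(part[1] is None for part in parts):
        return (low, None)
    return (low, sum(part[1] for part in parts))

def alternate(*choices):
    """Alternation: smallest minimum, largest maximum."""
    low = min(choice[0] for choice in choices)
    if any(choice[1] is None for choice in choices):
        return (low, None)
    return (low, max(choice[1] for choice in choices))

def repeat(item, min_count, max_count=None):
    """Repetition: bounds are multiplied by the repeat counts."""
    low = item[0] * min_count
    if item[1] == 0 or max_count == 0:
        high = 0
    elif item[1] is None or max_count is None:
        high = None
    else:
        high = item[1] * max_count
    return (low, high)

# (cat|horse)
cat_or_horse = alternate(literal("cat"), literal("horse"))   # (3, 5)
# (a{2,}|b+)
a_or_b = alternate(repeat(literal("a"), 2), repeat(literal("b"), 1))  # (1, None)
# (a{2,}){3,4}, which the parser shows as a{6,}
nested = repeat(repeat(literal("a"), 2), 3, 4)               # (6, None)
```

The last line shows why a nested repetition such as (a{2,}){3,4} has a minimum length of 6 and no maximum.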

Pattern matching

Use PatternMatching if you would like to find many strings or regexes in a string. Use addString() and addRegex() to add your patterns:

>>> from hachoir.regex import PatternMatching
>>> p = PatternMatching()
>>> p.addString("a")
>>> p.addString("b")
>>> p.addRegex("[cd]")

And then use search() to find all patterns:

>>> for start, end, item in p.search("a b c d"):
...    print("%s..%s: %s" % (start, end, item))
...
0..1: a
2..3: b
4..5: [cd]
6..7: [cd]

You can also attach an object to a pattern with the ‘user’ (user data) argument:

>>> p = PatternMatching()
>>> p.addString("un", 1)
>>> p.addString("deux", 2)
>>> for start, end, item in p.search("un deux"):
...    print("%r at %s: user=%r" % (item, start, item.user))
...
<StringPattern 'un'> at 0: user=1
<StringPattern 'deux'> at 3: user=2

Create regular expressions

There are two ways to create regular expressions: parse a string or use the API directly.

Atom classes:

  • RegexEmpty: empty regex (match nothing)

  • RegexStart, RegexEnd, RegexDot: symbols ^, $ and .

  • RegexString

  • RegexRange: character range like [a-z] or [^0-9]

  • RegexAnd

  • RegexOr

  • RegexRepeat

All classes are based on the Regex class.

Create regex with string

>>> from hachoir.regex import parse
>>> parse('')
<RegexEmpty ''>
>>> parse('abc')
<RegexString 'abc'>
>>> parse('[bc]d')
<RegexAnd '[bc]d'>
>>> parse('a(b|[cd]|(e|f))g')
<RegexAnd 'a[b-f]g'>
>>> parse('([a-z]|[b-])')
<RegexRange '[a-z-]'>
>>> parse('^^..$$')
<RegexAnd '^..$'>
>>> parse('chats?')
<RegexAnd 'chats?'>
>>> parse(' +abc')
<RegexAnd ' +abc'>

Create regex with the API

>>> from hachoir.regex import createString, createRange
>>> createString('')
<RegexEmpty ''>
>>> createString('abc')
<RegexString 'abc'>
>>> createRange('a', 'b', 'c')
<RegexRange '[a-c]'>
>>> createRange('a', 'b', 'c', exclude=True)
<RegexRange '[^a-c]'>

Manipulate regular expressions

Convert to string:

>>> from hachoir.regex import createRange, createString, parse
>>> str(createString('abc'))
'abc'
>>> repr(createString('abc'))
"<RegexString 'abc'>"

Operators “and” and “or”:

>>> createString("bike") & createString("motor")
<RegexString 'bikemotor'>
>>> createString("bike") | createString("motor")
<RegexOr '(bike|motor)'>

You can also use the “+” operator, which is just an alias for “&”:

>>> createString("big ") + createString("bike")
<RegexString 'big bike'>
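
For readers curious how such operators can be wired up in Python, here is a minimal sketch using operator overloading. ToyRegex is a hypothetical class invented for this illustration; it is not the library's Regex class and performs none of its optimizations:

```python
# Toy illustration of "&", "+" and "|" on regex objects.
# ToyRegex is hypothetical and NOT part of hachoir.regex.

class ToyRegex:
    def __init__(self, source):
        self.source = source

    def __and__(self, other):
        # "and" means concatenation: match self, then other.
        return ToyRegex(self.source + other.source)

    # "+" is an alias for "&".
    __add__ = __and__

    def __or__(self, other):
        # "or" means alternation; parentheses keep it safe to concatenate.
        return ToyRegex("(%s|%s)" % (self.source, other.source))

    def __str__(self):
        return self.source

    def __repr__(self):
        return "<ToyRegex %r>" % self.source


bike = ToyRegex("bike")
motor = ToyRegex("motor")

both = bike & motor                 # <ToyRegex 'bikemotor'>
either = bike | motor               # <ToyRegex '(bike|motor)'>
big_bike = ToyRegex("big ") + bike  # <ToyRegex 'big bike'>
```

Unlike this toy, the real classes return different node types depending on the operands, as the reprs in the examples above show.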

Compute the minimum/maximum length of a match:

>>> r=parse('(cat|horse)')
>>> r.minLength(), r.maxLength()
(3, 5)

Optimizations

The library includes many optimizations to keep expressions small and fast.

Group prefix:

>>> createString("blue") | createString("brown")
<RegexAnd 'b(lue|rown)'>
>>> createString("moto") | parse("mot.")
<RegexAnd 'mot.'>
>>> parse("(ma|mb|mc)")
<RegexAnd 'm[a-c]'>
>>> parse("(maa|mbb|mcc)")
<RegexAnd 'm(aa|bb|cc)'>
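
The idea behind prefix grouping can be sketched with the standard library alone. The group_prefix function below is a hypothetical helper written for this illustration, not the library's algorithm; it only factors out a common literal prefix and does not merge single-character tails into a range the way 'm[a-c]' is produced above:

```python
# Sketch of prefix grouping for literal words; NOT hachoir's implementation.
import os
import re

def group_prefix(words):
    """Build a regex for the given literal words, sharing their common prefix."""
    prefix = os.path.commonprefix(words)
    # Escape the remaining tails so special characters stay literal.
    tails = [re.escape(word[len(prefix):]) for word in words]
    return re.escape(prefix) + "(" + "|".join(tails) + ")"

colors = group_prefix(["blue", "brown"])        # 'b(lue|rown)'
triples = group_prefix(["maa", "mbb", "mcc"])   # 'm(aa|bb|cc)'

# The grouped form accepts exactly the original words.
matches_brown = re.fullmatch(colors, "brown") is not None   # True
matches_blown = re.fullmatch(colors, "blown") is not None   # False
```

Factoring the prefix means a regex engine tests the shared characters once instead of once per alternative.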

Merge ranges:

>>> from hachoir.regex import createRange
>>> regex = createString("1") | createString("3"); regex
<RegexRange '[13]'>
>>> regex = regex | createRange("2"); regex
<RegexRange '[1-3]'>
>>> regex = regex | createString("0"); regex
<RegexRange '[0-3]'>
>>> regex = regex | createRange("5", "6"); regex
<RegexRange '[0-356]'>
>>> regex = regex | createRange("4"); regex
<RegexRange '[0-6]'>
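
The merging shown above can be approximated with a short interval-merge routine. This is a sketch under assumptions, not the library's code: merge_ranges and format_class are hypothetical names, ranges are (low, high) pairs of single characters, and special characters are not escaped:

```python
# Sketch of character range merging; NOT hachoir's implementation.

def merge_ranges(ranges):
    """Merge overlapping or adjacent (low, high) character ranges."""
    merged = []
    for low, high in sorted(ranges):
        if merged and ord(low) <= ord(merged[-1][1]) + 1:
            # Overlaps or touches the previous range: extend it.
            prev_low, prev_high = merged[-1]
            merged[-1] = (prev_low, max(prev_high, high))
        else:
            merged.append((low, high))
    return merged

def format_class(ranges):
    """Render merged ranges as a character class such as '[0-356]'."""
    parts = []
    for low, high in merge_ranges(ranges):
        span = ord(high) - ord(low)
        if span == 0:
            parts.append(low)
        elif span == 1:
            # Two neighbours are shorter written out than as 'x-y'.
            parts.append(low + high)
        else:
            parts.append(low + "-" + high)
    return "[" + "".join(parts) + "]"

chars = [("1", "1"), ("3", "3")]
step1 = format_class(chars)                       # '[13]'
chars.append(("2", "2"))
step2 = format_class(chars)                       # '[1-3]'
chars.append(("0", "0"))
step3 = format_class(chars)                       # '[0-3]'
chars.extend([("5", "5"), ("6", "6")])
step4 = format_class(chars)                       # '[0-356]'
chars.append(("4", "4"))
step5 = format_class(chars)                       # '[0-6]'
```

Sorting first lets a single pass detect every overlap or adjacency, so the result does not depend on the order in which characters were added.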

PatternMatching class

Use PatternMatching if you would like to find many strings or regexes in a string. Use addString() and addRegex() to add your patterns:

>>> from hachoir.regex import PatternMatching
>>> p = PatternMatching()
>>> p.addString("a")
>>> p.addString("b")
>>> p.addRegex("[cd]")

And then use search() to find all patterns:

>>> for start, end, item in p.search("a b c d"):
...    print("%s..%s: %s" % (start, end, item))
...
0..1: a
2..3: b
4..5: [cd]
6..7: [cd]

Item is a Pattern object, not the matched string. To be exact, it’s a StringPattern for a string and a RegexPattern for a regex. You can associate a “user” value with each Pattern object:

>>> p2 = PatternMatching()
>>> p2.addString("un", 1)
>>> p2.addString("deux", 2)
>>> p2.addRegex("(trois|three)", 3)
>>> for start, end, item in p2.search("un deux trois"):
...    print("%r at %s: user=%r" % (item, start, item.user))
...
<StringPattern 'un'> at 0: user=1
<StringPattern 'deux'> at 3: user=2
<RegexPattern 't(rois|hree)'> at 8: user=3

You can associate any Python object with an item, not only an integer!
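
To show the general idea of searching several patterns in one pass while carrying user data, here is a sketch built only on the standard re module. The search_all function is a hypothetical helper, not the PatternMatching class, and its behaviour when several patterns could match at the same position (the first listed pattern wins) is an assumption of this sketch rather than a statement about the library:

```python
# Sketch of multi-pattern search with user data; NOT hachoir's implementation.
import re

def search_all(patterns, text):
    """Yield (start, end, source, user) for each match of any pattern.

    `patterns` is a list of (regex_source, user) pairs.
    """
    # One named group per pattern lets a single scan tell them apart.
    combined = re.compile("|".join(
        "(?P<p%d>%s)" % (index, source)
        for index, (source, user) in enumerate(patterns)))
    for match in combined.finditer(text):
        # lastgroup is the name of the outermost group that matched.
        source, user = patterns[int(match.lastgroup[1:])]
        yield match.start(), match.end(), source, user

patterns = [
    ("un", 1),
    ("deux", 2),
    ("(trois|three)", {"languages": ["fr", "en"]}),  # any object works
]
results = list(search_all(patterns, "un deux trois"))
# results[0] == (0, 2, 'un', 1)
# results[1] == (3, 7, 'deux', 2)
# results[2] == (8, 13, '(trois|three)', {'languages': ['fr', 'en']})
```

The third pattern carries a dictionary instead of an integer, mirroring the point that the user value can be any Python object.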