Hachoir for developers¶
Download code and run tests¶
To clone Git repository, type:
git clone https://github.com/vstinner/hachoir
Enter hachoir directory:
To run tests, type tox:
If tox is not already installed, install tox:
python3 -m pip install tox
See also Hachoir on the Travis CI.
Why using Hachoir parsers?¶
Why using slow Python code instead of fast hardcoded C code? Hachoir has many interesting features:
- Autofix: Hachoir is able to open invalid / truncated files
- Lazy: Open a file is very fast since no information is read from file, data are read and/or computed when the user ask for it
- Types: Hachoir has many predefined field types (integer, bit, string, etc.) and supports string with charset (ISO-8859-1, UTF-8, UTF-16, …)
- Addresses and sizes are stored in bit, so flags are stored as classic fields
- Endian: You have to set endian once, and then number are converted in the right endian
- Editor: Using Hachoir representation of data, you can edit, insert, remove data and then save in a new file.
Hachoir Metadata Example¶
from hachoir.parser import createParser from hachoir.metadata import extractMetadata from sys import argv, stderr, exit if len(argv) != 2: print("usage: %s filename" % argv, file=stderr) exit(1) filename = argv parser = createParser(filename) if not parser: print("Unable to parse file", file=stderr) exit(1) with parser: try: metadata = extractMetadata(parser) except Exception as err: print("Metadata extraction error: %s" % err) metadata = None if not metadata: print("Unable to extract metadata") exit(1) for line in metadata.exportPlaintext(): print(line)
hachoir.stream: Stream manipulation¶
To split data we first need is to get data :-) So this section presents the “hachoir.stream” API.
In most cases we work on files using the FileInputStream function. This function takes one argument: a Unicode filename. But for practical reasons we will use StringInputStream function in this documentation:
>>> data = b"point\0\3\0\2\0" >>> from hachoir.stream import StringInputStream, LITTLE_ENDIAN >>> stream = StringInputStream(data) >>> stream.source '<string>' >>> len(data), stream.size (10, 80) >>> data[1:6], stream.readBytes(8, 5) (b'oint\x00', b'oint\x00') >>> data[6:8], stream.readBits(6*8, 16, LITTLE_ENDIAN) (b'\x03\x00', 3) >>> data[8:10], stream.readBits(8*8, 16, LITTLE_ENDIAN) (b'\x02\x00', 2)
First big difference between a string and a Hachoir stream is that sizes and addresses are written in bits and not bytes. The difference is a factor of eight, that’s why we write “6*8” to get the sixth byte for example. You don’t need to know anything else to use Hachoir, so let’s play with fields!
hachoir.field: Field manipulation¶
We will parse the data used in the last section:
>>> from hachoir.field import Parser, CString, UInt16 >>> class Point(Parser): ... endian = LITTLE_ENDIAN ... def createFields(self): ... yield CString(self, "name", "Point name") ... yield UInt16(self, "x", "X coordinate") ... yield UInt16(self, "y", "Y coordinate") ... >>> point = Point(stream) >>> for field in point: ... print("%s) %s=%s" % (field.address, field.name, field.display)) ... 0) name="point" 48) x=3 64) y=2
point is a the root of our field tree. This tree is really simple, it just has one level and three fields: name, x and y. Hachoir stores a lot of information in each field. In this example we just show the address, name and display attributes. But a field has more attributes:
>>> x = point["x"] >>> "%s = %s" % (x.path, x.value) '/x = 3' >>> x.parent == point True >>> x.description 'X coordinate' >>> x.index 1 >>> x.address, x.absolute_address (48, 48)
The index is not the index of a field in a parent field list, ‘1’ means that it’s the second since the index starts at zero.
Parser with sub-field sets¶
After learning basic API, let’s see a more complex parser: parser with sub-field sets:
>>> from hachoir.field import FieldSet, UInt8, Character, String >>> class Entry(FieldSet): ... def createFields(self): ... yield Character(self, "letter") ... yield UInt8(self, "code") ... >>> class MyFormat(Parser): ... endian = LITTLE_ENDIAN ... def createFields(self): ... yield String(self, "signature", 3, charset="ASCII") ... yield UInt8(self, "count") ... for index in range(self["count"].value): ... yield Entry(self, "point") ... >>> data = b"MYF\3a\0b\2c\0" >>> stream = StringInputStream(data) >>> root = MyFormat(stream)
This example presents many interesting features of Hachoir. First of all, you can see that you can have two or more levels of fields. Here we have a tree with two levels:
>>> def displayTree(parent): ... for field in parent: ... print(field.path) ... if field.is_field_set: displayTree(field) ... >>> displayTree(root) /signature /count /point /point/letter /point/code /point /point/letter /point/code /point /point/letter /point/code
A field set is also a field, so it has the same attributes than another field (name, address, size, path, etc.) but has some new attributes like stream or root.
Hachoir is written in Python so it should be slow and eat a lot of CPU and memory, and it does. But in most cases, you don’t need to explore an entire field set and read all values; you just need to read some values of some specific fields. Hachoir is really lazy: no field is parsed before you ask for it, no value is read from stream before you read a value, etc. To inspect this behaviour, you can watch “current_length” (number of read fields) and “current_size” (current size in bits of a field set):
>>> root = MyFormat(stream) # Rebuild our parser >>> (root.current_length, root.current_size) (0, 0) >>> print(root["signature"].display) "MYF" >>> (root.current_length, root.current_size, root["signature"].size) (1, 24, 24)
Just after its creation, a parser is empty (0 fields). When we read the first field, its size becomes the size of the first field. Some operations requires to read more fields:
>>> print(root["point/letter"].display) 'a' >>> (root.current_length, root.current_size) (3, 48)
Reading point needs to read field “count”. So root now contains three fields.
List of field types¶
- Bit: one bit (True/False) ;
- Bits: unsigned number with a size in bits ;
- Bytes: vector of know bytes (e.g. file signature) ;
- UInt8, UInt16, UInt24, UInt32, UInt64: unsigned number (size: 8, 16, … bits) ;
- Int8, Int16, Int24, Int32, Int64: signed number (size: 8, 16, … bits) ;
- Float32, Float64, Float80: IEEE 754 floating point number (32, 64, 80 bits) ;
- Character: 8 bits ASCII character ;
- String: fixed length string ;
- CString: string ending with nul byte (“\0”) ;
- UnixLine: string ending with new line character (“\n”) ;
- PascalString8, PascalString16 and PascalString32: string prefixed with length in a unsigned 8 / 16 / 32 bits integer (use parent endian) ;
Timestamp (date and time):
- TimestampUnix32, TimestampUnix64: 32/64 bits UNIX, number of seconds since the January 1st 1970 ;
- TimestampMac32: 32-bit Mac, number of seconds since the January 1st 1904 ;
- TimestampWin64: 64-bit Windows, number of 1/10 microseconds since the January 1st 1600 ;
- DateTimeMSDOS32 and TimeDateMSDOS32: 32-bit MS-DOS structure, since the January 1st 1980.
- TimedeltaWin64: 64-bit Windows, number of 1/10 microseconds
Padding and raw bytes:
- PaddingBits/PaddingBytes: padding with a size in bits/bytes ;
- NullBits/NullBytes: null padding with a size in bits/bytes ;
- RawBits/RawBytes: unknown content with a size in bits/bytes.
- SubFile: a file contained in the stream ;
To create your own type, you can use:
- GenericInteger: integer ;
- GenericString: string ;
- FieldSet: Set of other fields ;
- Parser: The main class to parse a stream.
Field(parent, name, size=None, description=None)¶
Method called by code like “if field: (…)”. Always returns True
Check whether a field set contains the provided field.
Parameters: key (str) – The path to the field. Returns: bool
__init__(parent, name, size=None, description=None)¶
Set default class attributes, set right address if None address is given.
- parent (Field) – Parent field of this field
- name (str) – The name of the field, it must be unique in parent. If it ends with , end will be replaced with “[new_id]” (eg. “raw” becomes “raw”, next will be “raw”, and then “raw”, etc.)
- size (int, optional) – Size of the field in bits. If None then it will be computed later.
- address (int, optional) – Address in bit relative to the parent absolute address.
- description (str, optional) – String description
Absolute address (from beginning of stream), in bits.
Relative address to parent address, in bits.
Override in derived classes to provide
Informal description of this field. Cached.
The description of a field may provide a general summary of its usage or for field sets it can be used to give a short indication of the contents without having to expand the node.
Short, human-friendly string representing field contents.
- key (str) – relative or absolute path for the desired field.
- const (bool) – For field sets, whether to consume additional input to find a matching field.
The field matching the provided path.
Returns: an input stream containing the field content. Return type: InputFieldStream
bool: Check if field has a value.
index of the field in parent field set, starting from 0.
True if this field contains other fields (ie. is a field set), False otherwise.
Field name, unique in its parent field set list.
Parent of this field.
Full path of this field starting from the root field.
Represents raw field content
Parameters: createInputStream (
callback(cis, **args)) – Function to use in place of
_createInputStream(). Receives the previous value of
_createInputStreamas its first argument, for chaining.
Size of this field, in bits. Cached.
(optional) Helper to compute field size.
- May have types:
None: field size is computed dynamically. int: field size, in bits. callable: function that receives the same arguments as the constructor,without
Value of field.
Field set class¶
Read only attributes:
- endian: value is BIG_ENDIAN or LITTLE_ENDIAN, the way the bits are written in input stream <~~ can be replaced
- stream (InputStream): input stream
- root (FieldSet): root of all fields
- eof (bool): End Of File: are we at the end of the input stream?
- done (bool): The parser is done or not?
Read only and lazy attributes:
- current_size (long): Current size in bits
- current_length (long): Current number of children
- connectEvent(event, handler, local=True): connect an handler to an event
- raiseEvent(event, *args): raise an event
- reset(): clear all caches but keep its size if we know it
- setUniqueFieldName(): for field with name ending with “”, replaces “” with an unique identifier like, “item” => “item”.
- seekBit(address, …): create a field to seek to specified address or returns None if we are already there
- seekByte(address, …): create a field to seek to specified address or returns None if we are already there
- replaceField(name, fields): replace a field with one or more fields <~~~ I don’t like this method :-(
- getFieldByAddress(address, feed=True): get the field at the specified address
- writeFieldsIn(old, address, new): helper for replaceField() <~~~ can be an helper?
- getFieldType(): get the field type as a short string. The type may contains extra informations like the string charset.
- array(): create a FakeArray to easily get a field by its index (see FakeArray API to get more details)
- __len__(): number of children in the field set
- readFirstFields(number): read first ‘number’ fields, returns number of new fields
- readMoreFields(number): read more ‘number’ fields, returns number of new fields
- __iter__(): iterate over children
- createFields(): main function of the parser, create the fields. Don’t call this function directly.
Change Hachoir verbosity, make Hachoir quiet:
from hachoir.core import config config.quiet = True