libpbf

File format based on Google protocol buffers. Suitable for storage of waveforms or other binned data.


Welcome to the protocol buffer file format documentation

This file format is designed for the storage of binned data, for example waveforms. Some of the key design features are:

  • Simple interface for writing and reading
  • Fast, dynamic compression for lightweight files
  • Internal data sorting allowing parallel input
  • Clean, easily extendable code

This short document should teach you all you need to know to use the class.

Please note: this code is currently under development. Most features should function as described here, but not everything has been fully implemented or tested. If you notice something wrong, file an issue report on GitHub.

Installation

Fetch the code from GitHub.

git clone https://github.com/coderdj/libpbf

There are a few dependencies. You need the basic build tools (a compiler, make, etc.), as well as the protocol buffer and snappy libraries from Google. On Ubuntu you can install these with:

apt-get install build-essential libsnappy-dev libprotobuf-dev

Note that you also need libpthread, but if you are on Ubuntu (and probably other distros) this will likely already be installed.

Now go to the top-level directory and type 'make'. The library should appear in the current directory. You can install the library with 'make install'. This puts the library into /usr/local/lib and the includes into /usr/local/include/pbf. If you want it somewhere else, set the PREFIX variable in the Makefile.

Writing to Files

The basic procedure for writing to files is:

  1. Open the file
  2. Open an event
  3. Put data into the event
  4. Close the event
  5. Go back to step 2 until you're out of data
  6. Close the file

You can have multiple events open at the same time. The events will be placed into an output buffer and sorted by time stamp when they are closed. Data within an event is sorted by channel number (or, optionally, by module/channel). Raw data within a channel can again be sorted by time stamp.
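
The timestamp sorting of closed events described above can be pictured with a few lines of standard C++. This is a conceptual sketch only, not the library's actual internals; the struct and function names are made up for illustration:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch: closed events wait in a buffer and are ordered by
// timestamp before being written, so callers may close events out of order.
struct BufferedEvent {
    uint64_t timestamp;   // event time stamp
    int handle;           // the handle the event was opened with
};

// Stable sort keeps insertion order for events that share a timestamp.
std::vector<BufferedEvent> sort_buffer(std::vector<BufferedEvent> buffer) {
    std::stable_sort(buffer.begin(), buffer.end(),
                     [](const BufferedEvent &a, const BufferedEvent &b) {
                         return a.timestamp < b.timestamp;
                     });
    return buffer;
}
```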

Let's look at each step individually. The code used here is also illustrated with an example file builder in the examples directory.

First, opening the file is done simply with the following code:

pff_output outfile;    // note: "pff_output outfile();" would declare a function, not an object
int s = outfile.open_file(outputPath, outputOptions);     //outputPath and outputOptions are strings
if(s!=0) cout<<"Error opening file"<<endl;

or, using the constructor:

pff_output *outfile;
try{
    outfile = new pff_output(outputPath, outputOptions);
} 
catch(...){
    cout<<"Error opening file!"<<endl;
}

The output path provided should be a "stub". The program will automatically append a suffix and extension onto the file name. So if you want your data stored in a subdirectory of the current directory called "data" with name "waveform" then make this string "data/waveform".

The options string allows setting a few options for how the files are saved. Options are separated with a ':' character. Possible options are:

  • z - zip output. This compresses the data blocks internally using snappy. When reading the file again later you don't have to know if it was compressed or not. This information is stored internally.
  • pz - data is already zipped. If you send compressed data to the file then you can set this option to have it flagged as compressed and automatically extracted when the file is read. Note that only snappy is supported, so if you compressed with something other than snappy do not set this option, you will have to decompress manually.
  • n{int} - gives the number of events per subfile. For one measurement it might be convenient to split the data into many subfiles with, for example, 1,000 events each. In that case set this option to 'n1000'. The data will then be stored in files of the form {filepath}{filenumber}.pff, where filenumber is a 6-digit number indicating which block of 1,000 events is in the file. If you set this number so low or run so long that you exceed 999,999 files, all remaining data goes into the last file. Setting this number to zero puts all data into one file.
  • b{int} - sets the maximum number of open events in the internal buffer. If you exceed this number of open events the library will stop assigning event handles until some events are closed. This can be useful for managing memory if you anticipate writing into a large number of events simultaneously. It is left as a user option since the user knows best whether they want to write into 10,000 open events of 1 kB each or can only afford a few open events that are many MB in size. The default of 100 is probably fine if this confuses you.
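
As an illustration of the format, an options string combining compression, 1,000-event subfiles, and a 200-event buffer would be "z:n1000:b200". Here is a small sketch of splitting such a string on ':' (not the library's actual parser, just a demonstration of the format):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a ':'-separated options string such as "z:n1000:b200" into tokens.
std::vector<std::string> split_options(const std::string &options) {
    std::vector<std::string> tokens;
    std::istringstream stream(options);
    std::string token;
    while (std::getline(stream, token, ':'))
        if (!token.empty())
            tokens.push_back(token);
    return tokens;
}
```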

Once your file is open, start adding events. You can open a new event using the following code:

int eventHandle=0;
int success = outfile.create_event(event_timestamp, eventHandle);
if(success!=0) cout<<"Could not create event."<<endl;

Here event_timestamp lets you specify the time of the event as an unsigned 64-bit integer. If it is set to zero the event is given the timestamp of the earliest data entry. If nothing is provided the timestamp is left at zero when the event is saved. This is not recommended, though the events will still be assigned unique ID numbers. If you do provide timestamps, the library can sort the events by these times when reading them out.

The event handle is passed by reference to the create_event function and is returned as a unique integer. This number will be used for adding data to this specific event in the future. This handle is valid until the event is closed. After that point it will be recycled for future events.
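
The handle life cycle can be pictured as a simple free list. This is a hypothetical sketch of the recycling behavior, not the library's actual implementation:

```cpp
#include <queue>

// Hypothetical sketch: handles of closed events return to a free list
// and are handed out again before any new handle is created.
class HandlePool {
    std::queue<int> free_handles;   // handles of closed events
    int next_handle = 0;            // next never-used handle
public:
    // create_event would fetch a handle roughly like this
    int acquire() {
        if (!free_handles.empty()) {
            int handle = free_handles.front();
            free_handles.pop();
            return handle;          // recycled handle
        }
        return next_handle++;       // brand-new handle
    }
    // close_event would return the handle for reuse
    void release(int handle) { free_handles.push(handle); }
};
```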

To add data to an event, use the add_data function. The data is a binary dump, represented by a pointer to a char array. Please note that the program will make a copy of this data for its own use and also remove this copy when it is finished. The user is responsible for freeing any memory allocated by his own program. It is safe to delete or overwrite the data pointer immediately after adding the data to the event.

int success = outfile.add_data(eventHandle,channel,module,(char*)data,dataLength);
if(success==0) cout<<"Data added to event with handle "<<eventHandle<<endl;

Here channel and module are integers representing the specific data channel. The module parameter is optional and can be omitted (a second add_data function exists with only 4 arguments). The dataLength is a size_t type representing the length of the data field in bytes. The eventHandle must point to an open event.

If you have a time stamp for this particular block of data, this can also be added at the end of the function call as follows:

int success = outfile.add_data(eventHandle,channel,module,(char*)data,dataLength,timestamp_data);
if(success==0) cout<<"Data added with time stamp "<<timestamp_data<<endl;

The variable timestamp_data is an unsigned 64-bit integer. Even if the data is added outside of temporal order, it will be sorted within the channel by time stamp if this field is provided. This way when the file is read it will be read in time order.

After all data has been added to an event, it should be closed. This is done by invoking close_event as follows.

outfile.close_event(eventHandle, true);

The second argument is a Boolean telling the program whether it should write the event right away. If your events are pre-sorted or if you don't care about the order, this can always be true. If events can arrive out of time order and should be sorted, set it to false (or omit the argument, since false is the default). The event will then be stored in memory. To write all events stored in memory, use the write call.

outfile.write();

The write call can also optionally take a 64-bit integer as an argument. This is a timestamp and tells the program to write all events up to this timestamp to file. Please note that the internal buffer will grow without bound if the user never calls write. If the buffer gets too large the program will stop accepting new events until it has been written out.
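
The timestamped write can be pictured as draining a timestamp-ordered buffer up to a cutoff. Again this is a conceptual sketch; the helper name and container choice are made up for illustration:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: buffered events live in a container ordered by
// timestamp; write(cutoff) drains everything with timestamp <= cutoff.
std::vector<std::string> flush_up_to(std::multimap<uint64_t, std::string> &buffer,
                                     uint64_t cutoff) {
    std::vector<std::string> written;
    auto end = buffer.upper_bound(cutoff);           // first event after cutoff
    for (auto it = buffer.begin(); it != end; ++it)
        written.push_back(it->second);               // "write" in timestamp order
    buffer.erase(buffer.begin(), end);               // drop flushed events
    return written;
}
```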

Note that calling write is not necessary if you have always been closing events with "true" in the second argument.

To close the file when everything is finished use:

outfile.close_file();

Reading Files

Files are read in event by event. The main procedure is this:

  1. Open a file
  2. Fetch the first event
  3. Read all data you want from this event
  4. Go back to step 2 for all subsequent events
  5. Close the file when done

Note that you don't have to provide any attributes of the file (zipped or not, how many events per sub-file, etc.) since they are defined in the file header and automatically read.

To open a file, use the following syntax:

pff_input infile;
int success = infile.open_file(path);
if (success==0) cout<<"File opened successfully"<<endl;
else cout<<"File at "<<path<<" could not be opened."<<endl;

Files are read from start to finish in order. The way it works is that the pff_input object is told to look at an event. The event is unpacked and stored in memory. Then all data from that event can be accessed as needed (in any order and multiple times if desired). After that the user tells the object to go to a different event (either the next event or an event with a given ID).

A loop where events are accessed in order might look like this:

pff_input infile;
if(infile.open_file(path)!=0) return;
while(infile.get_next_event()){
//do some stuff
//
}

On the other hand, a specific event can be accessed as follows:

if(infile.get_event(eventID)!=0)
    cout<<"Event id "<<eventID<<" does not exist."<<endl;

Here eventID is a long long int. Note that event IDs must be accessed in order, as the file is scanned from start to finish. If the event ID is not found, the file has probably been scanned to the end and should be closed and reopened.

Data within an event is organized by channels. To loop through all channels and pull data from them, use something like the following:

for(int channel = 0; channel<infile.num_channels(); channel++) {
     char *data;
     unsigned int datasize;
     long long int datatime;
     for(int dataindex = 0; dataindex<infile.num_data(channel); dataindex++) {
          infile.get_data(channel,dataindex,data,datasize,datatime);
          //process data
     }
}

The data can also be accessed by module/channel ID. Additional functions exist for accessing metadata for the event or from the file header. See the pbf_input.hh file for a full list of functionality.

When you are done reading a file, it's a good idea to close it again. This is done simply by:

infile.close_file();

Please note that the data blocks accessed in this way are identical to what you put in. Any packing done by the file format is reversed when the data is returned. For example, if you provide uncompressed data and ask pff_output to compress it, it will be automatically decompressed and returned uncompressed.

A full example of reading from files and plotting the resulting data is given in examples/WaveformViewer. This example requires the ROOT software package from CERN (http://root.cern.ch). In the example, the stored data are waveforms where the binary dump is a series of 16-bit (2 byte) samples.
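
Since the library returns the raw bytes exactly as stored, decoding WaveformViewer-style data is just a reinterpretation of the buffer. A minimal sketch, assuming the samples were written in the machine's native byte order (the function name here is made up for illustration):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Reinterpret a raw data block as a series of 16-bit samples.
std::vector<int16_t> decode_samples(const char *data, std::size_t datasize) {
    std::vector<int16_t> samples(datasize / sizeof(int16_t));
    // memcpy avoids the alignment and strict-aliasing pitfalls of a plain cast.
    std::memcpy(samples.data(), data, samples.size() * sizeof(int16_t));
    return samples;
}
```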

Authors and Contributors

Written by Daniel Coderre (@coderdj), 2014.

Support or Contact

Having trouble? File an issue report at http://github.com/coderdj/libpbf. Want to help? Contact me on github (@coderdj). Want to buy me a beer? Come visit Bern.