A data model and file format to represent and store high frequency energy monitoring and disaggregation datasets

Scientific Reports volume 12, Article number: 10284 (2022)
There is a generalized consensus in the Non-Intrusive Load Monitoring research community on the importance of public datasets for improving this research field. Still, despite the considerable efforts to release public data, what is currently available suffers from serious issues, among which is the lack of widely accepted data models and common interfaces to access the currently available and future datasets. This paper proposes the Energy Monitoring and Disaggregation Data Format (EMD-DF64). EMD-DF64 is a data model, file format, and application programming interface developed to provide a unique interface to create, manage, and access high-frequency (≥ 1 Hz) electric energy consumption datasets. More precisely, the present paper describes the data model and its respective implementation, which was done by leveraging the well-known Sony WAVE64 format that supports the storage of audio data and metadata annotations.
Non-Intrusive Load Monitoring (NILM), or more generally load/energy disaggregation, is a promising approach for enabling a cost-effective way of providing detailed information about the energy consumption of individual appliances1. The availability of public datasets opens new prospects for this technology, helping researchers create more systematic evaluation processes that can be used across different approaches, similar to what happened in other application domains of machine learning like face and speech recognition2.
Nevertheless, despite the tremendous efforts in releasing public datasets as reviewed in3,4, not many steps have been taken towards homogenizing the way these are made available and accessible to the community. In fact, several data formats have been employed to store NILM datasets, with comma-separated values (CSV) emerging as one of the most used formats given the simplicity of the data representation, e.g.,5,6,7. Yet, many other formats were adopted, namely HDF58, relational databases9, as well as audio/video formats such as FLAC10,11, WAVE12,13, and Matroska Media Containers14. A direct consequence of the variety of data formats is that in order to use a dataset, researchers first have to understand the underlying structure of the data and produce code to interface with their algorithms. Indeed, in15 the authors identified the lack of convergence regarding a storage file format as one of the main challenges to NILM dataset interoperability.
Against this background, some research efforts have been devoted to homogenize existing datasets and provide a single interface to run evaluations. Two examples of such efforts are the Metadata proposal for NILM datasets16, and the Non-Intrusive Load Monitoring Toolkit (NILMTK)17. The two projects have been combined to form the NILMTK-DF, the data format supported by NILMTK that relies on HDF5, and the metadata structure proposed in16. Another example is the NilmDB project18, which provides a generalized user interface to access, query, and analyze large time-series datasets, including NILM data. The underlying timestamped data is organized hierarchically and must follow a data stream layout (the number of columns and corresponding data type) determined when a data stream is created. Besides the time series data, NilmDB also supports attaching metadata in the form of key/value pairs.
In19 we proposed the Energy Monitoring and Disaggregation Data Format (EMD-DF), a standard data model and file format developed with the intent to represent and store high-frequency datasets (i.e., sampling rate greater or equal to 1 Hz). In EMD-DF, the datasets are represented using the well-known Waveform Audio File Format (WAVE) (see http://fileformats.archiveteam.org/wiki/WAV). The idea of using an audio-based format to represent electric energy data was inspired by previous works where sound-cards are used to perform the acquisition of the current and voltage signals10,20, and have been shown to perform considerably well given the similarities to electricity waveforms21. Furthermore, being an extension of the Resource Interchange File Format (RIFF) (see http://fileformats.archiveteam.org/wiki/RIFF), the WAVE format has several properties that are desirable in the context of NILM datasets. More particularly:
The consumption data, ground-truth, and metadata are all stored in the same file, thus limiting the number of artifacts to be managed;
The resulting files are optimized to have very little overhead. Furthermore, since the sampling rate is fixed, only the initial timestamp is necessary to obtain the time of the remaining samples;
It is an uncompressed lossless format, i.e., all the original values of the data are kept untouched. Furthermore, it is fully compatible with audio-compression libraries such as WavPack (see https://www.wavpack.com/);
It is possible to extend the format at any time with additional chunks without breaking the file consistency, i.e., a WAVE file with additional chunks will consistently be recognized as a WAVE file. Hence backward compatibility with previously developed applications is guaranteed;
Finally, there is a diversity of mature programming interfaces to work with audio content, thus facilitating dataset manipulation and portability across different programming environments.
However, extending the WAVE file format comes with two major limitations:
To avoid having to store individual timestamps for each sample, the data must not contain missing values. Therefore, missing data is handled separately by: (i) resampling whenever possible, i.e., when the missing samples are few and sparse; (ii) breaking the dataset into different files when large blocks of data are missing; or (iii) resampling and breaking into multiple files when the missing data includes both sparse samples and large gaps.
The WAVE specification uses a 32-bit unsigned integer to represent the file size header. As such, dataset files are limited to a maximum of 4 GB. This is equivalent to roughly 248 days of two 16-bit channels sampled at 50 Hz. Still, there are datasets with sample rates in the order of kHz. For example, BLUED22 was sampled at 12 kHz, meaning that each data file can only represent around 16 h of the three 16-bit channels (i.e., two current phases and one voltage phase).
In this paper, we present a 64-bit version of EMD-DF (EMD-DF64) by including support for the Sony WAVE64 (see http://fileformats.archiveteam.org/wiki/Sony_Wave64) file format. Ultimately, this leads to a maximum file size of approximately 16 exabytes, which is equivalent to roughly 21 years of three 16-bit channels sampled at 4.3 GHz. Furthermore, EMD-DF64 was developed to support missing data by default, which is achieved by adding chunks that represent such gaps.
In the remainder of this paper, we first present a data model that supports EMD-DF64. Then, we present the data structure, which is an extension of the 64-bit Sony Wave64 file container. Next, we thoroughly describe how the data model and data structure are combined to form the EMD-DF64 file format. Finally, we discuss how the proposed data format contributes to the ongoing efforts to homogenize electricity consumption datasets.
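The size limits quoted above follow from simple arithmetic on the width of the file size field. The sketch below is only an illustration: the function name and parameters are hypothetical, and it ignores header and metadata overhead, so the results are upper bounds.

```python
def max_duration_seconds(size_field_bits, channels, sample_bytes, rate_hz):
    """Upper bound on recording length when the file size is stored
    in an unsigned integer of `size_field_bits` bits (overhead ignored)."""
    max_bytes = 2 ** size_field_bits
    bytes_per_second = channels * sample_bytes * rate_hz
    return max_bytes / bytes_per_second


# 32-bit WAVE: two 16-bit channels at 50 Hz, expressed in days
wave_50hz_days = max_duration_seconds(32, 2, 2, 50) / 86_400

# 32-bit WAVE: three 16-bit channels at 12 kHz (BLUED-like), in hours
wave_blued_hours = max_duration_seconds(32, 3, 2, 12_000) / 3_600

# 64-bit size field: three 16-bit channels at 4.3 GHz, in years
w64_years = max_duration_seconds(64, 3, 2, 4.3e9) / (365.25 * 86_400)

print(wave_50hz_days, wave_blued_hours, w64_years)
```

Under these assumptions the 32-bit cases come out at about 248.6 days and 16.6 hours, and the 64-bit case at a little over two decades, which is in line with the orders of magnitude given in the text.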
The data model that supports EMD-DF and EMD-DF64 is comprised of several data entities that should be present in a dataset to make it suitable for NILM research. Figure 1 shows an illustration of the proposed data model using the Unified Modeling Language (UML) (see https://www.uml.org/) notation. Overall, there are three main data entities: (1) consumption, (2) ground-truth, and (3) annotations. These are described next.
EMD-DF64: Data model overview.
The consumption entity represents all the data elements that refer to energy consumption time-series. Consumption data can be of two different types: (i) raw waveforms, i.e., current and voltage; or (ii) processed waveforms, i.e., different metrics like real and reactive power. Furthermore, consumption data can represent both aggregated and disaggregated consumption. Finally, the latter can represent the consumption of individual appliances (e.g., a kettle) or an individual circuit (e.g., kitchen outlets that may or may not contain the kettle). It is important to remark that all the entities that refer to consumption data are optional (cardinality of 0 or more) to cover as many variations of NILM datasets as possible. For example, BLUED contains only aggregated consumption data, while SustDataED13 contains aggregated and individual appliance consumption. Likewise, RAE23 contains data for aggregated, individual appliance, and individual circuit consumption. Finally, it is also possible to find datasets that do not contain any form of aggregated consumption, e.g., PLAID24, Tracebase25 and GREEND26.
On the right-hand side of Fig. 1 is the ground-truth entity. This entity is mandatory on a NILM dataset, and can be of four different types: (i) individual appliance consumption, (ii) individual circuit consumption, (iii) appliance activity, and (iv) user activity. Individual appliance and individual circuit consumption are a special type of consumption data used to train, test, and validate event-less approaches. On the other hand, appliance activities provide information about the power events that exist on a dataset and are required to train, validate, and test event-based approaches.
We have also introduced the concept of user activities, which in straightforward terms refer to actions that people perform involving the use of electric appliances, e.g., doing the laundry (involves clothes washer, clothes dryer, and iron) or preparing a meal (oven, stove, microwave, aid choppers, and blenders). Such user activities are important to evaluate different NILM application domains, e.g., Non-Intrusive Activity Detection (NIAD)27, and the detection of abnormal consumption behaviors28. It is important to note that one individual appliance activity can only be associated with one user activity (cardinality of 0 or 1). Otherwise, the total consumption of the user activities will be larger than the total consumption of the individual appliances, thus introducing inconsistency to the data model.
Lastly, we have the annotations entity. There are three types of optional annotations: (1) RIFF Meta, which are the default metadata chunks defined by RIFF, (2) comments, consisting of free text, and (3) user-defined metadata. The latter can be either local metadata, which refers to specific samples in the consumption data, or custom metadata, i.e., annotations defined by the dataset creator that can serve multiple purposes. For example, in the current implementation, it is possible to provide a list with details for individual appliances and user activities and embed the annotations from the NILM metadata project.
In EMD-DF64, the different data entities are represented by extending the 64-bit Sony WAVE64 (SW64) file format. SW64 is an application of the RIFF in which the file contents are grouped and stored in separate chunks. Each chunk consists of three components: (1) chunk identifier—128-bit globally unique identifier (GUID), e.g., “fmt” and “data”; (2) an unsigned, little-endian 64-bit integer representing the length of the chunk; (3) the chunk data. Finally, like all other RIFF-based formats, if a chunk’s data size is not even, it is padded by 1 byte to make it so.
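The three-part chunk layout described above (128-bit identifier, 64-bit little-endian length, data) can be read with a few lines of Python. This is a minimal sketch, not the library's parser: the identifier used in the example is a placeholder whose first four bytes spell "riff" and whose remaining bytes are zero, not a real WAVE64 GUID, and how the length field relates to the payload size is left to the format specification.

```python
import struct

# 16-byte identifier followed by an unsigned 64-bit little-endian length
CHUNK_HEADER = struct.Struct("<16sQ")


def read_chunk_header(buf, offset=0):
    """Return (identifier_bytes, length) for the chunk header at `offset`."""
    guid, length = CHUNK_HEADER.unpack_from(buf, offset)
    return guid, length


# Placeholder identifier for illustration only (NOT an actual WAVE64 GUID)
PLACEHOLDER_GUID = b"riff" + bytes(12)

header = CHUNK_HEADER.pack(PLACEHOLDER_GUID, 1024)
guid, length = read_chunk_header(header)
print(guid[:4], length)
```

Each header occupies 24 bytes (16 for the identifier plus 8 for the length), compared with the 8 bytes of a classic RIFF chunk header.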
A WAVE64 file is composed of several chunks, four of which are mandatory. Furthermore, in a correctly formatted WAVE64 file, the first four bytes (GUID) must always spell out “riff” (lower-case). Some of the W64 chunks are briefly described in Table 1, with a particular focus on those reused in the EMD-DF data model. For more details about the RIFF standard, please refer to the original project documentation in29.
The EMD-DF64 file format is based on 20 chunks. One directly inherited from the RIFF standard (Info), nine from the WAVE64 format (Format, Data, Cue, List, Associated Data List, Label, Note and Labeled Text Chunk), and the remaining 10 are custom chunks. The chunk structure of EMD-DF64 is illustrated in Fig. 2. Table 2 provides a description of the 10 custom chunks.
Chunk structure of the EMD-DF64 file format.
Next, we present how the different chunks that compose the EMD-DF64 data structure are combined to create a dataset. We first describe how the data format is defined in the Format and Config chunks. Then we show how the power measurements are stored and supplemented with the different embedded annotations.
The waveform data (i.e., consumption data) must be defined in the Format chunk. This is inherited from the W64 format and consists of the following fields: (i) sample size in bits (8, 16, 24, 32 or 64 bits); and (ii) number of individual channels (greater or equal to 1).
Additionally, all the sub-chunks defined in the Config list chunk are mandatory. More precisely: (i) timezone (the time zone of the location where the data was collected), (ii) timestamp (the Unix timestamp of the first sample in the waveform data), (iii) sampling rate (the number of samples per second in the waveform data), and (iv) calibration constants (zero or one for each waveform channel). The calibration constant chunks are associated to each channel in ascending order. For the model to be valid, the number of calibration chunks must be zero (i.e., no calibration is needed) or equal to the number of individual channels.
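The validity rule for calibration constants stated above is easy to check programmatically. The following is a hedged sketch with a hypothetical function name; it only encodes the rule as described (zero constants, or exactly one per channel) and is not taken from the EMD-DF64 API.

```python
def calibration_is_valid(num_channels, calibration_constants):
    """True when there are no calibration constants, or exactly one per channel."""
    if num_channels < 1:
        raise ValueError("a file must declare at least one channel")
    count = len(calibration_constants)
    return count == 0 or count == num_channels


# Two channels (e.g., current and voltage) with one constant each: valid
print(calibration_is_valid(2, [0.0023, 0.0055]))
# Two channels but a single constant: invalid
print(calibration_is_valid(2, [0.0023]))
```

The constants in the example are arbitrary illustrative values.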
The consumption data are represented using the Data, Missing Data List, and Missing Data chunks. The Missing Data List wraps sequences of Missing Data chunks to support datasets with missing data.
Waveform data are stored uncompressed in Data chunks. If only one metric needs to be represented (this is the case in most individual appliance and circuit ground-truth data), the samples are stored consecutively; otherwise the samples are stored interleaved. Each sample S is represented by a value between − 1 and 1. Samples are stored in little-endian format (i.e., the least significant byte is stored first). The bits that represent the sample amplitude are stored in the most significant bits of S, and the remaining bits are set to zero.
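To make the storage layout concrete, the sketch below interleaves two channels of normalized samples and packs them as 16-bit little-endian integers. It is an illustration under assumptions: the function names are hypothetical, and the full-scale factor of 32767 is a common audio convention chosen here, not a value given in the text.

```python
import struct


def encode_interleaved(channels, full_scale=32767):
    """Pack equal-length channels of floats in [-1, 1] as interleaved
    16-bit little-endian samples (frame 0 of every channel, then frame 1, ...)."""
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("all channels must have the same number of samples")
    values = []
    for i in range(length):
        for ch in channels:
            sample = ch[i]
            if not -1.0 <= sample <= 1.0:
                raise ValueError("samples must lie between -1 and 1")
            values.append(int(round(sample * full_scale)))
    return struct.pack("<%dh" % len(values), *values)


def decode_interleaved(data, num_channels, full_scale=32767):
    """Inverse of encode_interleaved: returns one list of floats per channel."""
    values = struct.unpack("<%dh" % (len(data) // 2), data)
    return [[v / full_scale for v in values[ch::num_channels]]
            for ch in range(num_channels)]


current = [0.0, 1.0]
voltage = [-1.0, 0.0]
encoded = encode_interleaved([current, voltage])
decoded = decode_interleaved(encoded, 2)
```

With a single channel the same functions store the samples consecutively, which matches the single-metric case described above.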
Intervals with missing data are represented using the Missing Data chunk. Each of these chunks contains a JSON string with information about the timestamp when data is again available, and the number of the sample where this happens.
Since we do not store the timestamp of each individual sample, this has to be calculated in run-time. This is done using Eq. (1), which returns a Unix timestamp in milliseconds:
Where current_sample is the position of the sample of interest and initial_sample is the position of the first sample (if there are no missing data chunks the first sample is 1, otherwise it is the initial sample of the corresponding missing data chunk). initial_unix_timestamp is the Unix timestamp of the first sample (if there are no missing data it is given by the Timestamp chunk, otherwise it is calculated from the corresponding missing data chunk). Finally, f is the sampling rate of the waveform data.
Conversely, it is possible to convert a unix timestamp to a sample position. This is done using Eq. (2):
Where actual_timestamp is the timestamp in milliseconds to be mapped to an audio position, initial_timestamp is the timestamp in milliseconds of the first sample in the dataset, and f is the sampling rate of the waveform data.
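One reading of Eq. (1) and Eq. (2) that is consistent with the variable definitions above can be sketched as follows. This is an interpretation, not the authors' exact formulas: it assumes all timestamps are expressed in milliseconds, and it adds the position of the first sample (1 by default) in the second function so that the two conversions are inverses of each other.

```python
def sample_to_timestamp_ms(current_sample, initial_sample,
                           initial_unix_timestamp_ms, f):
    """Unix timestamp (ms) of `current_sample`, given the position and
    timestamp of the first sample of the segment and the sampling rate f (Hz)."""
    elapsed_ms = (current_sample - initial_sample) * 1000.0 / f
    return initial_unix_timestamp_ms + elapsed_ms


def timestamp_ms_to_sample(actual_timestamp_ms, initial_timestamp_ms, f,
                           initial_sample=1):
    """Sample position closest to `actual_timestamp_ms` (inverse mapping)."""
    elapsed_samples = (actual_timestamp_ms - initial_timestamp_ms) * f / 1000.0
    return initial_sample + round(elapsed_samples)


t0 = 1_000_000          # hypothetical timestamp of the first sample, in ms
f = 50                  # sampling rate in Hz
ts = sample_to_timestamp_ms(51, 1, t0, f)
pos = timestamp_ms_to_sample(ts, t0, f)
```

When a file contains missing data, the same functions would be applied per segment, using the initial sample and timestamp recorded for the segment that contains the sample of interest.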
Individual appliance activities correspond to the changes in the power consumption that are triggered by different appliances turning ON, OFF, or changing their working mode (e.g., low to high).
Individual appliance activities (i.e., power events) are embedded in the file using the Cue, Associated Data List and Label chunks. This is done as follows: (1) for each power event, an entry is added in the Cue chunk; (2) for each Cue chunk entry, a Label chunk is created and added to the Associated Data List chunk.
Each label chunk consists of a sample position in the waveform data and a JSON formatted string with the details of the respective activity. The sample position is calculated from the power event timestamp using Eq. (2). For example, the JSON in Listing 2 corresponds to a refrigerator activity that was mapped to position 19394633:
User activities refer to actions that are performed involving the use of electric appliances, e.g., doing the laundry (involves clothes washer, clothes dryer and iron) or preparing a meal (oven, stove, microwave, aid choppers, blenders, etc.).
User activities are supplemented in the consumption data using the Cue, Associated Data List and Labeled Text chunks. This is done as follows: (1) For each user activity, an entry is added in the Cue chunk, (2) for each Cue chunk entry, a Labeled text chunk is created and added to the Associated Data List chunk.
Each labeled text chunk consists of a sample position in the waveform data, a duration in samples, and a JSON formatted string with the details of the respective activity. Listing 3 shows a JSON representing the “working on the computer” activity that involves using the desktop computer (App_ID: 1101), one monitor (App_ID: 1109) and a printer (App_ID: 1203). The duration in samples is obtained by subtracting the start from the end position of the activity. The sample positions are calculated from the timestamps using Eq. (2).
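The position and duration of a labeled text entry can be derived from the activity's start and end timestamps as described above. The sketch below is illustrative only: the function names, the dictionary layout, and the JSON keys ("Label", "App_IDs") are hypothetical and do not reproduce the fields of Listing 3, and timestamps are assumed to be in milliseconds.

```python
import json


def to_sample(timestamp_ms, initial_timestamp_ms, f):
    """Offset in samples from the first sample for a timestamp in ms."""
    return round((timestamp_ms - initial_timestamp_ms) * f / 1000)


def labeled_text_entry(start_ms, end_ms, initial_timestamp_ms, f, details):
    """Build a (hypothetical) in-memory view of a labeled text entry."""
    start = to_sample(start_ms, initial_timestamp_ms, f)
    end = to_sample(end_ms, initial_timestamp_ms, f)
    if end < start:
        raise ValueError("activity ends before it starts")
    return {
        "position": start,               # sample where the activity begins
        "duration": end - start,         # end position minus start position
        "text": json.dumps(details),     # JSON string with activity details
    }


# Hypothetical payload; the appliance identifiers are those mentioned in the text
details = {"Label": "working on the computer", "App_IDs": [1101, 1109, 1203]}
entry = labeled_text_entry(2_000, 62_000, 0, 50, details)
```

Here a one-minute activity starting two seconds into a 50 Hz recording maps to position 100 with a duration of 3000 samples.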
Since EMD-DF64 is a direct application of the RIFF standard, it fully supports all the default RIFF metadata sub-chunks that are defined in the Info chunk. These are used to supplement general annotations, and include among others, Creator, Commissioner, Copyright, and Keywords. The RIFF metadata chunks supported in EMD-DF64 are listed in Table 3.
Local metadata are used to supplement specific consumption samples with custom annotations, for instance the instant when a new appliance is added to or removed from the electric circuit, as shown in JSON Listing 4.
A note chunk consists of a sample position and a JSON formatted string. Notes are created using the Cue, Associated Data List and Note chunks as follows: (1) for each local annotation, an entry is added to the Cue chunk; (2) for each entry in the Cue chunk, a Note chunk is added to the Associated Data List chunk.
These custom chunks can be used to enrich datasets with custom metadata. They are added using the Annotation List and Metadata chunks.
The content of such chunks does not follow any specific rule, yet it must be encoded in JSON and always include the ID and Label fields. EMD-DF64 fully supports three different custom metadata types: (i) appliances; (ii) user activities; and (iii) NILM metadata project annotations.
The appliances metadata chunk keeps a list of the appliances that co-exist in the dataset, including appliance characteristics like brand, model, energy consumption and energy efficiency rating. Listing 5 shows a possible JSON representation of the appliances metadata chunk.
The user activities metadata chunk keeps a list of the user activities that are present in the dataset, including a list of the appliances that can be associated with each activity. An example is provided in Listing 6.
It is also possible to supplement datasets with annotations from the NILM metadata project. To this end we have defined the NILM Metadata Project annotation that can be used to embed the content of the different YAML files that compose the NILM Metadata project, in a metadata chunk. An example annotation is provided in Listing 7.
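The only constraints stated for custom metadata chunks are that the content is JSON and that it includes the ID and Label fields. A minimal check of those two constraints could look like the sketch below; the function name and the example payload are hypothetical, and the sketch additionally assumes the top-level JSON value is an object.

```python
import json

REQUIRED_FIELDS = ("ID", "Label")


def parse_custom_metadata(text):
    """Parse a custom metadata payload and verify the mandatory fields."""
    payload = json.loads(text)  # raises ValueError on malformed JSON
    if not isinstance(payload, dict):
        raise ValueError("custom metadata must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError("missing mandatory fields: " + ", ".join(missing))
    return payload


# Hypothetical payload for illustration
metadata = parse_custom_metadata('{"ID": 1, "Label": "appliances", "items": []}')
print(metadata["Label"])
```

Any further structure inside the payload is left to the dataset creator, which is where the JSON schemas mentioned later in the text come into play.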
Custom comment chunks consist of free-form text and are created using the Annotation List and Comment chunks. These can be used to add any kind of comment, for example, the history of previous performance evaluation results on that particular file or dataset. Another example would be a comment regarding some external event that could have affected the data.
Since most of the annotations are encoded in JSON, we have decided to use JSON schemas (see http://json-schema.org/) to describe the JSON data elements presented in the previous sub-section. Listing 8 shows a snippet of the JSON-Schema for the appliance activity labels.
The present version of EMD-DF64 was implemented using the Java programming language. This is an open-source project that can be found in an online repository (see https://gitlab.com/alspereira/EMD-DF). Online documentation is also available (see https://manelpereira.gitlab.io/emd-df-documentation/).
The UML class diagram of EMD-DF64 is provided in Fig. 3. The library is composed of three main packages: de.sciss.io, emddf.file, and emddf.api.
EMD-DF64 class diagram.
The de.sciss.io package is responsible for the audio file I/O and also provides the core classes to add ground-truth annotations (Marker and Region). This package was reused from the ScissLib project (see https://github.com/Sciss/ScissLib).
The emddf.file package provides all the basic structures to represent EMD-DF and EMD-DF64 files. The EMDDFAudioFile class allows reading and writing the audio files. The supported formats are Waveform Audio File Format (WAVE) and Sony Wave64 (W64), which support the EMD-DF and EMD-DF64 formats, respectively. The EMDDFFileDescr class provides the data structure that describes the format of an EMDDFAudioFile. These two classes were adapted from the ScissLib project classes AudioFile and AudioFileDescr. The remaining classes provide the basic structure to define the different EMD-DF and EMD-DF64 chunks.
The emddf.api package provides an API to manage EMD-DF and EMD-DF64 files. The API was developed around the four basic operations of persistent storage: Create, Read, Update, and Delete (CRUD). The API also offers a bridge to expose the EMD-DF64 behavior to different programming languages.
Finally, the emddf.demo package provides concrete application examples of this EMD-DF64 implementation. Presently there are three demos: (i) Explorer, which provides a command-line interface to explore different EMD-DF64 datasets; (ii) BLUED, which shows a concrete example of adding ground-truth labels to an EMD-DF64 dataset; and (iii) SustDataED, which provides a concrete example of handling missing data in EMD-DF64 datasets. These three demos are briefly described in Table 4.
Most modern programming environments (e.g., Python, C++, and MATLAB) offer mechanisms to handle audio files. To make the integration with other programming languages easier, a Java Bridge was developed to expose the functionalities of EMD-DF64. To exemplify its usage, two demos were developed and released together with EMD-DF64. The first demo uses Python, leveraging the pyemddf (see https://pypi.org/project/pyemddf/) package. The second demo was developed for MATLAB and leverages the External Language Interfaces (see https://www.mathworks.com/help/matlab/using-java-libraries-in-matlab.html). The two demos are briefly described in Table 4.
As mentioned in the introduction, one of the key advantages of using an audio format to store high-frequency datasets is the overall file size reduction. Therefore, this section presents a comparative analysis of three audio formats, namely, Free Lossless Audio Codec (FLAC), WAVE, and W64. FLAC was introduced in10 to store the UK-DALE dataset, whereas WAVE and W64 are the two formats that support EMD-DF and EMD-DF64, respectively.
The comparisons are performed using the Fully Labeled Public Dataset for Event-Based Non-Intrusive Load Monitoring Research (BLUED), considering the raw current and voltage waveforms. The original BLUED distribution contains one week of energy consumption of a two-phase electric system from a household in the United States. The current and voltage signals were sampled at 12 kHz, and are available in text files (.txt) containing a timestamp and the values for current (phase A and B), and voltage (phase A). Before conversion, the raw current and voltage waveforms were scaled to the range −1 to 1 by dividing each sample by the maximum value of each signal (75 and 180, respectively). Then, the scaled waveforms were converted to FLAC, WAVE, and W64 using the dsCleaner library30. The following format parameters were used: sampling rate: 12 kHz, sample size: 16 bits and number of channels: 2 (current and voltage). Individual files were created for each phase, by assuming that the voltage of phase B is the voltage of phase A shifted by 90°.
For comparison, the following quantities were considered: (1) the number of files based on the maximum file size; (2) the total size (in GB); and (3) the compressed size using WavPack compression. The results are summarized in Table 5.
As expected, the three audio formats enable a very significant decrease in the dataset size without losing any information. In the case of FLAC the reduction is around 92%, whereas for WAVE and W64 the reduction is 82% and 85%, respectively. However, it should be stressed that FLAC is at its core an audio compression tool, hence the considerably lower size of the resulting files. Therefore, to achieve a fair comparison, the WavPack audio compressor tool was used to compress the WAVE and W64 files. Ultimately, this resulted in a reduction in the file size of 90% and 91% for WAVE and W64, respectively.
Finally, regarding the number of files, it was possible to represent the entire dataset in two files (one per phase) when using the W64 format. As for the WAVE format, due to the limitations of the 32-bit header, it was necessary to break the dataset into 7 files for each phase. In the case of FLAC, it was not possible to merge the files using the dsCleaner API since read/write operations are not allowed in the FLAC format. Instead, external software, e.g., Audacity (see https://www.audacityteam.org/), would be required to perform the merge operation manually.
EMD-DF64 was developed to help mitigate the lack of homogeneity across existing electricity consumption datasets, which poses significant challenges for researchers intending to use datasets comparatively.
In15, the authors offered 17 suggestions for improving the collection, storage, and provision of electricity datasets. Next, we briefly discuss how EMD-DF64 promotes adopting some of those suggestions from a data storage and representation point of view.
Suggestion 5: gaps and irregularities. EMD-DF64 supports missing data by default. Furthermore, by indicating the timestamps with missing data, it is straightforward to implement data cleaning strategies such as data interpolation. Local metadata can also be used to annotate regions where data is out of distribution (e.g., considerable changes to the voltage signal).
Suggestion 10: annotation of traces of event/activity information. EMD-DF64 supports embedded annotations for both appliances (labels) and user activities (labeled text) by default. Furthermore, local metadata (notes) can be used to embed information on relevant changes, e.g., addition/removal/replacement of an appliance or changes to the number of household members.
Suggestion 12: metadata formatted in reusable machine way/easy processing. In EMD-DF64, metadata is represented using JSON, a widely used data-interchange format. Besides that, it is lightweight (e.g., compared to XML) and easy for humans to read and write. Furthermore, to remove ambiguity and errors when labeling the dataset, the annotations are validated using JSON schemas before being embedded in the data files.
Suggestion 14: compatibility of formats with software and compression. By extending an audio format, it is possible to take advantage of the many audio libraries available in the different programming languages. For instance, in our particular case, we provide a Java implementation of EMD-DF64, which can be easily integrated with Python using the pyemddf package.
As for compression, EMD-DF64 is fully compatible with the WavPack audio compression format, which provides both lossless and hybrid (i.e., lossy with the possibility of complete restoration) compression modes. The compression ratio depends on the source material but generally is between 30% and 70%. For example, using SustDataED, we have observed compression rates of 52% for raw current and voltage waveforms sampled at 12.8 kHz. On the other hand, a rate of 70% was achieved for active power, reactive power, current, and voltage measurements sampled at 50 Hz.
This paper presented a 64-bit file format for representing and storing high-frequency energy monitoring and disaggregation datasets. While we are fully aware that another format alone is not enough to solve the problem of lack of convergence between file formats in NILM research, we believe that in the era of big data, datasets will not only grow in volume but also in variety. Thus, it is necessary to provide mechanisms that are already optimized to handle such high volumes of data while also supporting a wide variety of mechanisms for metadata annotations, which is the case of audio file formats.
While different file formats have been proposed in the past, as summarized in15, one of the main challenges is to increase the adoption by the research community. To this end, in immediate future work, we will integrate EMD-DF64 datasets in the Non-Intrusive Load Monitoring Toolkit (NILMTK) pipeline. This will not only increase the visibility of EMD-DF64 but also mitigate one of the current limitations of NILMTK, which is the fact that the underlying NILMTK Data Format (NILMTK-DF) is not adequate to handle high-frequency datasets.
One of the limitations of EMD-DF64 is the fact that it does not support dataset querying by default, which is supported in alternative formats like relational databases and Hierarchical Data Format (HDF5). Therefore, another important future research direction would be to extend the API module with query operations, for example, loading the consumption waveforms that are associated with a given user activity. Furthermore, this query language should also allow the integration between aggregated and individual appliance consumption, which lie in different EMD-DF64 files.
Finally, it should be noted that although the current work deals with electricity consumption datasets, all the underlying concepts can be extended to support other types of time-series data. Moreover, since most of the annotations are added using JSON strings, any adaptation to other types of datasets should be very straightforward.
All the data necessary to run the demo applications (BLUED and SustDataED) are available in the project's OSF repository31. All the datasets used in the demo applications are available from the corresponding author on reasonable request. Please refer to the original publications for additional details.
The present version of EMD-DF64 was implemented using the Java programming language. The EMD-DF64 implementation is released under the GNU General Public License (GPL) 3.0. Source-code and documentation are available in the project page at the Open Science Framework (OSF)31.
Hart, G. Prototype nonintrusive appliance load monitor. Tech. Rep., MIT Energy Laboratory Technical Report, and Electric Power Research Institute Technical Report (1985).
Zeifman, M. & Roth, K. Nonintrusive appliance load monitoring: Review and outlook. IEEE Trans. Consumer Electron. 57, 76–84. https://doi.org/10.1109/TCE.2011.5735484 (2011).
Pereira, L. & Nunes, N. Performance evaluation in non-intrusive load monitoring: Datasets, metrics, and tools-a review. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 8, e1265 (2018).
Himeur, Y., Alsalemi, A., Bensaali, F. & Amira, A. Building power consumption datasets: Survey, taxonomy and future directions. Energy Build. 227, 110404. https://doi.org/10.1016/j.enbuild.2020.110404 (2020).
Gao, J., Giri, S., Kara, E. C. & Bergés, M. PLAID: a public dataset of high-resolution electrical appliance measurements for load identification research: demo abstract. In Proceedings of the 1st ACM Conference on Embedded Systems for Energy-Efficient Buildings, BuildSys ’14, (ACM, New York, NY, USA, 2014). https://doi.org/10.1145/2674061.2675032
Makonin, S., Ellert, B., Bajić, I. V. & Popowich, F. Electricity, water, and natural gas consumption of a residential house in Canada from 2012 to 2014. Sci. Data 3, 160037. https://doi.org/10.1038/sdata.2016.37 (2016).
Murray, D., Stankovic, L. & Stankovic, V. An electrical load measurements dataset of United Kingdom households from a two-year longitudinal study. Sci. Data 4, 160122 (2017).
Kriechbaumer, T. & Jacobsen, H.-A. BLOND, a building-level office environment dataset of typical electrical appliances. Sci. Data 5, 180048. https://doi.org/10.1038/sdata.2018.48 (2018).
Batra, N., Gulati, M., Singh, A. & Srivastava, M. B. It’s different: insights into home energy consumption in India. In Proceedings of the 5th ACM Workshop on Embedded Systems For Energy-Efficient Buildings, BuildSys’13, 3:1–3:8, (ACM, 2013). https://doi.org/10.1145/2528282.2528293
Kelly, J. & Knottenbelt, W. The UK-DALE dataset, domestic appliance-level electricity demand and whole-house demand from five UK homes. Sci. Data 2, 150007. https://doi.org/10.1038/sdata.2015.7 (2015).
Kahl, M., Ul Haq, A., Kriechbaumer, T. & Jacobsen, H.-A. WHITED - A worldwide household and industry transient energy data set. In 3rd International NILM Workshop (2016).
Pereira, L., Nunes, N. & Bergés, M. SURF and SURF-PI: a file format and API for non-intrusive load monitoring public datasets. In Proceedings of the 5th International Conference on Future Energy Systems, e-Energy ’14, 225–226, (ACM, 2014). https://doi.org/10.1145/2602044.2602078
Ribeiro, M., Pereira, L., Quintal, F. & Nunes, N. SustDataED: a public dataset for electric energy disaggregation research. In Proceedings of ICT for Sustainability 2016, Advances in Computer Science Research, 244–245, (Atlantis Press, 2016). https://doi.org/10.2991/ict4s-16.2016.36
Völker, B., Scholl, P. M. & Becker, B. Semi-automatic generation and labeling of training data for non-intrusive load monitoring. In Proceedings of the Tenth ACM International Conference on Future Energy Systems, e-Energy ’19, 17–23, (ACM, 2019). 10/gf32w9
Klemenjak, C. et al. Electricity consumption data sets: pitfalls and opportunities. In Proceedings of the 6th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, BuildSys ’19, 159–162, (ACM, 2019).
Kelly, J. & Knottenbelt, W. Metadata for energy disaggregation. In 2014 IEEE 38th International Computer Software and Applications Conference Workshops, 578–583, (2014). https://doi.org/10.1109/COMPSACW.2014.97
Batra, N. et al. NILMTK: an open source toolkit for non-intrusive load monitoring. In Proceedings of the 5th International Conference on Future Energy Systems, e-Energy ’14, 265–276, (ACM, 2014). https://doi.org/10.1145/2602044.2602051
Paris, J., Donnal, J. S. & Leeb, S. B. NilmDB: The non-intrusive load monitor database. IEEE Trans. Smart Grid 5, 2459–2467. https://doi.org/10.1109/TSG.2014.2321582 (2014).
Pereira, L. EMD-DF: A data model and file format for energy disaggregation datasets. In Proceedings of the 4th ACM International Conference on Systems for Energy-Efficient Built Environments (ACM, 2017).
Pereira, L., Quintal, F., Nunes, N. & Bergés, M. The design of a hardware-software platform for long-term energy eco-feedback research. In ACM SIGCHI Symposium on Engineering Interactive Computing Systems, EICS ’12, 221–230, (ACM, 2012). https://doi.org/10.1145/2305484.2305521
Kriechbaumer, T., Jorde, D. & Jacobsen, H.-A. Waveform signal entropy and compression study of whole-building energy datasets. In Proceedings of the Tenth ACM International Conference on Future Energy Systems, 58–67, (ACM, 2019).
Anderson, K. et al. BLUED: a fully labeled public dataset for event-based non-intrusive load monitoring research. In Proceedings of the 2nd KDD Workshop on Data Mining Applications in Sustainability (SustKDD), 1–5 (2012).
Makonin, S., Wang, Z. J. & Tumpach, C. RAE: The rainforest automation energy dataset for smart grid meter data analysis. Data 3, 8. https://doi.org/10.3390/data3010008 (2018).
Medico, R. et al. A voltage and current measurement dataset for plug load appliance identification in households. Sci. Data 7, 1–10. https://doi.org/10.1038/s41597-020-0389-7 (2020).
Reinhardt, A. et al. On the accuracy of appliance identification based on distributed load metering data. In 2012 Sustainable Internet and ICT for Sustainability (SustainIT), 1–9 (2012).
Monacchi, A., Egarter, D., Elmenreich, W., D’Alessandro, S. & Tonello, A. M. GREEND: an energy consumption dataset of households in Italy and Austria. In 2014 IEEE International Conference on Smart Grid Communications (SmartGridComm), 511–516, (2014). https://doi.org/10.1109/SmartGridComm.2014.7007698
Reinhardt, A. & Klemenjak, C. Device-free user activity detection using non-intrusive load monitoring: a case study. In Proceedings of the 2nd ACM Workshop on Device-Free Human Sensing, DFHS’20, 1–5, (Association for Computing Machinery). https://doi.org/10.1145/3427772.3429391
Bousbiat, H., Klemenjak, C., Leitner, G. & Elmenreich, W. Augmenting an assisted living lab with non-intrusive load monitoring. In 2020 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), 1–5, https://doi.org/10.1109/I2MTC43012.2020.9128406.
Multimedia Programming Interface and Data Specifications 1.0. Tech. Rep. (IBM Corporation and Microsoft Corporation, 1991).
Pereira, M., Velosa, N. & Pereira, L. dsCleaner: A python library to clean, preprocess and convert non-intrusive load monitoring datasets. Data 4, 123. https://doi.org/10.3390/data4030123 (2019).
Pereira, L., Pereira, M. & Velosa, N. EMD-DF64: A 64-Bit File Format for Energy Monitoring and Disaggregation Datasets (2021). https://doi.org/10.17605/OSF.IO/D7EBX
ITI, LARSyS, Técnico Lisboa, 1049-001, Lisbon, Portugal
Lucas Pereira
Universidade da Madeira, 9020-105, Funchal, Portugal
Nuno Velosa & Manuel Pereira
Conceptualization, L.P.; methodology, L.P.; research, L.P., N.V. and M.P.; development, L.P., N.V. and M.P.; original draft preparation, L.P., N.V., and M.P.; review and editing, L.P.; visualization, L.P.
Correspondence to Lucas Pereira.
The authors declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Pereira, L., Velosa, N. & Pereira, M. A data model and file format to represent and store high frequency energy monitoring and disaggregation datasets. Sci Rep 12, 10284 (2022). https://doi.org/10.1038/s41598-022-14517-y
DOI: https://doi.org/10.1038/s41598-022-14517-y
Scientific Reports (Sci Rep) ISSN 2045-2322 (online)
© 2022 Springer Nature Limited