Input and output (I/O) operations refer to the transfer of data between a computer’s main memory and various peripherals. Storage peripherals such as HDDs and SSDs have particular performance characteristics in terms of latency, throughput, and rate which can influence the performance of the computer system they power. Extrapolating, the performance and design of distributed and cloud-based data storage depend on those of the medium. This article is intended to be a bridge between Data Science and Storage Systems: 1/ I am sharing a few datasets of various sources and sizes which I hope will be novel for Data Scientists and 2/ I am bringing up the potential for advanced analytics in Distributed Systems.
Storage access traces are “a treasure trove of information for optimizing cloud workloads.” They’re crucial for capacity planning, data placement, or system design and evaluation, suited for modern applications. Diverse and up-to-date datasets are particularly needed in academic research to study novel and unintuitive access patterns, help the design of new hardware architectures, new caching algorithms, or hardware simulations.
Storage traces are notoriously difficult to find. The SNIA website is the best known “repository for storage-related I/O trace files, associated tools, and other related information” but many traces don’t comply with their licensing or upload format. Finding traces becomes a tedious process of scanning the academic literature or attempting to generate one’s own.
Popular traces which are easier to find tend to be outdated and overused. Traces older than 10 years should not be used in modern research and development due to changes in application workloads and hardware capabilities. Also, an over-use of specific traces can bias the understanding of real workloads so it’s recommended to use traces from multiple independent sources when possible.
This post is an organized collection of recent public traces I found and used. In the first part I categorize them by the level of abstraction they represent in the IO stack. In the second part I list and discuss some relevant datasets. The last part is a summary of all with a personal view on the gaps in storage tracing datasets.
I distinguish between three types of traces based on data representation and access model. Let me explain. A user, at the application layer, sees data stored in files or objects which are accessed by a large range of abstract operations such as open or append. Closer to the media, the data is stored in a continuous memory address space and accessed as blocks of fixed size which may only be read or written. At a higher abstraction level, within the application layer, we may also have a data presentation layer which may log access to data presentation units, which may be, for example, rows composing tables and databases, or articles and paragraphs composing news feeds. The access may be create table, or post article.
While traces can be taken anywhere in the IO stack and contain information from multiple layers, I am choosing to structure the following classification based on the Linux IO stack depicted below.
Block storage traces
The data in these traces is representative of the operations at the block layer. In Linux, this data is typically collected with blktrace (and rendered readable with blkparse), iostat, or dtrace. The traces contain information about the operation, the device, CPU, process, and storage location accessed. The first trace listed is an example of blktrace output.
The typical information generated by tracing programs may be too detailed for analysis and publication purposes, so it is often simplified. Typical public traces contain operation, offset, size, and sometimes timing. At this layer the operations are only read and write. Each operation starts at the address given by offset and applies to a contiguous range whose size is specified in number of blocks (4 KiB on NTFS, for example). For example, a trace entry for a read operation contains the address where the read starts (offset), and the number of blocks read (size). The timing information may contain the time the request was issued (start time), the time it was completed (end time), the processing in between (latency), and the time the request waited (queuing time).
Available traces sport different features, have wildly different sizes, and are the output of a variety of workloads. Selecting the right one will depend on what one’s looking for. For example, trace replay only needs the order of operations and their size; for performance analysis, timing information is needed.
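To make the simplified record concrete, here is a minimal Python sketch of a single trace entry. The class name BlockRecord, its field names, and the 4 KiB block size are illustrative assumptions, not part of any trace format; the timing fields are optional because many public traces omit them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

BLOCK_SIZE = 4096  # assumption: 4 KiB blocks; adjust to the traced device


@dataclass
class BlockRecord:
    """One simplified block trace entry (hypothetical layout)."""
    op: str                        # "R" for read, "W" for write
    offset: int                    # first block accessed
    size: int                      # number of contiguous blocks
    start: Optional[float] = None  # issue time in seconds, if recorded
    end: Optional[float] = None    # completion time in seconds, if recorded

    def byte_range(self) -> Tuple[int, int]:
        """Half-open byte range [first, last) touched by the request."""
        first = self.offset * BLOCK_SIZE
        return first, first + self.size * BLOCK_SIZE

    def latency(self) -> Optional[float]:
        """Completion time minus issue time, or None without timing data."""
        if self.start is None or self.end is None:
            return None
        return self.end - self.start
```

A replay tool would only read op, offset, and size, while a performance study also needs start and end.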
Object storage traces
At the application layer, data is located in files and objects which may be created, opened, appended, or closed, and then discovered via a tree structure. From a user’s point of view, the storage media is decoupled, hiding fragmentation, and allowing random byte access.
I’ll group together file and object traces despite a subtle difference between the two. Files follow the file system’s naming convention which is structured (typically hierarchical). Often the extension suggests the content type and usage of the file. On the other hand, objects are used in large scale storage systems dealing with vast amounts of diverse data. In object storage systems the structure is not intrinsic, instead it is defined externally, by the user, with specific metadata files managed by their workload.
Being generated within the application space, typically the result of an application logging mechanism, object traces are more diverse in terms of format and content. The information recorded may be more specific, for example, operations can also be delete, copy, or append. Objects typically have variable size and even the same object’s size may vary in time after appends and overwrites. The object identifier can be a string of variable size. It may encode extra information, for example, an extension that tells the content type. Other meta-information may come from the range accessed, which may tell us, for example, whether the header, the footer or the body of an image, parquet, or CSV file was accessed.
Object storage traces are better suited for understanding user access patterns. In terms of block access, a video stream and a sequential read of an entire file generate the same pattern: multiple sequential IOs at regular time intervals. But these trace entries should be treated differently if we are to replay them. Video streaming blocks need to be accessed with the same time delta between them, regardless of the latency of each individual block, while reading the entire file should happen as fast as possible.
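The two replay policies can be sketched in a few lines of Python. The function name replay_schedule and its preserve_gaps flag are hypothetical; the sketch only computes when each request should be issued relative to the start of the replay, and leaves the actual I/O to the replay tool.

```python
def replay_schedule(timestamps, preserve_gaps):
    """Return issue times, in seconds from the start of the replay.

    preserve_gaps=True keeps the recorded time deltas (streaming-like
    access); preserve_gaps=False issues every request immediately
    (whole-file read, as fast as possible).
    """
    if not timestamps:
        return []
    if preserve_gaps:
        first = timestamps[0]
        return [t - first for t in timestamps]
    return [0.0] * len(timestamps)
```

Which policy applies to a given request cannot be inferred from the block pattern alone, which is why the object-level context matters.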
Access traces
Specific to each application, data may be abstracted further. Data units may be instances of a class, records in a database, or ranges in a file. A single data access may not even generate a file open or a disk IO if caching is involved. I choose to include such traces because they may be used to understand and optimize storage access, and in particular cloud storage. For example, the access traces from Twitter’s Memcache are useful in understanding popularity distributions and therefore may be useful for data formatting and placement decisions. Often they’re not storage traces per se, but they can be useful in the context of cache simulation, IO reduction, or data layout (indexing).
The data format in these traces can be even more diverse because of the extra layer of abstraction; for example, entries in the Memcached traces may be keyed by tweet identifiers.
Let’s look at a few traces in each of the categories above. The list details some of the newer traces — no older than 10 years — and it is by no means exhaustive.
Block traces
YCSB RocksDB SSD 2020
These are SSD traces collected on a 28-core, 128 GB host with two 512 GB NVMe SSD Drives, running Ubuntu. The dataset is a result of running the YCSB-0.15.0 benchmark with RocksDB.
The first SSD stores all blktrace output, while the second hosts YCSB and RocksDB. YCSB Workload A consists of 50% reads and 50% updates of 1B operations on 250M records. Runtime is 9.7 hours, which generates over 352M block I/O requests at the file system level writing a total of 6.8 TB to the disk, with a read throughput of 90 MBps and a write throughput of 196 MBps.
The dataset is small compared to all others in the list, and limited in terms of workload, but a great place to start due to its manageable size. Another benefit is reproducibility: it uses open source tracing tools and benchmarking beds atop a relatively inexpensive hardware setup.
Format: These are SSD traces taken with blktrace and have the typical format after parsing with blkparse: [Device Major Number,Device Minor Number] [CPU Core ID] [Record ID] [Timestamp (in seconds, with nanosecond resolution)] [ProcessID] [Trace Action] [OperationType] [SectorNumber + I/O Size] [ProcessName]
259,2 0 1 0.000000000 4020 Q R 282624 + 8 [java]
259,2 0 2 0.000001581 4020 G R 282624 + 8 [java]
259,2 0 3 0.000003650 4020 U N [java] 1
259,2 0 4 0.000003858 4020 I RS 282624 + 8 [java]
259,2 0 5 0.000005462 4020 D RS 282624 + 8 [java]
259,2 0 6 0.013163464 0 C RS 282624 + 8 [0]
259,2 0 7 0.013359202 4020 Q R 286720 + 128 [java]
Where to find it: http://iotta.snia.org/traces/block-io/28568
License: SNIA Trace Data Files Download License
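As a starting point for analysis, the blkparse lines above can be parsed with a regular expression. The sketch below is an assumption-laden example rather than a full blkparse parser: it only understands lines of the form "sector + size" (other lines, such as the unplug event, are skipped), and it pairs each dispatch (action D) with the matching completion (action C) by device and sector to estimate device latency. The names LINE_RE, parse_line, and dispatch_to_complete are hypothetical.

```python
import re

# Matches lines such as: 259,2 0 1 0.000000000 4020 Q R 282624 + 8 [java]
LINE_RE = re.compile(
    r"^(?P<major>\d+),(?P<minor>\d+)\s+(?P<cpu>\d+)\s+(?P<seq>\d+)\s+"
    r"(?P<time>\d+\.\d+)\s+(?P<pid>\d+)\s+(?P<action>\w+)\s+(?P<rwbs>\w+)\s+"
    r"(?P<sector>\d+) \+ (?P<count>\d+)\s+\[(?P<process>.*)\]$"
)

SAMPLE = [
    "259,2 0 1 0.000000000 4020 Q R 282624 + 8 [java]",
    "259,2 0 2 0.000001581 4020 G R 282624 + 8 [java]",
    "259,2 0 3 0.000003650 4020 U N [java] 1",
    "259,2 0 4 0.000003858 4020 I RS 282624 + 8 [java]",
    "259,2 0 5 0.000005462 4020 D RS 282624 + 8 [java]",
    "259,2 0 6 0.013163464 0 C RS 282624 + 8 [0]",
    "259,2 0 7 0.013359202 4020 Q R 286720 + 128 [java]",
]


def parse_line(line):
    """Parse one line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    for key in ("major", "minor", "cpu", "seq", "pid", "sector", "count"):
        record[key] = int(record[key])
    record["time"] = float(record["time"])
    return record


def dispatch_to_complete(lines):
    """Return (sector, seconds) pairs between D and C events."""
    pending = {}
    latencies = []
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # unsupported line layout, e.g. plug/unplug events
        key = (record["major"], record["minor"], record["sector"])
        if record["action"] == "D":
            pending[key] = record["time"]
        elif record["action"] == "C" and key in pending:
            latencies.append((record["sector"], record["time"] - pending.pop(key)))
    return latencies
```

On the sample lines this finds a single completed request, the 8-sector read at sector 282624, which spent roughly 13 ms between dispatch and completion.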
Alibaba Block Traces 2020
The dataset consists of “block-level I/O requests collected from 1,000 volumes, where each has a raw capacity from 40 GiB to 5 TiB. The workloads span diverse types of cloud applications. Each collected I/O request specifies the volume number, request type, request offset, request size, and timestamp.”
Limitations (from the academic paper)
- the traces do not record the response times of the I/O requests, making them unsuitable for latency analysis of I/O requests.
- the specific applications running atop are not mentioned, so they cannot be used to extract application workloads and their I/O patterns.
- the traces capture the access to virtual devices, so they are not representative of performance and reliability (e.g., data placement and failure statistics) for physical block storage devices.
A drawback of this dataset is its size. When uncompressed it results in a 751GB file which is difficult to store and manage.
Format: device_id,opcode,offset,length,timestamp
- device_id: ID of the virtual disk, uint32
- opcode: either ‘R’ or ‘W’, indicating whether this operation is a read or a write
- offset: offset of this operation, in bytes, uint64
- length: length of this operation, in bytes, uint32
- timestamp: timestamp of this operation as received by the server, in microseconds, uint64
419,W,8792731648,16384,1577808144360767
725,R,59110326272,360448,1577808144360813
12,R,350868463616,8192,1577808144360852
725,R,59110686720,466944,1577808144360891
736,R,72323657728,516096,1577808144360996
12,R,348404277248,8192,1577808144361031
Additionally, there is an extra file containing each virtual device’s id (device_id) with its total capacity.
Where to find it: https://github.com/alibaba/block-traces
License: CC-4.0.
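Given the size of the uncompressed file, it is safer to process it as a stream than to load it whole. The following sketch, with the hypothetical function name bytes_per_device, accumulates bytes read and written per virtual disk while iterating line by line; it assumes the five-column layout described above and no header row.

```python
import csv
from collections import defaultdict

SAMPLE = [
    "419,W,8792731648,16384,1577808144360767",
    "725,R,59110326272,360448,1577808144360813",
    "12,R,350868463616,8192,1577808144360852",
    "725,R,59110686720,466944,1577808144360891",
    "736,R,72323657728,516096,1577808144360996",
    "12,R,348404277248,8192,1577808144361031",
]


def bytes_per_device(lines):
    """Sum request lengths per device and opcode, one row at a time."""
    totals = defaultdict(lambda: {"R": 0, "W": 0})
    for device_id, opcode, _offset, length, _timestamp in csv.reader(lines):
        totals[int(device_id)][opcode] += int(length)
    return dict(totals)
```

The same function accepts an open file object instead of a list, for example `with open(path) as f: bytes_per_device(f)`, where path points to wherever the trace was downloaded.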
Tencent Block Storage 2018
This dataset consists of “216 I/O traces from a warehouse (also called a failure domain) of a production cloud block storage system (CBS). The traces are I/O requests from 5584 cloud virtual volumes (CVVs) for ten days (from Oct. 1st to Oct. 10th, 2018). The I/O requests from the CVVs are mapped and redirected to a storage cluster consisting of 40 storage nodes (i.e., disks).”
Limitations:
- Timestamps are in seconds, a granularity too coarse for determining the order of operations. As a consequence, many requests appear as if issued at the same time. This trace is therefore unsuitable for queuing analysis.
- There is no information about the duration of each operation, making the trace unsuitable for latency or queuing analysis.
- No extra information about each volume such as total size.
Format: Timestamp,Offset,Size,IOType,VolumeID
- Timestamp is the Unix time the I/O was issued, in seconds.
- Offset is the starting offset of the I/O in sectors from the start of the logical virtual volume. 1 sector = 512 bytes.
- Size is the transfer size of the I/O request in sectors.
- IOType is “Read(0)”, “Write(1)”.
- VolumeID is the ID number of a CVV.
1538323200,12910952,128,0,1063
1538323200,6338688,8,1,1627
1538323200,1904106400,384,0,1360
1538323200,342884064,256,0,1360
1538323200,15114104,8,0,3607
1538323200,140441472,32,0,1360
1538323200,15361816,520,1,1371
1538323200,23803384,8,0,2363
1538323200,5331600,4,1,3171
Where to find it: http://iotta.snia.org/traces/parallel/27917
License: SNIA Trace Data Files Download License
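Two practical points from the format can be checked with a short script: offsets and sizes need converting from 512-byte sectors to bytes, and the one-second timestamps collide. The sketch below uses hypothetical helper names (parse, requests_per_timestamp) and assumes the five-column layout listed above.

```python
from collections import Counter

SECTOR_BYTES = 512  # stated in the format description

SAMPLE = [
    "1538323200,12910952,128,0,1063",
    "1538323200,6338688,8,1,1627",
    "1538323200,1904106400,384,0,1360",
    "1538323200,342884064,256,0,1360",
    "1538323200,15114104,8,0,3607",
    "1538323200,140441472,32,0,1360",
    "1538323200,15361816,520,1,1371",
    "1538323200,23803384,8,0,2363",
    "1538323200,5331600,4,1,3171",
]


def parse(line):
    """Convert one CSV line into a dict with byte-based offset and size."""
    timestamp, offset, size, io_type, volume = line.strip().split(",")
    return {
        "timestamp": int(timestamp),
        "offset_bytes": int(offset) * SECTOR_BYTES,
        "size_bytes": int(size) * SECTOR_BYTES,
        "op": "R" if io_type == "0" else "W",
        "volume": int(volume),
    }


def requests_per_timestamp(lines):
    """Count how many requests share each one-second timestamp."""
    return Counter(parse(line)["timestamp"] for line in lines)
```

All nine sample requests fall into the same second, so their relative order within that second is only what the file order suggests.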
K5cloud Traces 2018
This dataset contains traces from virtual cloud storage from the FUJITSU K5 cloud service. The data was gathered over a week, but not continuously because “one day’s IO access logs often consumed the storage capacity of the capture system.” There are 24 billion records from 3088 virtual storage nodes.
The data is captured in the TCP/IP network between servers running on hypervisor and storage systems in a K5 data center in Japan. The data is split between three datasets by each virtual storage volume id. Each virtual storage volume id is unique in the same dataset, while each virtual storage volume id is not unique between the different datasets.
Limitations:
- There is no latency information, so the traces cannot be used for performance analysis.
- The total node size is missing, but it can be approximated from the maximum offset accessed in the traces.
- Some applications may require a complete dataset, which makes this one unsuitable due to missing data.
The fields in the IO access log are: ID,Timestamp,Type,Offset,Length
- ID is the virtual storage volume id.
- Timestamp is the time elapsed from the first IO request of all IO access logs in seconds, but with a microsecond granularity.
- Type is R (Read) or W (Write).
- Offset is the starting offset of the IO access in bytes from the start of the virtual storage.
- Length is the transfer size of the IO request in bytes.
1157,3.828359000,W,7155568640,4096
1157,3.833921000,W,7132311552,8192
1157,3.841602000,W,15264690176,28672
1157,3.842341000,W,28121042944,4096
1157,3.857702000,W,15264718848,4096
1157,9.752752000,W,7155568640,4096
Where to find it: http://iotta.snia.org/traces/parallel/27917
License: CC-4.0.
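The approximation of the missing node size mentioned in the limitations can be sketched as follows: for each volume, track the largest value of offset plus length seen in the trace. The function name approximate_volume_sizes is hypothetical, and the result is only a lower bound, since regions of a volume that were never accessed during the capture leave no record.

```python
SAMPLE = [
    "1157,3.828359000,W,7155568640,4096",
    "1157,3.833921000,W,7132311552,8192",
    "1157,3.841602000,W,15264690176,28672",
    "1157,3.842341000,W,28121042944,4096",
    "1157,3.857702000,W,15264718848,4096",
    "1157,9.752752000,W,7155568640,4096",
]


def approximate_volume_sizes(lines):
    """Map each volume id to the highest byte address touched (lower bound)."""
    sizes = {}
    for line in lines:
        volume, _timestamp, _op, offset, length = line.strip().split(",")
        end = int(offset) + int(length)
        sizes[volume] = max(sizes.get(volume, 0), end)
    return sizes
```

Because volume ids are only unique within one of the three datasets, the estimate should be computed per dataset rather than across all of them.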
Object traces
Server-side I/O request arrival traces 2019
This repository contains two datasets for IO block traces with additional file identifiers: 1/ parallel file systems (PFS) and 2/ I/O nodes.
Notes:
- The access patterns result from the MPI-IO test benchmark run atop Grid5000, a large scale test bed for parallel and High Performance Computing (HPC). These traces are not representative of general user or cloud workloads but are instead specific to HPC and parallel computing.
- The setup for the PFS scenario uses OrangeFS as the file system, and the IO nodes scenario uses the I/O Forwarding Scalability Layer (IOFSL). In both cases the scheduler was set to the AGIOS I/O scheduling library. This setup is perhaps too specific for most use cases targeted by this article and has been designed to reflect some proposed solutions.
- The hardware setup for PFS consists of four server nodes with 600 GB HDDs each and 64 client nodes. For IO nodes, it has four server nodes with similar disk configuration in a cluster, and 32 clients in a different cluster.
Format: The format is slightly different for the two datasets, an artifact of different file systems. For IO nodes, it consists of multiple files, each with tab-separated values Timestamp FileHandle RequestType Offset Size. A peculiarity is that reads and writes are in separate files named accordingly.
- Timestamp is a number representing the internal timestamp in nanoseconds.
- FileHandle is the file handle in hexadecimal of size 64.
- RequestType is the type of the request, inverted: “W” for reads and “R” for writes.
- Offset is a number giving the request offset in bytes.
- Size is the size of the request in bytes.
265277355663 00000000fbffffffffffff0f729db77200000000000000000000000000000000 W 2952790016 32768
265277587575 00000000fbffffffffffff0f729db77200000000000000000000000000000000 W 1946157056 32768
265277671107 00000000fbffffffffffff0f729db77200000000000000000000000000000000 W 973078528 32768
265277913090 00000000fbffffffffffff0f729db77200000000000000000000000000000000 W 4026531840 32768
265277985008 00000000fbffffffffffff0f729db77200000000000000000000000000000000 W 805306368 32768
The PFS scenario has two concurrent applications, “app1” and “app2”, and its traces are inside a folder named accordingly. Each row entry has the following format: [<Timestamp>] REQ SCHED SCHEDULING, handle:<FileHandle>, queue_element: <QueueElement>, type: <RequestType>, offset: <Offset>, len: <Size>
Different from the above are:
- RequestType is 0 for reads and 1 for writes.
- QueueElement is never used and I believe it is an artifact of the tracing tool.
[D 01:11:03.153625] REQ SCHED SCHEDULING, handle: 5764607523034233445, queue_element: 0x12986c0, type: 1, offset: 369098752, len: 1048576
[D 01:11:03.153638] REQ SCHED SCHEDULING, handle: 5764607523034233445, queue_element: 0x1298e30, type: 1, offset: 268435456, len: 1048576
[D 01:11:03.153651] REQ SCHED SCHEDULING, handle: 5764607523034233445, queue_element: 0x1188b80, type: 1, offset: 0, len: 1048576
[D 01:11:03.153664] REQ SCHED SCHEDULING, handle: 5764607523034233445, queue_element: 0xf26340, type: 1, offset: 603979776, len: 1048576
[D 01:11:03.153676] REQ SCHED SCHEDULING, handle: 5764607523034233445, queue_element: 0x102d6e0, type: 1, offset: 637534208, len: 1048576
Where to find it: https://zenodo.org/records/3340631#.XUNa-uhKg2x
License: CC-4.0.
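Because the two datasets use different layouts, it helps to normalize both into the same record shape before analysis. The sketch below is one way to do it, with hypothetical function names parse_ionode and parse_pfs. It follows the request type conventions exactly as described above (inverted letters for IO nodes, 0/1 for PFS), which are worth verifying against the dataset's own documentation before relying on them.

```python
import re

PFS_RE = re.compile(
    r"^\[\w (?P<time>[\d:.]+)\]\s+REQ SCHED SCHEDULING, "
    r"handle:\s*(?P<handle>\d+), queue_element:\s*0x[0-9a-fA-F]+, "
    r"type:\s*(?P<type>[01]), offset:\s*(?P<offset>\d+), len:\s*(?P<size>\d+)\s*$"
)

# Mapping as described in the text: the letters are inverted for IO nodes.
IONODE_OPS = {"W": "read", "R": "write"}
PFS_OPS = {"0": "read", "1": "write"}


def parse_ionode(line):
    """Parse one whitespace- or tab-separated IO node line."""
    timestamp, handle, request_type, offset, size = line.split()
    return {
        "time": int(timestamp),  # nanoseconds
        "handle": handle,
        "op": IONODE_OPS[request_type],
        "offset": int(offset),
        "size": int(size),
    }


def parse_pfs(line):
    """Parse one PFS scheduler log line, or return None if it does not match."""
    match = PFS_RE.match(line.strip())
    if match is None:
        return None
    return {
        "time": match["time"],  # wall-clock string, kept as-is
        "handle": match["handle"],
        "op": PFS_OPS[match["type"]],
        "offset": int(match["offset"]),
        "size": int(match["size"]),
    }
```

The unused queue_element field is matched but deliberately dropped from the output.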
IBM Cloud Object Store 2019
These are anonymized traces from the IBM Cloud Object Storage service collected with the primary goal to study data flows to the object store.
The dataset is composed of 98 traces containing around 1.6 billion requests for 342 million unique objects. The traces themselves are about 88 GB in size. Each trace contains the REST operations issued against a single bucket in IBM Cloud Object Storage during a single week in 2019, and holds between 22,000 and 187,000,000 object requests. All the traces were collected during the same week. The traces contain all data access requests issued over a week by a single tenant of the service. Object names are anonymized.
Some characteristics of the workload have been published in this paper, although the dataset used was larger:
- The authors were “able to identify some of the workloads as SQL queries, Deep Learning workloads, Natural Language Processing (NLP), Apache Spark data analytic, and document and media servers. But many of the workloads’ types remain unknown.”
- “A vast majority of the objects (85%) in the traces are smaller than a megabyte, Yet these objects only account for 3% of the stored capacity.” This made the data suitable for a cache analysis.
Format: <time stamp of request> <request type> <object ID> <optional: size of object> <optional: beginning offset> <optional: ending offset>
The timestamp is the number of milliseconds from the point where we began collecting the traces.
1219008 REST.PUT.OBJECT 8d4fcda3d675bac9 1056
1221974 REST.HEAD.OBJECT 39d177fb735ac5df 528
1232437 REST.HEAD.OBJECT 3b8255e0609a700d 1456
1232488 REST.GET.OBJECT 95d363d3fbdc0b03 1168 0 1167
1234545 REST.GET.OBJECT bfc07f9981aa6a5a 528 0 527
1256364 REST.HEAD.OBJECT c27efddbeef2b638 12752
1256491 REST.HEAD.OBJECT 13943e909692962f 9760
Where to find it: http://iotta.snia.org/traces/key-value/36305
License: SNIA Trace Data Files Download License
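The optional fields make this format slightly awkward to parse, and the size skew quoted above is easy to check once the lines are parsed. The sketch below uses hypothetical names (parse, small_object_share) and assumes that the last size seen for an object is its current size and that "a megabyte" means 2**20 bytes; both are assumptions of this example, not statements from the dataset.

```python
def parse(line):
    """Parse one line; size and byte range are None when absent."""
    parts = line.split()
    record = {
        "timestamp_ms": int(parts[0]),
        "op": parts[1],
        "object_id": parts[2],
        "size": int(parts[3]) if len(parts) > 3 else None,
        "range": (int(parts[4]), int(parts[5])) if len(parts) > 5 else None,
    }
    return record


def small_object_share(lines, threshold=2**20):
    """Return (fraction of objects, fraction of bytes) below threshold."""
    sizes = {}
    for line in lines:
        record = parse(line)
        if record["size"] is not None:
            sizes[record["object_id"]] = record["size"]  # keep last size seen
    if not sizes:
        return 0.0, 0.0
    small = [size for size in sizes.values() if size < threshold]
    total = sum(sizes.values())
    return len(small) / len(sizes), (sum(small) / total if total else 0.0)
```

For a full trace the dictionary of object sizes can grow large, so this is best run per trace file rather than over the whole dataset at once.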
Access traces
Wiki Analytics Datasets 2019
The wiki dataset contains data for 1/ upload (image) web requests of Wikimedia and 2/ text (HTML pageview) web requests from one CDN cache server of Wikipedia. The most recent dataset, from 2019, contains 21 upload data files and 21 text data files.
Format: Each upload data file, denoted cache-u, contains exactly 24 hours of consecutive data. These files are each roughly 1.5GB in size and hold roughly 4GB of decompressed data each.
This dataset is the result of a single type of workload, which may limit the applicability, but it is large and complete, which makes a good testbed.
Each decompressed upload data file has the following format: relative_unix hashed_path_query image_type response_size time_firstbyte
- relative_unix: Seconds since start timestamp of dataset, int
- hashed_path_query: Salted hash of path and query of request, bigint
- image_type: Image type from Content-Type header of response, string
- response_size: Response size in bytes, int
- time_firstbyte: Seconds to first byte, double
0 833946053 jpeg 9665 1.85E-4
0 -1679404160 png 17635 2.09E-4
0 -374822678 png 3333 2.18E-4
0 -1125242883 jpeg 4733 1.57E-4
Each text data file, denoted cache-t, contains exactly 24 hours of consecutive data. These files are each roughly 100MB in size and hold roughly 300MB of decompressed data each.
Each decompressed text data file has the following format: relative_unix hashed_host_path_query response_size time_firstbyte
4619 540675535 57724 1.92E-4
4619 1389231206 31730 2.29E-4
4619 -176296145 20286 1.85E-4
4619 74293765 14154 2.92E-4
Where to find it: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Caching
License: CC-4.0.
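Since these are CDN cache requests, a natural first experiment is a cache simulation. The sketch below replays upload-file lines through a byte-bounded LRU cache and reports the object hit ratio. It is a simplified illustration, not the policy used by Wikimedia's caches: the function name lru_hit_ratio is hypothetical, the object id and size are taken from the second and fourth columns of the upload format, and an object's size is assumed not to change between requests.

```python
from collections import OrderedDict


def lru_hit_ratio(lines, capacity_bytes):
    """Replay upload-format lines through an LRU cache bounded in bytes."""
    cache = OrderedDict()  # object id -> size, least recently used first
    used = hits = total = 0
    for line in lines:
        fields = line.split()
        key, size = fields[1], int(fields[3])
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue  # object can never fit, do not admit it
        cache[key] = size
        used += size
        while used > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
    return hits / total if total else 0.0
```

For the text files the response size sits in the third column, so the field index would need adjusting.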
Memcached 2020
This dataset, Anonymized Cache Request Traces from Twitter Production, contains one-week-long traces from Twitter’s in-memory caching (Twemcache / Pelikan) clusters. The data comes from the 54 largest clusters in March 2020.
Format: Each trace file is a csv with the format: timestamp,anonymized key,key size,value size,client id,operation,TTL
- timestamp: the time when the cache receives the request, in sec
- anonymized key: the original key with anonymization where namespaces are preserved; for example, if the anonymized key is nz:u:eeW511W3dcH3de3d15ec, the first two fields nz and u are namespaces. Note that the namespaces are not necessarily delimited by “:”; different workloads use different delimiters with different numbers of namespaces.
- key size: the size of key in bytes
- value size: the size of value in bytes
- client id: the anonymized client (frontend service) that sends the request
- operation: one of get/gets/set/add/replace/cas/append/prepend/delete/incr/decr
- TTL: the time-to-live (TTL) of the object set by the client; it is 0 when the request is not a write request.
0,q:q:1:8WTfjZU14ee,17,213,4,get,0
0,yDqF:3q:1AJrrJ1nnCJKKrnGx1A,27,27,5,get,0
0,q:q:1:8WTw2gCuJe8,17,720,6,get,0
0,yDqF:vS:1AJr9JnArxCJGxn919K,27,27,7,get,0
0,yDqF:vS:1AJrrKG1CAnr1C19KxC,27,27,8,get,0
License: CC-4.0.
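A quick way to explore such a trace is to group requests by key namespace. The sketch below uses hypothetical names (parse, namespace, namespace_popularity) and makes two assumptions: the namespace is everything before the last delimiter, and the delimiter is ":" unless stated otherwise, which, as noted above, does not hold for every workload. Fields are split from the right so that a key containing a comma would not shift the columns.

```python
from collections import Counter

SAMPLE = [
    "0,q:q:1:8WTfjZU14ee,17,213,4,get,0",
    "0,yDqF:3q:1AJrrJ1nnCJKKrnGx1A,27,27,5,get,0",
    "0,q:q:1:8WTw2gCuJe8,17,720,6,get,0",
    "0,yDqF:vS:1AJr9JnArxCJGxn919K,27,27,7,get,0",
    "0,yDqF:vS:1AJrrKG1CAnr1C19KxC,27,27,8,get,0",
]


def parse(line):
    """Parse one CSV line, splitting the trailing fixed fields first."""
    head, key_size, value_size, client_id, operation, ttl = line.strip().rsplit(",", 5)
    timestamp, key = head.split(",", 1)
    return {
        "timestamp": int(timestamp),
        "key": key,
        "key_size": int(key_size),
        "value_size": int(value_size),
        "client_id": client_id,
        "operation": operation,
        "ttl": int(ttl),
    }


def namespace(key, delimiter=":"):
    """Everything before the last delimiter; empty string if none."""
    return delimiter.join(key.split(delimiter)[:-1])


def namespace_popularity(lines, delimiter=":"):
    """Count requests per key namespace."""
    return Counter(namespace(parse(line)["key"], delimiter) for line in lines)
```

On the five sample lines this yields two requests each for the q:q:1 and yDqF:vS namespaces and one for yDqF:3q.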
If you’re still here and haven’t gone diving into one of the traces linked above it may be because you haven’t found what you’re looking for. There are a few gaps that current storage traces have yet to fill:
- Multi-tenant Cloud Storage: Large cloud storage providers store some of the richest datasets out there. Their workload reflects the architecture of large scale systems and is the result of a diverse set of applications. Storage providers are also extra cautious when it comes to sharing this data. There is little or no financial incentive to share data with the public and a fear of unintended customer data leaks.
- Full stack. Each layer in the stack offers a different view on access patterns, none alone being enough to understand cause-and-effect relationships in storage systems. Optimizing a system to suit modern workloads requires a holistic view of data access, and such views are not publicly available.
- Distributed tracing. Most data is nowadays accessed remotely and managed in large scale distributed systems. Many components and layers (such as indexes or caching) will alter the access patterns. In such an environment, end-to-end means tracing a request across several components in a complex architecture. This data can be truly valuable for designing large scale systems but, at the same time, may be too specific to the system inspected which, again, limits the incentive to publish it.
- Data quality. The traces above have limitations due to the level of detail they represent. As we have seen, some have missing data, some have coarse timestamps, and others are inconveniently large to use. Cleaning data is a tedious process, which limits how many datasets get published.