90% of the data in the world was created in the last two years.
That’s not all: it is growing at a rate of roughly 40% every year. Data is being generated not only by people but also by software and machines. We’re already talking in terabytes (TB) and petabytes (PB) today, but some experts estimate that 463 exabytes (EB) of data will be produced every day by 2025.
What is the world doing with so much data? There are a lot of commercial and non-commercial applications:
- Financial, booking and other transactions in business
- Scientific computation and analysis in radiology, genomics, meteorology, seismology, etc.
- Web-based services such as cloud apps, social media, video streaming and so on
One thing is common among all of these – the use of data analytics to gain insights, make predictions and drive innovation, be it in an individual, institutional or business setting.
In the enterprise, data analytics is absolutely necessary to implement artificial intelligence (AI) and machine learning (ML) solutions, improve productivity, identify high-growth markets, streamline operations and provide a better customer experience.
However, the scale and unstructured nature of datasets today make it nearly impossible for traditional IT infrastructure, applications and database management systems to process and analyze data quickly or cost effectively.
Scores of new technologies are being developed to address this challenge, including hybrid cloud architectures, edge/distributed computing, IoT, databases that handle a wide variety of data formats and queries, massively parallel processing and so on. These place huge demands on the underlying storage and data processing infrastructure – big data needs powerful CPUs with multiple cores, faster memory, more bandwidth and, of course, reliable storage with higher capacities that can be read and written faster.
Taken as a whole, the speed of data storage and processing depends more on the format of the data and the applications that access it than on where it is stored. And counterintuitively, that makes the storage drive even more important to data analytics.
Enter solid-state drives (SSDs)
SSDs have slowly but surely emerged as the de facto choice for ultra-fast storage in the enterprise, especially where a lot of data processing is involved. Further, most analytics platforms today run on the cloud, where users access them on an as-needed basis. The cloud service providers’ data centers (where the actual analytics workload is hosted) also benefit from acceleration methods such as parallelization (running multiple, concurrent data processes) and shuffling (redistributing intermediate data between processing stages), both of which are well served by NAND flash-based SSDs.
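As a rough illustration of the parallelization idea (independent of any particular analytics platform), the Python sketch below reads chunks of a file concurrently with a thread pool; the file name and chunk size are placeholder assumptions, not part of any real workload.

```python
# Illustrative sketch: concurrent chunked reads, the kind of parallel I/O
# pattern that fast SSDs make worthwhile. Path and sizes are placeholders.
import os
from concurrent.futures import ThreadPoolExecutor

DATA_FILE = "dataset.bin"        # hypothetical large dataset file
CHUNK_SIZE = 64 * 1024 * 1024    # 64 MiB per worker read

def read_chunk(offset: int, length: int) -> int:
    """Read one chunk starting at `offset` and return the number of bytes read."""
    with open(DATA_FILE, "rb") as f:
        f.seek(offset)
        return len(f.read(length))

def parallel_scan(workers: int = 8) -> int:
    """Scan the whole file with several concurrent readers."""
    size = os.path.getsize(DATA_FILE)
    offsets = range(0, size, CHUNK_SIZE)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda off: read_chunk(off, CHUNK_SIZE), offsets)
    return sum(results)

if __name__ == "__main__":
    total = parallel_scan()
    print(f"Scanned {total / 1e9:.2f} GB across parallel workers")
```

On a spinning disk, the concurrent seeks in a pattern like this tend to hurt throughput; on an SSD they can be served in parallel, which is the point of the acceleration methods described above.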
Critically, SSDs also offer a price-to-performance ratio that sits snugly between DRAM and HDDs. Their cost-per-bit is considerably lower than DRAM’s, while the gap in access times and bandwidth is closing rapidly. On the other hand, SSDs may be more expensive than HDDs in cost-per-GB, but their I/O performance is several orders of magnitude higher, which translates into a lower cost-per-IOPS.
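To make the cost-per-IOPS argument concrete, here is a back-of-the-envelope calculation in Python. The prices, capacities and IOPS figures are purely illustrative assumptions, not quotes for any real product:

```python
# Back-of-the-envelope cost-per-GB vs. cost-per-IOPS comparison.
# All figures below are illustrative assumptions, not vendor pricing.
drives = {
    # name: (price in USD, capacity in GB, random-read IOPS)
    "HDD (7.2K RPM)": (250.0, 8000, 200),
    "SATA SSD":       (400.0, 3840, 90_000),
    "NVMe SSD":       (900.0, 7680, 800_000),
}

for name, (price, capacity_gb, iops) in drives.items():
    cost_per_gb = price / capacity_gb      # what the capacity costs
    cost_per_iops = price / iops           # what the performance costs
    print(f"{name:15s}  ${cost_per_gb:6.3f}/GB  ${cost_per_iops:8.5f}/IOPS")
```

Even with these rough numbers, the HDD wins on $/GB while the SSDs win on $/IOPS by several orders of magnitude, which is exactly the trade-off described above.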
Best of all, the price of NAND flash memory (the building block of SSDs) is projected to fall faster than that of other media, eventually matching HDDs in $/GB for some product categories and sweetening the SSD value proposition even more.
So what advantages do SSDs bring to the table for enterprises running data analytics applications?
Benefits of using SSDs for data analytics
The right kind of SSD for big data applications can give you gains of up to 70% in speed and performance. Here are some salient features of SSDs that are almost tailor-made for analytics:
Performance
Analytics applications tend to be read-intensive, pulling huge amounts of data through repeated sequential reads. In many enterprise systems, storage I/O is the biggest bottleneck here: multicore CPUs simply sit idle while random – and even sequential – I/O completes. SSDs, however, are fast enough to keep up with CPU throughput and let the application process data and run analytics at full capacity. This makes SSDs ideal for the I/O-bound component of big data analytics.
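One simple way to check whether storage, rather than the CPU, is the limiting factor is to time a sequential scan and compare the measured throughput against the drive’s rated speed. A minimal sketch, assuming a placeholder file path:

```python
# Minimal sequential-read throughput check; the file path is a placeholder.
import time

DATA_FILE = "dataset.bin"     # hypothetical large file on the drive under test
BLOCK = 8 * 1024 * 1024       # 8 MiB per read call

def sequential_throughput() -> float:
    """Read the file end to end and return the observed throughput in MB/s."""
    total, start = 0, time.perf_counter()
    with open(DATA_FILE, "rb") as f:
        while True:
            buf = f.read(BLOCK)
            if not buf:
                break
            total += len(buf)
    elapsed = time.perf_counter() - start
    return total / elapsed / 1e6

if __name__ == "__main__":
    print(f"Sequential read: {sequential_throughput():.0f} MB/s")
```

If the measured figure sits far below both the drive’s specification and the CPU’s appetite for data, the workload is I/O-bound and faster storage will pay off directly.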
Non-volatility
Like HDDs, SSDs retain data when power is turned off, even though they are built with flash cells. Unlike DRAM, data held on them does not need to be destaged to persistent storage.
Flexibility
Analytics apps have different requirements, depending on the kind of data they process and output as well as the infrastructure they run on. SSDs are available in a variety of form factors and interfaces (such as PCIe and SATA).
Reliability
SSDs are built with NAND flash cells, which wear out only as they are written to. However, today’s enterprise-class SSDs use techniques such as wear leveling and over-provisioning, so they remain superfast and perform consistently well even for write-intensive workloads. Most SSDs have a mean time to failure (MTTF) of 1 to 2 million hours, far longer than an average human lifetime.
Big data and analytics applications are often characterized by mixed read/write workloads that demand IOPS on a massive scale with very low latency. These requirements can only be met by enterprise-grade SSDs.
Low power consumption
Since SSDs contain no spinning platters or other moving parts, they consume far less power per device. This leads to overall savings on power and cooling at the data center or in on-premises infrastructure, especially when large-scale transactions generate massive amounts of data that must be processed.
Intelligent caching
SSDs in the host server can act as level-2 caches to hold data when it is moved out of memory – the software determines which blocks of data need to be stored in the cache. SSDs can also reside in a shared network appliance with network caching that accelerates all storage systems behind it. Here too, there are two types of caches: out-of-band (read-only) and in-band (write-back).
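To make the “software decides which blocks stay in the cache” idea concrete, here is a minimal, hypothetical sketch of a level-2 read cache: a fixed-size LRU map from block number to data, standing in for the caching layer that would sit in front of the backing store. Real caching software tracks access heat, dirty state and write-back policy as well; this only shows the core bookkeeping.

```python
# Minimal sketch of a level-2 read cache: an LRU map from block number to data.
# Illustrative only; names and sizes here are assumptions, not a real product API.
from collections import OrderedDict

class BlockCache:
    def __init__(self, capacity_blocks: int, backing_read):
        self.capacity = capacity_blocks
        self.backing_read = backing_read   # function: block_no -> bytes (slow tier)
        self.cache = OrderedDict()         # block_no -> bytes, kept in LRU order

    def read(self, block_no: int) -> bytes:
        if block_no in self.cache:                 # cache hit: serve from the fast tier
            self.cache.move_to_end(block_no)
            return self.cache[block_no]
        data = self.backing_read(block_no)         # cache miss: fetch from backing store
        self.cache[block_no] = data
        if len(self.cache) > self.capacity:        # evict the least recently used block
            self.cache.popitem(last=False)
        return data

# Usage sketch with a dummy backing store that returns empty 4 KiB blocks:
cache = BlockCache(capacity_blocks=1024, backing_read=lambda n: bytes(4096))
_ = cache.read(42)
```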
Low latency
PCIe-based SSDs running protocols such as NVMe leverage the full power of the hardware as well as the application, and keep data flowing through the system at breakneck speed. They have the lowest latency because they connect directly over PCIe, with no intervening host bus adapter.
Customized solutions from Phison for data analytics
Phison is known for its customizable SSD solutions that drive a variety of enterprise workloads, most of which have in-built analytics as an integral part of the application. These SSDs are pushing the boundaries of speed, performance and capacity while delivering just the results that enterprises want.
In 2019, Phison launched the world’s first PCIe Gen4x4 NVMe SSD solution – the E16 controller that set new performance records for storage with 5.5 GB/s for sequential reads and 4.4 GB/s for sequential writes. Just a year later, the second generation E18 controller became the fastest PCIe Gen4x4 NVMe SSD solution in the world, upping the standard to 7.4 GB/s for sequential reads and 7.0 GB/s for sequential writes.
For read-intensive analytics applications with extremely large-scale storage requirements, Phison’s S12DC controller provides a customizable and upgradable platform for SSDs with capacities up to 15.36 TB.
Taken as a single unit, storage arrays built with SSDs from Phison can provide some critical benefits to data analytics applications:
- Phison’s customized PCIe Gen4 SSD solutions separate storage from compute and do away with limits set by legacy controllers. This means training and control sets for machine learning can scale up to 1 PB without affecting performance.
- Phison’s NVMe SSD controllers also allow for dynamic provisioning of volumes over high-performance Ethernet networks.
- The high-speed, low-latency storage controllers allow every GPU node to have direct, parallel access to the media. This can make epoch times of ML algorithms up to 10x faster.
Data and analytics make or break a business today. Every facet of business – including entering new markets, launching new products, optimizing the supply chain and generating new revenue streams – requires some form of analytics and data governance. And as we’ve seen, the role of IT infrastructure at large and SSDs in particular cannot be ignored when it comes to ensuring the timeliness, usefulness and reliability of data.