The volume of data generated every day is impressive, and it is expected to keep growing at an exponential rate in the coming years. An estimated 2,500,000 TB of data are created worldwide every day [1], generated by the many digital tools and platforms used for communication, entertainment, and business. This includes data from social media posts, online searches, emails, mobile phone usage, and IoT devices, as well as data generated by sensors and other specialized equipment in various industries.
Apart from being generated at ever higher rates, data is also in high demand and is consumed by many parties every day, from medical appointments to online shopping to simply subscribing to websites, blogs, and online periodicals. Do we really know where our data resides or whether it is being shared? Often we do not, yet we hope and pray that it is protected.
Studies show that the majority of data management decision makers believe their data is difficult to interpret and lacks observability, with challenges in fully understanding what data they currently hold, how it is used, and who the owner is [2].
The high demand for data drives the trends in storage formats and repositories - should you choose a data lake, a data warehouse, or a Postgres database? The decision needs to be workload-specific, but you must first understand the differences between the options.
What is a data lake?
A data lake is a repository of unstructured data collected from various sources. It stores raw data, without a defined format, captured from IoT devices and business applications. The data is typically used for full-text searches and analytics.
Because the input sources are so varied, the data is usually not clean and can be very complex to consume. In most cases, it takes a data scientist to build the complex queries required to deliver reports. Business professionals are attracted to data lakes for their data diversity and the speed at which data can be consumed. These features, along with the ability to store extremely large data volumes, have led to their increased adoption.
What is a data warehouse?
A data warehouse is a repository of semi-structured data, where the primary use case is reporting.
The data sets are traditionally composed of historical transactional data, which is queried to generate custom reports. The data sources are mainly operational in nature, and ETL tools move the data over and group it into semi-structured formats. The typical trait is denormalization or the use of the third normal form (3NF). Although a data warehouse is frequently referred to as a decision support system (DSS), there are three types of data warehouses:
|Enterprise data warehouse||A central repository combining data from multiple business units or departments.|
|Operational data store||Contains operational reporting data and metrics used to maintain the business, such as personal time and office location details.|
|Data mart||A divisional category within a warehouse typically housing data related to a particular business unit.|
What is a database?
A database is a structured filing system for storing data. When the term database is mentioned, it often refers to a relational database (RDBMS), but there are many types of databases.
Relational databases gained popularity in the 1980s and were later followed by object-oriented and NoSQL databases. As database technology improves, new types gain popularity, with the likes of cloud, graph and - would you believe it - self-driving databases.
Once you have decided on the use case and selected a database type, the real challenge is understanding who needs access to the data and how it will be secured.
|Accessing data||Securing data|
|Query access for ad-hoc reporting||Read/write access requirements|
|Reporting tools||Data masking requirements|
|Third-party app integration||Encryption requirements|
|Data exports||Auditing and compliance requirements|
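As a small illustration of what a data masking requirement can mean in practice, the sketch below hides all but the last few characters of a sensitive value before it is shown in a report. The function name `mask_value` and the default of four visible characters are assumptions made for this example, not features of any particular database product.

```python
def mask_value(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask all but the last `visible` characters of a string."""
    if visible <= 0 or len(value) <= visible:
        # Nothing can be revealed safely, so mask the whole value.
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]


if __name__ == "__main__":
    # A made-up card-like number, used only to show the effect.
    print(mask_value("4111111111111111"))
```

In a real deployment, masking rules would normally be enforced in the database or reporting layer rather than in application code, so that every access path sees the same policy.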
As you begin to focus on the tasks required, it can often be easier to harness your data by understanding the lifecycle and business processing needed for strategic decision making.
At the executive level, terms like roles, access, and masking may seem puzzling, as the audience may not all have the same technical skills, nor should they be expected to. This means we must focus on tasks at a higher level by grouping them into more business-related categories.
|Communication||Connectors, drivers, interfaces, tooling, languages|
|Ingestion||Sources of data, workflows, unstructured data|
|Analysis||Reporting, machine learning, predictive analytics and modeling|
|Storage||Location (lake, warehouse, database), data standards, sizing|
|Investment||Open source, open source-supported, closed source-supported|
How does your data impact the business?
Fully understanding the business categories allows you to focus on the areas where your business is impacted, which is the foundation for being a data-driven organization.
For example, the analysis category is leveraged for the decision-making process. Analyzing data results in increased customer satisfaction, focused sales campaigns, and improved operations.
The local grocer is a great example of using analysis to drive sales. Weekly sales and coupons should be derived from data that shows which markets an item typically sells in. In this way, analyzing existing data produces sales and increases the volume of shipments to that store. Using data to retain customers by carrying specific products will also impact the supply chain.
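The grocer example can be sketched in a few lines of code. The records below are invented sample data, and the function names `units_by_market` and `best_market` are assumptions for illustration. The point is only that a simple aggregation over existing sales data can indicate where a promotion is likely to land.

```python
from collections import defaultdict

# Hypothetical weekly sales records: (market, item, units_sold).
SALES = [
    ("north", "oat milk", 120),
    ("north", "salsa", 30),
    ("south", "oat milk", 15),
    ("south", "salsa", 210),
    ("north", "oat milk", 80),
]


def units_by_market(sales, item):
    """Total units sold per market for one item."""
    totals = defaultdict(int)
    for market, sold_item, units in sales:
        if sold_item == item:
            totals[market] += units
    return dict(totals)


def best_market(sales, item):
    """Market with the highest total for the item, or None if it never sold."""
    totals = units_by_market(sales, item)
    return max(totals, key=totals.get) if totals else None
```

With data like this, a coupon for a given item would be aimed at the market that `best_market` returns, or deliberately at a weaker market if the goal is to grow demand there.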
The impact of data shown through customer insights can be shared with other manufacturers or business partners, from whom you may also receive data. The ingestion of the data may be in the form of flat files or unstructured data. This process needs to be well managed as you stream data in and out of your pipelines.
The method used to move data through the pipeline, and the time it takes, are critical. Efficient front-end and back-end tools will define success. Overall, ingestion methods should deliver three main benefits, for both real-time and batch-based ingestion:
- Continuous integration
- Reduction in time
- Flexible architecture
The most important of the benefits may be flexible architecture, which needs to be defined before your process begins, along with the database type chosen for storing the data.
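One way to picture a flexible ingestion architecture is a single cleansing step that serves both batch and streaming sources. The sketch below is a minimal illustration under assumed conditions: the record layout (`item`, `units`) and the function names are hypothetical, and a production pipeline would add error reporting and loading into the target database.

```python
import csv
import io
from typing import Iterable, Iterator, Optional


def clean_record(row: dict) -> Optional[dict]:
    """Normalize one raw record; return None if it is unusable."""
    item = (row.get("item") or "").strip().lower()
    units = (row.get("units") or "").strip()
    if not item or not units.isdigit():
        return None
    return {"item": item, "units": int(units)}


def ingest(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield cleaned records one at a time, for batches and streams alike."""
    for row in rows:
        cleaned = clean_record(row)
        if cleaned is not None:
            yield cleaned


def read_flat_file(text: str) -> Iterator[dict]:
    """Parse CSV text lazily, one row at a time."""
    yield from csv.DictReader(io.StringIO(text))
```

Because `ingest` accepts any iterable, the same code path handles a flat file read in one batch and records arriving one by one from a stream, which keeps the cleansing rules in a single place.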
One known best practice for a data ingestion pipeline is protection of critical data. Among the various types of open source-based solutions, Fujitsu Enterprise Postgres can offer some key features to help harness and protect critical datasets.
Have you implemented data governance?
Achieving compliance is a critical aspect of your data governance strategy, and should be embedded into your framework. Fujitsu Enterprise Postgres provides the capabilities for data protection, allowing structured and unstructured data to be stored in encrypted tablespaces utilizing 256-bit encryption to assist in achieving PCI DSS compliance levels. Encryption and decryption are performed by manipulating data blocks instead of individual bits, resulting in minimal overhead.
Complementing the benefit of a flexible architecture, the Transparent Data Encryption (TDE) functionality does not add additional storage areas, because the size of the encrypted objects is not modified, and it also allows backups and logs to include the encrypted version, with no additional licensing required.
Is your data out of control?
In order to become data-driven, an organization must take control of its data through a cultural adoption of this strategy. It must take into account that in a typical environment, data is ingested from multiple sources, and control processes are mandatory.
Fujitsu Enterprise Postgres can be part of this solution as it provides, among others, the following relevant capabilities:
- Full-text searches, provided by the pg_bigm extension.
- Table reorganization (removing bloat from tables and indexes and reclaiming unused space) to aid in the control and flexibility of the data, provided by the pg_repack extension.
- Increased performance as you move data from flat files, cleanse it prior to loading, and build indexes, provided by High-Speed Data Load, which uses multiple parallel workers to simultaneously perform data conversion, table creation, and index creation.
- Enhanced query performance through a columnar index structure, provided by the Vertical Clustered Index.
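To give a feel for the first capability in the list, pg_bigm builds its full-text index on 2-grams (bigrams), that is, pairs of adjacent characters. The sketch below illustrates that general idea in plain Python. It is not pg_bigm's implementation, and the function names and sample documents are assumptions made for this example.

```python
from collections import defaultdict


def bigrams(text: str) -> set:
    """Return the set of 2-character substrings of `text`."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def build_index(docs: dict) -> dict:
    """Map each bigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in bigrams(text.lower()):
            index[gram].add(doc_id)
    return index


def search(index: dict, docs: dict, keyword: str) -> list:
    """Return the ids of documents that contain `keyword` as a substring."""
    keyword = keyword.lower()
    grams = bigrams(keyword)
    if grams:
        # Only documents containing every bigram of the keyword can match.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Keywords shorter than two characters fall back to a full scan.
        candidates = set(docs)
    # Recheck: sharing all bigrams does not guarantee a contiguous match.
    return sorted(d for d in candidates if keyword in docs[d].lower())
```

The index narrows the search to a small set of candidate documents, and the final substring check removes any false positives, which is why this style of index suits keyword search over free text.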
Harnessing your data requires teamwork and collaboration so you can gain full knowledge of your data, its types, and where it comes from. Once your organization accomplishes that, it can decide how to best leverage and share that data both inside and outside of the organization.
Imagine if a major grocer had no control over its data and no way to understand the typical shopper and how they discover new products - it would be difficult to drive targeted sales. This could be the difference between the major grocer and the average grocer.
Without data management controls and the ability to coherently harness your data, your investments may end up just like your data: out of control.
And one more thing
An additional method you can use to harness your data is to utilize a powerful database management tool - I will be discussing this further in my blog post next week. Be sure to check it out.