The data field is full of vague terms, and many people are left wondering what they all mean. What is the difference between artificial intelligence and machine learning? And are Data Lake and Big Data the same thing?

The terms are sometimes used in a slightly misleading way, so it is no wonder that newcomers find them difficult. The whole industry is also changing so fast that there are no officially approved definitions.

This list is for anyone who works with data or is otherwise interested in using data for business development.

This list is made to be shared and used by all. We will also update it whenever new terms come up. The list will certainly never be perfect, as the industry is, of course, constantly evolving.

Also please check out our comprehensive range of data related trainings from top trainers from all over the world.

DATAPEDIA – DATA FIELD TERMINOLOGY

AI (machine intelligence, artificial intelligence) is often used as an umbrella term for a program or algorithm that has human-like abilities to learn. General AI refers to advanced learning and human-like reasoning; this level does not yet exist. Narrow AI refers to more limited artificial intelligence that already works well today. Its practical applications are based on automating analytics without a person programming the individual measures and steps of the analysis. Examples include face and speech recognition and language translation, or software that monitors a production line and learns to define failure-predicting metrics without a human defining them.

Algorithm is a detailed description or set of instructions for how a task or process is performed, which can be followed to solve a specific problem. In data processing, it means the automation described above, and in the context of machine learning we often talk about learning algorithms, i.e. algorithms that improve themselves based on feedback.

Big Data as a term refers to the storage and processing of large masses of data, including unstructured data such as text, image, video, and audio data. New technologies have been developed for Big Data, the first of which was Hadoop. Today Big Data technology also comes packaged in the solutions of the big vendors. Compared to relational databases, these technologies make it cheaper and easier to store large data sets without knowing the data structures in advance, but retrieving the data is more challenging.

Business Glossary is like a “dictionary” for an organization’s data area that describes and defines key business terms such as customer, product, or project. A Business Glossary assists in many data projects, such as Data Governance programs or the launch of a data platform. Concept modeling is one good method for compiling it.

Business Intelligence (BI) is business reporting, analysis, and visualization using BI software. Examples of such software are Power BI, Tableau, Looker, and ThoughtSpot. Self-Service BI means that business users can analyze the information that is important to them without the support of an IT expert. BI helps companies grow their business through better use of information.

Business / Data Dashboard is like a car dashboard: it aims to track key business variables (KPIs) in real time. BI software usually provides dashboards.

Concept modeling (entity modeling, entity relationship, ER) is a method for modeling the concepts and data of a specific target area from a business perspective, in an application- and technology-independent way. The result is a graphical concept model describing the concepts and their interdependencies, together with definitions of the concepts. It helps in communication, e.g. between business and IT or with vendors, because common definitions help everyone to “speak the same language”. Concept models are useful as such in business data mapping, and they also serve as a basis for, for example, database solutions, data warehousing, or ERP system selection.

Cloud services (Internet services, cloud computing) means running systems, databases, or files on servers on the Internet, as an alternative to an organization’s own servers. In the data warehouse area, examples include Snowflake, Amazon Redshift, Azure SQL Data Warehouse, and several BI tools. Cloud services are essential for Big Data solutions because they significantly reduce storage costs; the storage solution in this case is often Hadoop. The advantages are fast start-up, easy scaling up and down, and paying only for the services needed at any given time. The largest providers of cloud services are Amazon, Microsoft, and Google.

Data analytics (big data analytics) covers both descriptive and predictive analytics. If the analytics functions are automated, it is also called Advanced Analytics. Descriptive Analytics mainly produces summaries, whereas Predictive Analytics refers to business forecasting using statistics and mathematical models. Examples of Data Analytics are customer churn analysis, machine and equipment failure prediction, and disease outbreak prediction in health care.

Database is a set of related data stored in such a way that the data can be shared. It is implemented with a Database Management System, the most common of which are Oracle, SQL Server, MySQL, DB2, PostgreSQL, and MongoDB. The majority of current databases are relational databases and a smaller proportion are NoSQL databases; so-called network and hierarchical databases were also used in the past. A typical operational system consists of a program component and an underlying database. Databases are also the basic implementation technology of data warehouses.

Data catalog helps companies find out what data they have, where it is located, and how the data sets relate to each other. It can be called an organization’s “metadata library”. Data catalogs have grown in popularity in recent years precisely because companies have a record number of IT systems, applications, databases, and files at their disposal. Data is “hidden” in these systems, and data catalogs help with this challenge.

Data Cleaning means checking the data stored in the data warehouse and correcting or deleting any incorrect, incomplete, or duplicate data. It is often said that Data Science is 80% data cleaning and 20% actual analysis. This is because the data comes in many different formats and is in many cases incorrect.
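As a rough illustration of the three problem types named above (incorrect, incomplete, and duplicate data), here is a minimal Python sketch. The record layout, the function name clean_records, and the rule that an email must contain "@" are assumptions made for this example only; real cleaning rules depend on the data at hand.

```python
def clean_records(records):
    """Return records with incomplete, invalid, and duplicate rows removed."""
    seen_emails = set()
    cleaned = []
    for record in records:
        # Normalize whitespace and letter case before comparing values.
        name = (record.get("name") or "").strip()
        email = (record.get("email") or "").strip().lower()
        if not name or "@" not in email:
            continue  # incomplete or obviously incorrect row (assumed rule)
        if email in seen_emails:
            continue  # duplicate of a row that was already kept
        seen_emails.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned


# Hypothetical sample data with one duplicate, one incomplete, one invalid row.
raw_customers = [
    {"name": "Anna Virtanen", "email": "anna@example.com"},
    {"name": " anna virtanen ", "email": "ANNA@example.com "},  # duplicate
    {"name": "", "email": "nobody@example.com"},                # missing name
    {"name": "Ben Smith", "email": "ben.example.com"},          # invalid email
    {"name": "Carla Lopez", "email": "carla@example.com"},
]

cleaned_customers = clean_records(raw_customers)
print(cleaned_customers)
```

Only the first and last rows survive here; the duplicate is detected by its normalized email address.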

Data Governance (information management model) refers to the policies, guidelines, and rules by which an organization handles, stores, and utilizes data. It also includes roles, responsibilities, and enforcement of regulations (e.g. GDPR).

Data Lake (data pool) is an architectural solution for file-based storage of data. It quickly stores large amounts of data, including unstructured data. Due to the low cost of storage, data can be collected extensively, and only later does one need to decide what is really needed. It is not so well suited for standard and production reporting (vs. data warehousing and BI). Indeed, there is often talk of supplementing the data warehouse with a data lake, not replacing it. A data lake can also act as the staging area of a data warehouse. Data scientists, researchers, and AI / ML developers are typical users of a Data Lake.

Data Lakehouse is a term popularized by Databricks for a data architecture that integrates a data lake and a data warehouse into a single platform environment. Today, its users are primarily data scientists and AI / ML developers. The approach is also popular with developers of BI and other production data analytics.

Deep Learning is an area of machine learning that seeks to mimic brain function, most commonly by using so-called neural networks. Breakthroughs in artificial intelligence, for example in machine vision, speech recognition, and language translation, have in some cases led to results similar to or better than those of human experts. The rapid development of Deep Learning methods is key to this.

Data management is an umbrella term that includes the development, management, operations and practices related to data utilization throughout the information lifecycle. The area includes e.g. data strategy, data governance, concept modeling, data warehousing, security issues, data architecture, storage solutions and data quality.

Data Mapping (building a data map, information map) is a method that provides an overall picture of the data resources of a company or organization, or of a part of it. It helps to describe the key concepts of the organization in a clear way and to improve communication between IT and business. The goal is to increase and document the understanding of scattered data (and thus of the business), which is central to all digitalization and data projects in an organization, such as information system acquisition or expansion, data warehousing / BI projects, and Big Data development.

Datamart is a local data warehouse where data is processed to facilitate reporting. It can be designed, for example, with star modeling or as a so-called wide table. BI products are usually good at reading Datamarts. A Datamart can be standalone or derived from the data warehouse.
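To make the star model more concrete, the following Python sketch builds a tiny star schema in an in-memory SQLite database: one fact table surrounded by two dimension tables. All table names, column names, and figures are hypothetical and chosen only for illustration; SQLite stands in for whatever database a real Datamart would use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One fact table (measurements) referencing two dimension tables (context).
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'Bikes'), (2, 'Helmets');
    INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
    INSERT INTO fact_sales VALUES (1, 10, 500.0), (1, 11, 700.0), (2, 11, 50.0);
""")

# A typical reporting query: join the fact to its dimensions and aggregate.
sales_by_category_and_year = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()

print(sales_by_category_and_year)
```

A wide table would instead store category, year, and amount together on every row, trading redundancy for simpler queries.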

Data Mesh is a new concept that describes a distributed data architecture (cf. centralized data architecture). Central to it are APIs, microservices, data products, and so-called domain-driven design (business-oriented design). Data Mesh has gained great popularity in a short time and is used by companies such as Netflix, Zalando, and other online services.

Data modeling is a method of describing data and the dependencies between them, with the aim of designing the structure of the database. Usually the next step after concept modeling. Specialized modeling methods are used in the design of data warehouses, such as dimensional design (star model) and Data Vault.

Data Platform is a versatile platform solution for utilizing data. The term often refers to Microsoft Azure, Amazon’s AWS, or Google’s GCP. These vendors have their own native applications for AI / ML development, application development, data warehousing, data lakes, and BI reporting, among many others. The strengths are versatility and easy implementation; the weaknesses are the deficient features of certain products and the limited “cooperation” between ecosystems.

Data Quality Management is concerned with ensuring data quality. In order for the data to be analyzed and utilized, the quality of the data must be adequate. The Data Quality process includes data testing and quality management guidance. This also requires organized roles and work tasks, such as data stewards. Utilizing machine learning also requires good enough data quality.

Data Science is related to Data Analytics, but it emphasizes a research approach: hypotheses are tested according to principles familiar from science. It includes, among other things, data analysis based on experimental setups, using statistical methods and machine learning. A Data Scientist examines data and finds patterns. The key skill of a Data Scientist is to find new questions whose answers develop the business further.

Data Steward is a designated person within an organization (on the business side) who, as a data content expert, is responsible for data in a specific area or organizational unit, including quality control. The data steward helps colleagues and stakeholders to utilize the data and participates in the development of data management solutions in his or her own area.

Data Virtualization (logical data warehouse) technology can be used to create a combined view of many data sources. Software products include e.g. Denodo, Red Hat JBoss, and TIBCO. For example, several relational databases and Excel, XML, and Hadoop files can be connected to them. Definitions are built as layered views, with everything at the top level appearing as tables that can be queried with SQL and with countless BI tools. The data can thus be combined while keeping it in its original locations. Data virtualization can supplement or even replace a physical data warehouse. Performance can be a challenge, because data is retrieved and combined “on the fly”. It does not replace the data warehouse’s ability to store data.

Data Warehouse (DW) is a separate database into which data scattered across different systems is extracted and loaded for reporting, analytics, and other uses. The idea is to combine the data and make it available and easy to query, for example as a so-called 360 view of the customer. Implementation is usually on relational databases, designed with dedicated methods such as the star model and Data Vault. See also Data Platform. Data is typically stored at a detailed (not aggregated) level and loaded daily, or even in real time if necessary. Data warehousing can also be used to manage data history. It is not a single technology but an architecture.

Enterprise data warehouse (EDW) is a corporate or enterprise-level centralized data warehouse where the idea is to combine and integrate data extensively from different enterprise information systems to get an overall picture. What is important is scalability, i.e. the EDW is gradually expanded by bringing in data from the systems of different organizational units. It requires good planning, i.e. data mapping and concept and data modeling.

ETL (Extract, Transform, Load) means extracting, transforming, and loading data: the data is extracted from the source system, transformed, and finally loaded into the data warehouse. In the load process, the data is converted to the structure of the data warehouse while integrating data from different source systems. The process usually also includes maintaining data history. For example, a typical ETL process could involve integrating customer data from a customer application into a company’s centralized data warehouse.
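The three ETL steps can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the source system is represented by a hypothetical CSV export, the data warehouse by an in-memory SQLite database, and the table name dw_customer and the function names are invented for the example. Data history is left out.

```python
import csv
import io
import sqlite3

# Hypothetical export from a customer application (the source system).
SOURCE_CSV = """customer_id,name,country
1, Anna Virtanen ,fi
2,Ben Smith,GB
"""


def extract(csv_text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: convert types and harmonize values to the target structure."""
    return [
        (int(row["customer_id"]), row["name"].strip(), row["country"].strip().upper())
        for row in rows
    ]


def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dw_customer "
        "(customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO dw_customer VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))

loaded_rows = warehouse.execute(
    "SELECT customer_id, name, country FROM dw_customer ORDER BY customer_id"
).fetchall()
print(loaded_rows)
```

Keeping the three steps as separate functions makes it easy to add further source systems that feed the same transform and load logic.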

Information architecture is a holistic architecture perspective that seeks to identify an organization’s information needs at the levels of strategic, tactical, and operational management, and describes the classification, structures, origins, and flow of information in processes and information systems. The information architecture also supports data management and the refinement of data into information and knowledge. It is used by information, process, and application architects alike.

Internet of Things (IoT, internet of everything) is the connection of machines and devices to the Internet and the combination of machine-generated data with other data. Now and in the future, almost all devices will be connected to the Internet, as smartphones already are, and analyzing this data brings a lot of opportunities.

Machine learning (machine intelligence) is a field of computer science in which software or a system “learns” from data without being explicitly programmed. Machine learning uses computational methods and statistics familiar from data analytics to gradually improve performance on a given task. Deep Learning falls into this category.
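A very small example of “learning” from data is fitting a straight line to observations and then using it to predict an unseen value. The Python sketch below uses ordinary least squares, one of the simplest statistical learning techniques; the variable names and the sample figures (machine load versus temperature) are invented for illustration, and real machine learning projects typically use dedicated libraries instead.

```python
def fit_line(xs, ys):
    """Learn slope and intercept of y = slope * x + intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: machine load (x) and measured temperature (y).
load_levels = [1, 2, 3, 4, 5]
temperatures = [12.0, 14.0, 16.0, 18.0, 20.0]

# The relationship is not programmed anywhere; it is estimated from the data.
slope, intercept = fit_line(load_levels, temperatures)

# Use the learned model to predict the temperature at an unseen load level.
predicted_temperature = slope * 6 + intercept
print(slope, intercept, predicted_temperature)
```

With this perfectly linear sample the fit recovers the underlying rule; with noisy real-world data the line would only approximate it.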

Master Data Management (MDM) has arisen from the need for better management of basic information that is fragmented, stored multiple times, and often at different levels. Master Data is long-lived, slowly changing information of interest to many units of a company or organization: in effect “basic registers” such as product and customer information. Transaction data is not Master Data. Master Data Management includes processes for keeping this important shared information up to date and of high quality across the organization.

Metadata is information about information, i.e. descriptive and defining information about a data resource or content unit. An example is the information about a text document (last saved, owner, version, location, etc.). Definitions of concepts in a concept model are also metadata. Metadata plays a key role in improving the use of information resources; well-defined metadata is, so to speak, documentation of tacit knowledge. Metadata can facilitate data transfer between information systems and the integration of content stored in different places. High-quality metadata can also improve the discoverability of information, so that search engines can find information more accurately and comprehensively.
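The text document example can be tried out directly: the file system already keeps metadata about every file. The following Python sketch writes a small temporary file and then reads its name, size, and last-modified time without looking at the content itself. The function name describe_file and the file name report.txt are assumptions for this example.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Collect basic metadata about a file without reading its content."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "last_modified": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "report.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("quarterly sales")  # the data itself
    file_metadata = describe_file(path)  # the data about the data

print(file_metadata)
```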

NoSQL databases are databases that are not based on the relational model and do not, in principle, support SQL the way relational databases do. NoSQL is often interpreted as “Not Only SQL”. Instead of a table structure, they have some other storage structure, such as a document, key-value, or graph structure. Examples are MongoDB and Neo4j. Typically, they enhance some area of operational activity and are often more scalable than relational databases. On the other hand, relational databases have gradually acquired many features of NoSQL databases.
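The three storage structures mentioned above can be illustrated with plain Python data structures. Note that this is only a sketch of the data shapes: it uses in-memory dictionaries and sets, not the actual APIs of MongoDB, Neo4j, or any other product, and all keys and values are hypothetical.

```python
import json

# Document structure: one self-contained, nested record per entity.
customer_document = {
    "_id": "c1",
    "name": "Anna",
    "orders": [{"sku": "bike", "qty": 1}, {"sku": "helmet", "qty": 2}],
}

# Key-value structure: an opaque value looked up by a single key.
key_value_store = {"session:42": json.dumps({"user": "c1", "cart_items": 3})}

# Graph structure: nodes connected by edges (here: who follows whom).
follows = {"anna": {"ben"}, "ben": {"carla"}, "carla": set()}


def reachable(graph, start):
    """Return every node that can be reached from start by following edges."""
    visited = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in graph.get(node, ()):
            if neighbour not in visited:
                visited.add(neighbour)
                stack.append(neighbour)
    return visited


total_items = sum(order["qty"] for order in customer_document["orders"])
session = json.loads(key_value_store["session:42"])
anna_network = reachable(follows, "anna")
print(total_items, session, anna_network)
```

In a relational database the same customer and orders would be split across two tables and joined at query time, and the graph traversal would require a recursive query.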

Open source refers to methods of producing and developing computer programs that give the user the opportunity to study the source code and modify it according to their own needs. The principles also include the freedom to use the program for any purpose and to copy and distribute both the original and the modified version. A significant proportion of new software is implemented in this way, often as large-scale volunteer collaboration projects. Examples: MariaDB, which is of Finnish origin, and Hadoop.

Relational database is a database technology based on the relational model published by E.F. Codd in 1970, which began to displace earlier types of databases in the 1990s. Relational databases are queried and updated with SQL. Almost all commercial information systems, and also data warehouses, are built on top of relational databases. The data is stored in the tables that make up the database. The most common products are Oracle, SQL Server, MySQL, DB2, and PostgreSQL. Products specialized in data warehousing include e.g. Teradata, Redshift, and Snowflake.

SQL is an expressive and widespread language, practically the only language for querying and processing relational databases. It was developed in an IBM lab in the 1970s. It is a non-procedural 4th generation language that defines what data is retrieved, not how the data is retrieved (that is decided by the optimizer of the relational database). SQL interfaces have also been built on top of Hadoop and many NoSQL databases. SQL serves as a good interface between relational databases and countless tools (e.g. BI and ETL products).
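The difference between saying what to retrieve and how to retrieve it can be shown by computing the same result twice. In the Python sketch below, the SQL statement only describes the desired result and leaves the execution plan to the database, while the loop spells out every step by hand. SQLite and the orders table with its sample rows are assumptions chosen for this example.

```python
import sqlite3

# Hypothetical order rows: (customer, amount).
orders = [("anna", 120.0), ("ben", 80.0), ("anna", 30.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)

# Non-procedural: state WHAT is wanted; the database decides HOW to get it.
declarative_totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

# Procedural: spell out HOW to compute the same totals step by step.
running_totals = {}
for customer, amount in orders:
    running_totals[customer] = running_totals.get(customer, 0.0) + amount
procedural_totals = sorted(running_totals.items())

print(declarative_totals)
print(procedural_totals)
```

Both approaches produce the same totals per customer, but only the SQL version lets the database choose indexes, join order, and other execution details on its own.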
