Big Data terminology can get confusing quickly. For those new to Big Data and data analytics, here is a quick glossary of terms to help. I’m sure there are more terms, but these are our favorites.
- ACID Test: A test applied to database transactions for atomicity, consistency, isolation, and durability.
- Ad Targeting: The attempt to reach a specific audience with a specific message, typically by either contacting them directly or placing contextual ads on the Web.
- Ad-Hoc Reporting: Reports generated for a one-time need.
- Aggregation: Collecting data from various databases for the purpose of data processing or analysis.
- Algorithm: A mathematical formula placed in software that performs an analysis on a set of data.
- Analytics: Using software-based algorithms and statistics to derive meaning from data.
- Analytics Platform: Software or software and hardware that provides the tools and computational power needed to build and perform many different analytical queries.
- Anomaly Detection: The process of identifying rare or unexpected items or events in a dataset that do not conform to other items in the dataset.
- Anonymization: The severing of links between people in a database and their records to prevent the discovery of the source of the records.
- API: An abbreviation for Application Programming Interface. A set of programming standards and instructions for accessing or building web-based software applications.
- Application: Software that is designed to perform a specific task or suite of tasks.
- Artificial Intelligence: The apparent ability of a machine to apply information gained from previous experience accurately to new situations in a way that a human would.
- Automated Analysis: Automatic analysis of data to find hidden insights in the data and show users the answers to questions they have not even thought of yet.
- Automatic Identification And Data Capture (AIDC): Any method of automatically identifying and collecting data on items, and then storing the data in a computer system. For example, a scanner might collect data about a product being shipped via an RFID chip.
- Behavioral Analytics: Using data about people’s behavior to understand intent and predict future actions.
- BI Analyst: As stated by modernanalyst.com, a data analyst is a professional who is in charge of analyzing and mining data to identify patterns and correlations, mapping and tracing data from system to system in order to solve a problem, using BI and data discovery tools to help business executives in their decision making, and performing statistical analysis of business data, among other things. Also called a Data Analyst.
- BI Governance: According to Boris Evelson of Forrester Research, BI governance is a key part of data governance, but it focuses on a BI system and governs who uses the data, when, and how.
- Big Data: Enormous and complex data sets that traditional data processing tools cannot deal with.
- Bottlenecks: Points of congestion or blockage that hinder the efficiency of the BI system.
- Brand Monitoring: The act of monitoring your brand’s reputation online, typically by using software to automate the process.
- Brontobyte: A unit that represents a very large number of bytes. Brontobyte has been proposed for a unit of measure for data beyond yottabyte scale, but is not yet an officially recognized unit.
- Business Intelligence (BI): The general term used for the identification, extraction, and analysis of data.
- Call Detail Record (CDR) Analysis: CDRs contain data that a telecommunications company collects about phone calls, such as time and length of call. This data can be used in any number of analytical applications.
- Cassandra: A popular choice of columnar database for use in big data applications. It is an open source database managed by The Apache Software Foundation.
- Cell Phone Data: Cell phones generate a tremendous amount of data, and much of it is available for use with analytical applications.
- Centralized Business Intelligence: A BI model that enables users to work connected and share insights while seeing a single version of the truth. IT governs data permissions to ensure data security.
- Classification Analysis: Data analysis for the purpose of assigning the data to a particular group or class.
- Clickstream Analytics: The analysis of users’ Web activity through the items they click on a page.
- Clojure: Clojure is a dynamic programming language based on LISP that uses the Java Virtual Machine (JVM). It is well suited for parallel data processing.
- Cloud: A broad term that refers to any Internet-based application or service that is hosted remotely.
- Clustering Analysis: Data analysis for the purpose of identifying similarities and differences among data sets so that similar data sets can be clustered together.
- Collaborative BI: An approach to Business Intelligence where the BI tool empowers users to collaborate with colleagues, share insights, and drive collective knowledge to improve decision making.
- Collective Knowledge: Knowledge that benefits the whole enterprise as it comes from the sharing of insights and data findings across groups and departments to enrich analysis.
- Columnar Database or Column-Oriented Database: A database that stores data by column rather than by row. In a row-based database, a row might contain a name, address, and phone number. In a column-oriented database, all names are in one column, addresses in another, and so on. A key advantage of a columnar database is faster hard disk access.
- Comparative Analysis: Data analysis that compares two or more data sets or processes to identify patterns in large data sets.
- Competitive Monitoring: Keeping tabs on competitors’ activities on the Web using software to automate the process.
- Complex Event Processing (CEP): CEP is the process of monitoring and analyzing all events across an organization’s systems and acting on them when necessary in real time.
- Complex Structured Data: Structured data that comprise two or more interrelated parts and, therefore, are difficult for structured query languages and tools to process.
- Comprehensive Large Array-Data Stewardship System (CLASS): A digital library of historical environmental data from satellites operated by the U.S. National Oceanic and Atmospheric Administration (NOAA).
- Computer-Generated Data: Any data generated by a computer rather than a human–a log file for example.
- Concurrency: The ability to execute multiple processes at the same time.
- Confabulation: The act of making an intuition-based decision appear to be data-based.
- Content Management System (CMS): Software that facilitates the management and publication of content on the Web.
- Correlation Analysis: A means to determine a statistical relationship between variables, often for the purpose of identifying predictive factors among the variables. Correlation refers to any of a broad class of statistical relationships involving dependence. Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price.
- Cross-Channel Analytics: Analysis that can attribute sales, show average order value, or show lifetime value across channels.
- Crowdsourcing: The act of submitting a task or problem to the public for completion or solution.
- Customer Relationship Management (CRM): Software that helps businesses manage sales and customer service processes.
- Dark Data: According to Gartner, dark data is “information assets that organizations collect, process and store in the course of their regular business activity, but generally fail to use for other purposes.” By some estimates, as much as 90% of a company’s data is dark data.
- Dashboard: A graphical report of static or real-time data on a desktop or mobile device. The data represented is typically high-level to give managers a quick view of status reports, KPIs, or performance.
- Data: A quantitative or qualitative value. Common types of data include sales figures, marketing research results, readings from monitoring equipment, user actions on a website, market growth projections, demographic information, and customer lists.
- Data Access: The act or method of viewing or retrieving stored data.
- Data Act: The Digital Accountability and Transparency Act of 2014. The U.S. law is intended to make information on federal government expenditures more accessible by requiring the Treasury Department and the White House Office of Management and Budget to standardize and publish U.S. federal spending data.
- Data Aggregation: The act of collecting data from multiple sources for the purpose of reporting or analysis.
- Data Analyst: A person responsible for the tasks of modeling, preparing, and cleaning data for the purpose of deriving actionable information from it.
- Data Analytics: The application of software to derive information or meaning from data. The end result might be a report, an indication of status, or an action taken automatically based on the information received.
- Data Architecture and Design: How enterprise data is structured. The actual structure or design varies depending on the eventual end result required. Data architecture has three stages or processes
- Data Center: A physical facility that houses a large number of servers and data storage devices. Data centers might belong to a single organization or sell their services to many organizations.
- Data Cleansing: The act of reviewing and revising data to remove duplicate entries, correct misspellings, add missing data, and provide more consistency.
- Data Collection: Any process that captures any type of data.
- Data Custodian: A person responsible for the database structure and the technical environment, including the storage of data.
- Data Democratization: The notion of making data available directly to workers throughout an organization, as opposed to having that data delivered to them by another party, often IT, within the organization.
- Data Exhaust: The data that a person creates as a byproduct of a common activity–for example, a cell call log or web search history.
- Data Feed: A means for a person to receive a stream of data. Examples of data feed mechanisms include RSS or Twitter.
- Data Governance: A set of processes or rules that ensure the integrity of the data and that data management best practices are met.
- Data Integration: The process of combining data from different sources and presenting it in a single view.
- Data Integrity: The measure of trust an organization has in the accuracy, completeness, timeliness, and validity of the data.
- Data Management: According to the Data Management Association, data management incorporates the following practices needed to manage the full data lifecycle in an enterprise
- Data Management Association (DAMA): A non-profit international organization for technical and business professionals “dedicated to advancing the concepts and practices of information and data management.”
- Data Marketplace: A place where people can buy and sell data online.
- Data Mart: The access layer of a data warehouse used to provide data to users.
- Data Mashup: The integration of multiple data sets into a unified analytical and visual representation.
- Data Migration: The process of moving data between different storage types or formats, or between different computer systems.
- Data Mining: The process of deriving patterns or knowledge from large data sets.
- Data Model, Data Modeling: A data model defines the structure of the data for the purpose of communicating between functional and technical people to show data needed for business processes, or for communicating a plan to develop how data is stored and accessed among application development team members.
- Data Point: An individual item on a graph or a chart.
- Data Profiling: The process of collecting statistics and information about data in an existing source.
- Data Quality: The measure of data to determine its worthiness for decision making, planning, or operations.
- Data Replication: The process of sharing information to ensure consistency between redundant sources.
- Data Repository: The location of permanently stored data.
- Data Science: A recent term that has multiple definitions, but is generally accepted as a discipline that incorporates statistics, data visualization, computer programming, data mining, machine learning, and database engineering to solve complex problems.
- Data Scientist: A practitioner of data science.
- Data Security: The practice of protecting data from destruction or unauthorized access.
- Data Set: A collection of data, typically in tabular form.
- Data Silos: According to Tech Target, a data silo is “data that is under the control of one department or person and is isolated from the rest of the organization.” Data silos are a bottleneck for effective business operations.
- Data Sources: The sources from which the data to be analyzed comes. A source can be a file, a database, a dataset, etc. Modern BI solutions like Necto can mash up data from multiple data sources.
- Data Steward: A person responsible for data stored in a data field.
- Data Structure: A specific way of storing and organizing data.
- Data Visualization: The graphic visualization of data. Can include traditional forms like graphs and charts, and modern forms like infographics.
- Data Warehouse: A relational database that integrates data from multiple sources within a company.
- Data-Directed Decision Making: Using data to support making crucial decisions.
- Database: A digital collection of data and the structure around which the data is organized. The data is typically entered into and accessed via a database management system (DBMS).
- Database Administrator (DBA): A person, often certified, who is responsible for supporting and maintaining the integrity of the structure and content of a database.
- Database As A Service (DaaS): A database hosted in the cloud and sold on a metered basis. Examples include Heroku Postgres and Amazon Relational Database Service.
- Database Management System (DBMS): Software that collects and provides access to data in a structured format.
- De-Identification: The act of removing all data that links a person to a particular piece of information.
- Demographic Data: Data relating to the characteristics of a human population.
- Embedded Analytics: The integration of reporting and data analytic capabilities in a BI solution. Users can access full data analysis capabilities without having to leave their BI platform.
- Excel Hell: A situation where the enterprise is full of unnecessary copies of data, thousands of spreadsheets get shared, and no one knows with certainty which is the most updated and real version of the data.
- Federated Business Intelligence: A BI model where users work in separate desktops, creating data silos and unnecessary copies of data, leading to multiple versions of the truth.
- Geo-Analytic Capabilities: The ability that a BI or data discovery tool has to analyze data by geographical area and reflect such analysis on maps on the user’s dashboard.
- In-Database Analytics: The integration of data analytics into the data warehouse.
- In-Memory Database: Any database system that relies on memory for data storage.
- Infographics: Visual representations of data that are easily understandable and drive engagement.
- Information Management: The practice of collecting, managing, and distributing information of all types–digital, paper-based, structured, unstructured.
- Insights: According to Forrester Research, insights are “actionable knowledge in the context of a process or decision.”
- Internet of Things (IoT): The network of physical objects or “things” embedded with electronics, software, sensors and connectivity to enable it to achieve greater value and service by exchanging data with the manufacturer, operator and/or other connected devices. Each thing is uniquely identifiable through its embedded computing system but is able to interoperate within the existing Internet infrastructure.
- Kafka: An open-source messaging system, originally developed at LinkedIn and now an Apache Software Foundation project, used to monitor activity events on the web.
- KPI: Key Performance Indicator. A quantifiable measure that a business uses to determine how well it meets its operational and strategic goals. KPIs give managers insight into what is happening at any specific moment and allow them to see in what direction things are going.
- Latency: Any delay in a response or delivery of data from one point to another.
- Legacy System: Any computer system, application, or technology that is obsolete, but continues to be used because it performs a needed function adequately.
- Linked Data: As described by World Wide Web inventor Tim Berners-Lee, “Cherry-picking common attributes or languages to identify connections or relationships between disparate sources of data.”
- Load Balancing: The process of distributing workload across a computer network or computer cluster to optimize performance.
- Mashup: The process of combining different datasets within a single application to enhance output–for example, combining demographic data with real estate listings.
- Massively Parallel Processing (MPP): The act of processing a program by breaking it up into separate pieces, each of which is executed on its own processor, operating system, and memory.
- Master Data Management (MDM): Master data is any non-transactional data that is critical to the operation of a business–for example, customer or supplier data, product information, or employee data. MDM is the process of managing that data to ensure consistency, quality, and availability.
- Metadata: Any data used to describe other data–for example, a data file’s size or date of creation.
- Modern BI: An approach to BI using state-of-the-art technology, providing a centralized and secure platform where business users can enjoy self-service capabilities and IT can govern data security.
- MongoDB: An open-source NoSQL database managed by MongoDB, Inc. (formerly 10gen).
- MPP Database: A database optimized to work in a massively parallel processing environment.
- Multi-Threading: The act of breaking up an operation within a single computer system into multiple threads for faster execution.
- Multidimensional Database: A type of database that stores data as multidimensional arrays, or “cubes,” as opposed to the row-and-column storage structure of relational databases. This enables data to be analyzed from different angles for complex queries and online analytical processing (OLAP) applications.
- Natural Language Processing: The ability of a computer program or system to understand human language. Applications of natural language processing include enabling humans to interact with computers using speech, automated language translation, and deriving meaning from unstructured data such as text or speech data.
- NoSQL: A class of database management system that does not use the relational model. NoSQL is designed for very large data volumes that do not follow a fixed schema and do not require the relational model.
- Object-Oriented Database: A database management system in which information is represented as objects, as used in object-oriented programming, rather than as simple data such as integers or numbers. Also called Object Database Management Systems (ODBMS).
- Online Analytical Processing (OLAP): The process of analyzing multidimensional data using three operations: consolidation (the aggregation of available data), drill-down (the ability for users to see the underlying details), and slice and dice (the ability for users to select subsets and view them from different perspectives).
- Online Transactional Processing (OLTP): The process of recording large amounts of transactional data as transactions occur and providing users with access to it, typically through many short insert, update, and query operations.
- Open Data Center Alliance (ODCA): A consortium of global IT organizations whose goal is to speed the migration to cloud computing.
- Open Source Software: Software with source code that is made available by the copyright holder free of charge to the general public. This code may be redistributed, and anyone can inspect and change it.
- OpenDremel: The open source version of Google’s BigQuery Java code. It is being integrated with Apache Drill.
- OpenPOWER Foundation: A collaborative organization initiated by IBM in 2013 as part of its effort to open up its Power Architecture products to a collaborative development approach. The foundation’s goal, according to its mission statement, is “to create an open ecosystem, using the POWER Architecture to share expertise, investment, and server-class intellectual property to serve the evolving needs of customers and industry.”
- Operational Data Store (ODS): A location to gather and store data from multiple sources so that more operations can be performed on it before sending to the data warehouse for reporting.
- Parallel Data Analysis: Breaking up an analytical problem into smaller components and running algorithms on each of those components at the same time. Parallel data analysis can occur within the same system or across multiple systems.
- Parallel Method Invocation (PMI): Allows programming code to call multiple functions in parallel.
- Parallel Processing: The ability to execute multiple tasks at the same time.
- Parallel Query: A query that is executed over multiple system threads for faster performance.
- Pattern Recognition: The classification or labeling of an identified pattern in the machine learning process.
- Performance Management: The process of monitoring system or business performance against predefined goals to identify areas that need attention.
- Petabyte: One million gigabytes, or 1,000 terabytes. In the binary convention, 1,024 terabytes.
- Platform-as-a-Service (PaaS): A cloud computing model that provides, over the internet as a service, a platform that includes hardware and software tools that enable customers to develop and manage applications.
- Predictive Analytics: Using statistical functions on one or more datasets to predict trends or future events.
- Predictive Modeling: The process of developing a model that will most likely predict a trend or outcome.
- Presto: An open source, distributed SQL query engine for running high-speed interactive analytics. Presto was designed to run interactive analytics while scaling up to petabyte size. It is released under the Apache License, and contributors include Airbnb, Facebook, and Netflix.
- Privacy: The need and/or requirement to control access to and dissemination of sensitive, personal, and personally identifiable information in an organization’s data stores.
- Quantified Self: A movement marked by people’s use of data, often collected by sensors in wearable devices, to analyze factors such as health, activity levels, and sleep quality for the purpose of greater self-knowledge and improvement.
- Query Analysis: The process of analyzing a search query for the purpose of optimizing it for the best possible result.
- Radio-Frequency Identification (RFID): A technology that uses wireless communications to send information about an object from one point to another.
- Real Time: A descriptor for events, data streams, or processes that have an action performed on them as they occur.
- Recommendation Engine: An algorithm that analyzes a customer’s purchases and actions on an e-commerce site and then uses that data to recommend complementary products.
- Records Management: The process of managing an organization’s records throughout their entire lifecycle, from creation to disposal.
- Reference Data: Data that describes an object and its properties. The object may be physical or virtual.
- Report: The presentation of information derived from a query against a dataset, usually in a predetermined format.
- Risk Analysis: The application of statistical methods on one or more datasets to determine the likely risk of a project, action, or decision.
- Root-Cause Analysis: The process of determining the main cause of an event or problem.
- Scalability: The ability of a system or process to maintain acceptable performance levels as workload or scope increases.
- Schema: The structure that defines the organization of data in a database system.
- Search Data: Aggregated data about search terms used over time.
- Self-Service BI: An approach that allows business users to access and work with data sources even though they do not have an analyst or computer science background. They can access, profile, prepare, integrate, curate, model, and enrich data for analysis and consumption by BI platforms. In order to have successful self-service BI, the BI tool must be centralized and governed by IT.
- Semantic Web: A project of the World Wide Web Consortium (W3C) to encourage the use of a standard format to include semantic content on websites. The goal is to enable computers and other devices to better process data.
- Semi-Structured Data: Data that is not structured by a formal data model, but provides other means of describing the data and hierarchies.
- Sentiment Analysis: The application of statistical functions on comments people make on the web and through social networks to determine how they feel about a product or company.
- Server: A physical or virtual computer that serves requests for a software application and delivers those requests over a network.
- Smart Data: Smaller data sets from Big Data that are valuable to the enterprise and can be turned into actionable data.
- Smart Data Discovery: The processing and analysis of Smart Data to discover insights that can be turned into actions to make data-driven decisions in an organization.
- Smart Grid: The smart grid refers to the concept of adding intelligence to the world’s electrical transmission systems with the goal of optimizing energy efficiency. Enabling the smart grid will rely heavily on collecting, analyzing, and acting on large volumes of data.
- Smart Meter: An electrical meter that monitors and reports energy usage and is capable of two-way communication with the utility.
- Social BI: An approach where social media capabilities, such as social networking, crowdsourcing, and thread-based discussions are embedded into Business Intelligence so that users can communicate and share insights.
- Social Enterprise: An enterprise that has a new level of corporate connectivity, leveraging the social grid to share and collaborate on information and ideas. It drives a more efficient operation where problems are uncovered and fixed before they can affect the revenue streams.
- Software as a Service (SaaS): Application software that is used over the web by a thin client or web browser. Salesforce is a well-known example of SaaS.
- Solid-State Drive (SSD): Also called a solid-state disk, a device that uses memory ICs to persistently store data.
- State Of The Art BI: The highest level of technology, the most up-to-date features, and the best analysis capabilities in a Business Intelligence solution.
- Storage: Any means of storing data persistently.
- Storm: An open-source distributed computation system designed for processing multiple data streams in real time.
- Structured Data: Data that is organized by a predetermined structure.
- Structured Query Language (SQL): A programming language designed specifically to manage and retrieve data from a relational database system.
- Suggestive Discovery Engine: An engine behind the program that recommends to the users the most relevant insights to focus on, based on personal preferences and behavior.
- Systems of Insight: This is a term coined by Boris Evelson, VP of Forrester Research. It is a Business Intelligence system that combines data availability with business agility, where both IT and business users work together to achieve their goals.
- Terabyte: 1,000 gigabytes.
- Text Analytics: The application of statistical, linguistic, and machine learning techniques on text-based sources to derive meaning or insight.
- Transactional Data: Data that changes unpredictably. Examples include accounts payable and receivable data, or data about product shipments.
- Transparency: As more data becomes openly available, the idea of proprietary data as a competitive advantage is diminished.
- Unstructured Data: Data that has no identifiable structure – for example, the text of email messages.
- Workboards: An interactive data visualization tool. It is like a dashboard that displays the current status of KPIs and other data analysis, with the possibility to work directly on it and do further analysis.
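
To make the Columnar Database entry in this glossary a little more concrete, here is a minimal Python sketch of the same records held row by row and column by column. The sample records, field names, and the `to_columns` helper are all hypothetical and exist only for illustration; a real columnar database stores and compresses columns on disk in far more sophisticated ways.

```python
# Hypothetical sample records; the names and values are made up for illustration.
rows = [
    {"name": "Ada", "address": "1 Main St", "phone": "555-0100"},
    {"name": "Bob", "address": "2 Oak Ave", "phone": "555-0101"},
    {"name": "Cy", "address": "3 Elm Rd", "phone": "555-0102"},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Row-oriented read: every record is visited to collect a single field.
names_from_rows = [record["name"] for record in rows]

# Column-oriented read: the "name" column is already stored together.
names_from_columns = columns["name"]
```

Both reads return the same list of names. The difference is in how much data has to be touched to get it, which is why analytical queries that scan only a few fields tend to favor a column-oriented layout.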