Helping computers learn to tackle big-data problems outside their comfort zones
Imagine combing through thousands of mugshots desperately looking for a match. If time is of the essence, the faster you can do this, the better. A*STAR researchers have developed a framework that could help computers learn how to process and identify these images both faster and more accurately [1].
Peng Xi of the A*STAR Institute for Infocomm Research notes that the framework can be used for numerous applications, including image segmentation, motion segmentation, data clustering, hybrid system identification and image representation.
A conventional way that computers process data is called representation learning. This involves identifying a feature that allows the program to quickly extract relevant information from the dataset and categorize it — a bit like a shortcut. Supervised and unsupervised learning are two of the main methods used in representation learning. Unlike supervised learning, which relies on costly labeling of data prior to processing, unsupervised learning involves grouping or ‘clustering’ data in a similar manner to our brains, explains Peng.
Subspace clustering is a form of unsupervised learning that seeks to fit each data point into a low-dimensional subspace to find an intrinsic simplicity that makes complex, real-world data tractable. Existing subspace clustering methods struggle to handle ‘out-of-sample’, or unknown, data points and the large datasets that are common today.
“One of the challenges of the big-data era is to organize out-of-sample data using a machine learning model based on ‘in-sample’, or known, observational data,” explains Peng who, with his colleagues, has proposed three methods as part of a unified framework to tackle this issue. These methods differ in how they implement representation learning; one focuses on sparsity, while the other two focus on low rank and grouping effects. “By solving the large-scale data and out-of-sample clustering problems, our method makes big-data clustering and online learning possible,” notes Peng.
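As a rough illustration of how the three variants differ, the sketch below scores one coefficient matrix under three penalties that the subspace clustering literature commonly associates with sparsity, low rank and grouping effects: the l1 norm, the nuclear norm and the squared Frobenius norm. The article does not say which penalties Peng's team uses, so this pairing, the function name `penalties` and the toy matrix are assumptions made for illustration.

```python
import numpy as np


def penalties(c):
    """Score a coefficient matrix under three common regularizers.

    Assumption: these are the textbook penalties for sparse, low-rank and
    grouping-effect representations, not necessarily the team's exact choices.
    """
    return {
        "l1": float(np.abs(c).sum()),                       # encourages sparsity
        "nuclear": float(np.linalg.norm(c, "nuc")),         # encourages low rank
        "frobenius_sq": float(np.linalg.norm(c, "fro") ** 2),  # grouping effect
    }


# Hypothetical toy coefficient matrix, invented for this example.
scores = penalties(np.diag([3.0, 4.0]))
print(scores)
```

In a representation-based method, each data point is written as a combination of the other points, and the chosen penalty shapes the resulting coefficient matrix before it is used to group the data.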
The framework devised by the team splits input data into ‘in-sample’ and ‘out-of-sample’ data during an initial ‘sampling’ step. Next, the in-sample data is grouped into subspaces during the ‘clustering’ step, after which each out-of-sample point is assigned to its nearest subspace and designated as a member of the corresponding cluster.
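The three steps can be sketched as follows. This is a minimal illustration rather than the team's method: the clustering step uses a simple k-subspaces loop with a greedy seeding as a stand-in, all function names and parameters are hypothetical, and the toy data (two noise-free lines in 3-D) is invented for the example.

```python
import numpy as np


def fit_basis(points, dim):
    """Orthonormal basis (columns) of the best-fit linear subspace through the origin."""
    _, _, vt = np.linalg.svd(points, full_matrices=False)
    return vt[:dim].T


def residuals(points, bases):
    """Distance from every point to every subspace; shape (n_points, n_subspaces)."""
    return np.stack(
        [np.linalg.norm(points - points @ b @ b.T, axis=1) for b in bases], axis=1
    )


def cluster_in_sample(x, n_clusters, dim, n_iter=20):
    """Stand-in for the clustering step: k-subspaces with greedy seeding.

    Assumes no point is the zero vector and every cluster keeps at least
    `dim` points; a production method would need to handle both cases.
    """
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    seeds = [0]
    for _ in range(n_clusters - 1):
        # Next seed: the point least aligned with the seeds chosen so far.
        align = np.abs(unit @ unit[seeds].T).max(axis=1)
        seeds.append(int(align.argmin()))
    # Initial labels: the seed direction each point is most aligned with.
    labels = np.abs(unit @ unit[seeds].T).argmax(axis=1)
    for _ in range(n_iter):
        bases = [fit_basis(x[labels == k], dim) for k in range(n_clusters)]
        labels = residuals(x, bases).argmin(axis=1)
    bases = [fit_basis(x[labels == k], dim) for k in range(n_clusters)]
    return labels, bases


def cluster_with_out_of_sample(x, n_clusters, dim, n_in_sample, seed=0):
    """Sampling, clustering, then nearest-subspace assignment."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    in_idx, out_idx = order[:n_in_sample], order[n_in_sample:]  # sampling step
    in_labels, bases = cluster_in_sample(x[in_idx], n_clusters, dim)  # clustering step
    labels = np.empty(len(x), dtype=int)
    labels[in_idx] = in_labels
    # Each out-of-sample point joins the cluster of its nearest subspace.
    labels[out_idx] = residuals(x[out_idx], bases).argmin(axis=1)
    return labels


# Toy data: 50 points on each of two lines through the origin in 3-D.
t = np.linspace(0.5, 2.0, 50)
x = np.vstack([np.outer(t, [1.0, 0.0, 0.0]), np.outer(t, [1.0, 1.0, 1.0])])
truth = np.repeat([0, 1], 50)
labels = cluster_with_out_of_sample(x, n_clusters=2, dim=1, n_in_sample=40)
agreement = float((labels == truth).mean())
print("agreement up to label swap:", max(agreement, 1.0 - agreement))
```

Only the in-sample points go through the comparatively expensive clustering step; every other point costs one projection per subspace, which is where the savings on large datasets would come from in a scheme of this shape.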
The team tested their approach on datasets covering several types of information: facial images, handwritten and digital text, poker hands and forest coverage. They found that their methods outperformed existing algorithms and reduced the computational complexity (and hence running time) of the task while still ensuring cluster quality.
Learn more: Thinking outside the sample