Helping computers learn to tackle big-data problems outside their comfort zones
Imagine combing through thousands of mugshots desperately looking for a match. If time is of the essence, the faster you can do this, the better. A*STAR researchers have developed a framework that could help computers learn how to process and identify these images both faster and more accurately.
Peng Xi of the A*STAR Institute for Infocomm Research notes that the framework can be used for numerous applications, including image segmentation, motion segmentation, data clustering, hybrid system identification and image representation.
A conventional way that computers process data is called representation learning. This involves identifying a feature that allows the program to quickly extract relevant information from the dataset and categorize it — a bit like a shortcut. Supervised and unsupervised learning are two of the main methods used in representation learning. Unlike supervised learning, which relies on costly labeling of data prior to processing, unsupervised learning involves grouping or ‘clustering’ data in a similar manner to our brains, explains Peng.
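To make the idea of unsupervised 'clustering' concrete, here is a minimal sketch using the classic k-means procedure. This is not the A*STAR team's algorithm; the two-blob toy data and the farthest-first initialisation are choices made purely for the illustration.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal k-means with farthest-first initialisation."""
    # seed with the first point, then repeatedly add the point
    # farthest from every centroid chosen so far
    centroids = [X[0]]
    for _ in range(k - 1):
        dist = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[dist.argmax()])
    centroids = np.array(centroids)
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each centroid moves to the mean of its cluster
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# two well-separated blobs of 20 points each, no labels supplied
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 2)),
               np.random.default_rng(2).normal(5, 0.1, (20, 2))])
labels = kmeans(X, 2)
```

The algorithm never sees any labels; the grouping emerges from the geometry of the data alone, which is what distinguishes this from the supervised setting described above.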
Subspace clustering is a form of unsupervised learning that seeks to fit each data point into a low-dimensional subspace to find an intrinsic simplicity that makes complex, real-world data tractable. Existing subspace clustering methods struggle to handle ‘out-of-sample’, or unknown, data points and the large datasets that are common today.
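The "low-dimensional subspace" idea can be illustrated in a few lines (a toy example, not the paper's method): points lying on a line through the origin in 3-D form a 1-D subspace, its basis can be recovered with an SVD, and the reconstruction residual measures how well a point fits that subspace.

```python
import numpy as np

# 50 points lying on a 1-D subspace (a line through the origin) in 3-D
direction = np.array([1.0, 2.0, -1.0])
t = np.linspace(-1, 1, 50)
X = t[:, None] * direction            # shape (50, 3), rank 1

# the top right-singular vector of X spans the best-fit subspace
_, _, Vt = np.linalg.svd(X, full_matrices=False)
basis = Vt[:1].T                      # orthonormal basis, shape (3, 1)

def residual(x, basis):
    """Distance from x to the subspace spanned by `basis`."""
    proj = basis @ (basis.T @ x)      # orthogonal projection onto the subspace
    return np.linalg.norm(x - proj)

on_subspace  = residual(direction, basis)                 # point on the line
off_subspace = residual(np.array([0.0, 0.0, 1.0]), basis) # point off the line
```

A point on the line reconstructs perfectly (residual near zero), while an unrelated point does not; subspace clustering methods exploit exactly this contrast to sort points into their subspaces.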
“One of the challenges of the big-data era is to organize out-of-sample data using a machine learning model based on ‘in-sample’, or known, observational data,” explains Peng, who, with his colleagues, has proposed three methods as part of a unified framework to tackle this issue. The methods differ in the representation learning they employ: one enforces sparsity, while the other two exploit low rank and the grouping effect. “By solving the large-scale data and out-of-sample clustering problems, our method makes big-data clustering and online learning possible,” notes Peng.
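The article does not spell out the three formulations, but as one concrete illustration of a grouping-effect regularizer, the ℓ2-penalized 'self-expression' used in least-squares-style subspace clustering has a closed-form solution: each sample is reconstructed from the other samples, and the coefficient matrix reveals which samples belong together. The toy subspaces and the λ value below are assumptions for the sketch.

```python
import numpy as np

def self_expression_l2(X, lam=0.1):
    """Closed-form C minimising ||X - X C||_F^2 + lam ||C||_F^2.
    Columns of X are samples; C[i, j] weights sample i's contribution
    to reconstructing sample j."""
    n = X.shape[1]
    G = X.T @ X
    return np.linalg.solve(G + lam * np.eye(n), G)

# samples drawn from two independent 1-D subspaces in 3-D (columns = samples)
rng = np.random.default_rng(0)
u, v = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 1.0])
X = np.hstack([np.outer(u, rng.uniform(1, 2, 4)),
               np.outer(v, rng.uniform(1, 2, 4))])   # shape (3, 8)

C = self_expression_l2(X, lam=0.01)
A = np.abs(C) + np.abs(C).T   # symmetric affinity, ready for spectral clustering
```

Samples from the same subspace reconstruct one another with large coefficients, while cross-subspace coefficients vanish, so the affinity matrix is block-diagonal; that block structure is what a downstream spectral clustering step would exploit.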
The framework devised by the team splits input data into ‘in-sample’ and ‘out-of-sample’ data during an initial ‘sampling’ step. Next, the in-sample data is grouped into subspaces during the ‘clustering’ step, after which each out-of-sample data point is assigned to the nearest subspace and designated a member of that cluster.
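The three steps can be sketched as follows. Two assumptions are made for brevity: the in-sample clustering result is stubbed with known labels (in practice any subspace-clustering routine would supply them), and 'nearest subspace' is interpreted as the smallest reconstruction residual against each cluster's SVD basis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 ('sampling'): split the data into in-sample and out-of-sample parts.
u, v = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 1.0])
in_sample  = np.vstack([np.outer(rng.uniform(1, 2, 30), u),
                        np.outer(rng.uniform(1, 2, 30), v)])   # (60, 3)
out_sample = np.vstack([3.0 * u, 3.0 * v])                     # (2, 3)

# Step 2 ('clustering'): group the in-sample data into subspaces.
# The grouping is assumed known here; a subspace-clustering method
# would produce these labels from the data alone.
labels = np.array([0] * 30 + [1] * 30)

# Fit a 1-D orthonormal basis to each cluster with an SVD.
bases = []
for k in (0, 1):
    _, _, Vt = np.linalg.svd(in_sample[labels == k], full_matrices=False)
    bases.append(Vt[:1].T)                                     # (3, 1)

# Step 3: assign each out-of-sample point to the nearest subspace,
# i.e. the one with the smallest reconstruction residual.
def assign(x, bases):
    residuals = [np.linalg.norm(x - B @ (B.T @ x)) for B in bases]
    return int(np.argmin(residuals))

assigned = [assign(x, bases) for x in out_sample]
```

The expensive clustering step runs only on the in-sample portion; each new point is then placed with a handful of projections, which is what makes the scheme attractive for large datasets and online settings.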
The team tested their approach on a range of datasets spanning facial images, handwritten and digital text, poker hands and forest coverage. Their methods outperformed existing algorithms, reducing the computational complexity (and hence running time) of the task while preserving cluster quality.
Learn more: Thinking outside the sample