How Data Cloud Powers Regression and Classification Model Training Across 20 Million Rows

Smita Kulkarni
May 27 - 6 min read

In our “Engineering Energizers” Q&A series, we spotlight the groundbreaking engineers at Salesforce. Today, we feature Smita Kulkarni, who specializes in developing the core systems that drive predictive models across Data Cloud. These systems support thousands of use cases and process data at massive scale.

Dive into how Smita’s team completely overhauled legacy systems, managed intricate data transformations at high volume, and scaled their infrastructure to handle massive datasets with both speed and reliability.

What is your team’s mission as it relates to regression and classification models?

The team remains focused on building and evolving the core infrastructure for regression and binary classification model training. These models are crucial for predicting outcomes from structured data, such as profit margins, customer churn, and the likelihood of closing deals. These insights enable smarter, data-driven decisions across the platform.
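To make the setup concrete, here is a minimal sketch of the two model families in question, using scikit-learn and synthetic data as stand-ins rather than anything from Data Cloud itself:

```python
# Minimal sketch: regression and binary classification on structured data.
# Synthetic data and scikit-learn stand in for the actual Data Cloud pipeline.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))                # structured numeric features
margin = X @ rng.normal(size=5) + 0.1          # continuous target, e.g. profit margin
churned = (X[:, 0] + rng.normal(size=1_000) > 0).astype(int)  # binary target

regressor = LinearRegression().fit(X, margin)      # regression model
classifier = LogisticRegression().fit(X, churned)  # binary classification model

print(regressor.predict(X[:3]))         # predicted margins
print(classifier.predict_proba(X[:3]))  # churn likelihoods per record
```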

The journey began with re-platforming CRMA’s legacy system into a modern, high-performance pipeline designed to handle high-volume, high-cardinality datasets with ease. The pipeline is natively integrated with Data Cloud and delivers predictive insights on vast amounts of customer data to drive effective business outcomes. Sophisticated transformation layers for normalization, bucketing, and clustering prepare the data optimally for model training, and the team also built logic to support both numeric and text-based features, making the system highly versatile and adaptable to various use cases.
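A rough sketch of what such a transformation layer could look like, with scikit-learn standing in for the team’s internal tooling and hypothetical column names:

```python
# Illustrative transformation layer: normalization for numeric features,
# bucketing for skewed numerics, and vectorization for text features.
# Column names are hypothetical; scikit-learn is a stand-in for internal tooling.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.feature_extraction.text import TfidfVectorizer

transform = ColumnTransformer([
    ("normalize", StandardScaler(), ["deal_amount", "account_age_days"]),
    ("bucket", KBinsDiscretizer(n_bins=10, encode="onehot-dense"), ["annual_revenue"]),
    ("text", TfidfVectorizer(max_features=500), "case_notes"),  # single text column
])
# features = transform.fit_transform(training_df)  # training_df: pandas DataFrame
```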

The infrastructure is built with flexibility, scalability, and user-friendliness at its core, especially for users who may not have deep data science expertise. The goal is not only to enable large-scale model training but to make it intuitive, efficient, and reliable for all users on the Salesforce platform. This ensures that everyone can harness the power of data to drive business forward with confidence.

What was the most significant technical challenge that your team faced as it relates to regression and classification models?

The most substantial challenge during the migration from CRMA was rearchitecting the modeling system. This involved reconstructing nearly every aspect of the infrastructure, including job orchestration, transformation logic, query planning, and SQL dialects. CRMA utilized Job Controller, whereas the new system operates on Dynamic Processing Controller (DPC), which demanded a fundamental change in job scheduling, tracking, and retry mechanisms.

One of the primary bottlenecks was ensuring data readiness. While CRMA datasets were narrowly defined, the new system had to handle vast amounts of customer data. This required addressing issues like query service limits, inconsistent data formatting (for example, “CA” versus “California”), and high cardinality. To overcome these challenges, the team developed sophisticated transformation-aware logic to efficiently cluster, bucket, and segment inputs.
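As an illustration of that transformation-aware logic (a sketch under assumed names, not the production code), canonicalizing inconsistent values and capping cardinality could look like this:

```python
import pandas as pd

# Hypothetical canonicalization map for inconsistent formatting ("CA" vs "California").
CANONICAL_STATES = {"CA": "California", "Calif.": "California", "NY": "New York"}

def ready_column(col: pd.Series, max_cardinality: int = 10_000) -> pd.Series:
    """Normalize formatting, then bucket rare values to cap cardinality."""
    col = col.str.strip().replace(CANONICAL_STATES)
    counts = col.value_counts()
    if len(counts) > max_cardinality:
        keep = set(counts.index[:max_cardinality - 1])  # most frequent values
        col = col.where(col.isin(keep), "OTHER")        # long tail becomes OTHER
    return col
```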

API stability posed another significant constraint. Teams like Templates depend on consistent system behavior, so maintaining compatibility was essential throughout the transition. Since none of the original components were reusable, each stage of the pipeline was rebuilt and rigorously validated to meet modern performance and integration standards.

The re-platforming was more than a technical upgrade; it marked a transformation in approach. The team focused on rebuilding trust in the modeling experience, ensuring reliability and performance from the ground up.

Architectural overview of the Salesforce AI stack.

How do you manage challenges related to scalability in regression and classification models?

To ensure scalability, the team continuously enhances the system’s infrastructure, orchestration, and transformation processes. Initially, the system could handle 50 columns and 20 million rows with 5,000 unique values per column, but it has since been scaled to support the limits below (a hypothetical check of these limits is sketched after the list):

  • 300 columns
  • 20 million rows
  • 10,000 unique values per column
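As referenced above, a pre-flight check built around these published limits might look like this; the limit values come from the article, while the function itself is hypothetical:

```python
# Hypothetical pre-flight check enforcing the published training limits.
MAX_COLUMNS = 300
MAX_ROWS = 20_000_000
MAX_UNIQUE_PER_COLUMN = 10_000

def check_training_limits(n_rows: int, n_columns: int,
                          unique_counts: dict[str, int]) -> list[str]:
    """Return limit violations; an empty list means the dataset is admissible."""
    violations = []
    if n_columns > MAX_COLUMNS:
        violations.append(f"{n_columns} columns exceeds {MAX_COLUMNS}")
    if n_rows > MAX_ROWS:
        violations.append(f"{n_rows} rows exceeds {MAX_ROWS}")
    for name, uniques in unique_counts.items():
        if uniques > MAX_UNIQUE_PER_COLUMN:
            violations.append(f"column {name!r} has {uniques} unique values")
    return violations
```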

To achieve this, the query engine was upgraded from Trino to Hyper, which dramatically improved data read speed and query reliability. On the orchestration side, the transition to Dynamic Processing Controller (DPC) enabled higher concurrency and more dynamic job queueing. The team is also updating segmentation logic to partition large datasets into smaller, more memory-efficient training batches.
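The segmentation idea can be sketched as follows; the memory budget and helper are illustrative, not the team’s actual partitioner:

```python
from typing import Iterator
import pandas as pd

def partition_rows(df: pd.DataFrame,
                   max_batch_bytes: int = 512 * 1024**2) -> Iterator[pd.DataFrame]:
    """Yield row slices sized so each training batch fits a memory budget."""
    bytes_per_row = max(df.memory_usage(deep=True).sum() // max(len(df), 1), 1)
    rows_per_batch = max(int(max_batch_bytes // bytes_per_row), 1)
    for start in range(0, len(df), rows_per_batch):
        yield df.iloc[start:start + rows_per_batch]
```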

Introducing multi-DMO joins brought new challenges. These joins can significantly increase row counts, introduce schema variability, and complicate transformation steps. To address these issues, the team developed advanced query planning tools, optimized batching logic, and implemented transformation-aware safeguards throughout the pipeline.
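One such safeguard can be illustrated by computing join fan-out before any rows are materialized (a hypothetical helper, not the team’s query planner):

```python
import pandas as pd

def inner_join_row_count(left: pd.DataFrame, right: pd.DataFrame, key: str) -> int:
    """Compute the output row count of an inner join on `key` without running it.

    Summing products of per-key counts catches fan-out from many-to-many
    joins before the join itself is executed.
    """
    left_counts = left[key].value_counts()
    right_counts = right[key].value_counts()
    return int((left_counts * right_counts).dropna().sum())

# A planner could refuse or re-batch the join when the count is too high:
# if inner_join_row_count(accounts, contacts, "account_id") > MAX_ROWS: ...
```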

Scaling isn’t just about raising limits; it’s about strategically applying structure to ensure the system remains responsive, predictable, and safe, even under heavy load. These enhancements have greatly expanded the capabilities for Data Cloud users who need to train models on massive, high-cardinality datasets.

What improvements are being developed to enhance regression and classification models?

In upcoming releases, we’ll expand the use cases for Model Builder and Data Cloud, focusing on forecasting, optimization, clustering, and sentiment analysis. We’re starting with multi-class classification models, which will help users predict which of 50 categories a record belongs to. To assist non-technical users, we’re adding generative insights that explain the dataset, enhancing their understanding and improving model accuracy.

The new query engine can now process up to 500 million rows of data, and these changes are undergoing rigorous testing before release. We’re also enhancing diagnostic capabilities, providing alerts for issues like weak predictors and class imbalance, along with prescriptive suggestions and automatic retries.
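A toy version of such diagnostics, with illustrative thresholds rather than the product’s real ones, might flag both issues like this:

```python
import numpy as np
import pandas as pd

def diagnose(X: pd.DataFrame, y: pd.Series) -> list[str]:
    """Surface simple pre-training alerts; thresholds here are illustrative."""
    alerts = []
    minority_share = y.value_counts(normalize=True).min()
    if minority_share < 0.05:
        alerts.append(f"class imbalance: minority class is only {minority_share:.1%}")
    for col in X.select_dtypes(include=np.number):
        corr = abs(X[col].corr(y.astype(float)))
        if np.isnan(corr) or corr < 0.01:
            alerts.append(f"weak predictor: {col!r} barely correlates with the target")
    return alerts
```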

On the infrastructure side, the transition to the Dynamic Processing Controller (DPC) is improving job reliability and throughput. We’re also developing features to support modeling across joined data objects, enhancing training inputs.

Our goal is to make it easier for more users to build better models with confidence, reducing guesswork.

How do you gather feedback from users on regression and classification models, and how has it influenced future development?

We gather feedback from internal engineering teams who use our APIs in live production environments. These teams provide valuable, early, and frequent insights into usability issues, model quality gaps, and infrastructure performance.

We also gather feedback from customers. This feedback helps us understand customers’ business requirements and how to build features that address them, and it has a major influence on feature priorities. In addition, dashboards give us real-time data on feature usage and performance.

The platform has shifted its approach from waiting for bug reports to proactively surfacing and addressing issues before users even notice them. This proactive strategy has made modeling more stable, responsive, and ultimately more empowering for the teams that rely on it.

How do you balance the need for regression and classification models’ rapid deployment with maintaining high standards of trust and security?

Balance is maintained through a gated deployment process with required checks at each stage: Development, Testing, Staging, and Production. Feature flags are used to isolate new capabilities during testing and integration, ensuring that unstable or incomplete features do not reach production prematurely.
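A minimal sketch of stage-gated flags, with hypothetical names rather than Salesforce’s internal flag service:

```python
# Hypothetical stage-gated feature flags: a capability is visible only in
# stages up to the one it has been promoted to.
STAGES = ["development", "testing", "staging", "production"]

FEATURE_PROMOTED_TO = {
    "multiclass_training": "staging",    # still being validated
    "hyper_query_engine": "production",  # fully rolled out
}

def is_enabled(feature: str, current_stage: str) -> bool:
    promoted = FEATURE_PROMOTED_TO.get(feature, "development")
    return STAGES.index(current_stage) <= STAGES.index(promoted)

assert is_enabled("hyper_query_engine", "production")
assert not is_enabled("multiclass_training", "production")  # gated out of prod
```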

Security and compliance are integral to the design process. Every API and feature undergoes a rigorous review, including GDPR compliance, threat modeling, and formal approval. Customer data is never logged, and all model job telemetry is scrubbed to meet both internal and external standards.
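Scrubbing job telemetry before it is logged can be sketched with an allow-list (field names are hypothetical):

```python
# Hypothetical telemetry scrubber: keep only operational fields so nothing
# resembling customer data can reach the logs.
ALLOWED_TELEMETRY_FIELDS = {"job_id", "stage", "duration_ms", "row_count", "status"}

def scrub_telemetry(event: dict) -> dict:
    """Return a copy of the event containing only allow-listed fields."""
    return {k: v for k, v in event.items() if k in ALLOWED_TELEMETRY_FIELDS}

event = {"job_id": "a1B", "status": "succeeded", "duration_ms": 5321,
         "sample_rows": [{"email": "user@example.com"}]}  # must never be logged
print(scrub_telemetry(event))  # sample_rows is dropped
```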

The model training engine is shared across internal teams like Templates and Agentforce. To support these dependencies, the platform uses strict versioning, contract enforcement, and comprehensive regression test suites, preventing any disruptions at integration points.
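Contract enforcement at one of those integration points can be illustrated with a small regression test; the response shape here is hypothetical:

```python
# Hypothetical contract test: consumers such as Templates rely on this response
# shape, so the test fails loudly if a field is renamed or removed.
EXPECTED_FIELDS = {"model_id", "status", "metrics", "api_version"}

def test_train_response_contract():
    response = {"model_id": "m-42", "status": "complete",
                "metrics": {"auc": 0.91}, "api_version": "v2"}  # stubbed API call
    assert EXPECTED_FIELDS <= set(response), "breaking change in train response"
    assert response["api_version"] == "v2"

test_train_response_contract()
```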

Trust is a fundamental design constraint. The team ensures that every feature meets the highest standards of trust and reliability. If a feature doesn’t meet these standards, it is not released.
