The financial services sector has always been the leading consumer of compute power and high-speed networking. It has also been a consistent driver of High Performance Computing (HPC) with market and portfolio simulations, high-frequency trading, order execution, and many other examples. Today, increasing regulatory requirements, rising cybercrime concerns, the growing sophistication of consumers, and the wealth of new relevant datasets have placed big data analytics at the center of every financial firm’s strategy.
Open Source: Engine of Disruption
The sea change that many quants and data scientists are witnessing is that the processing power needed for high-end computing tasks is now much more affordable and increasingly available via private cloud infrastructure and public cloud providers. This has emboldened a new generation of entrepreneurs to use emerging Open Source Software (OSS) to challenge financial business as usual with only an algorithm, a business model and a dream.
The explosion of interest in OSS machine learning and big data platforms is also a clear indicator of the nature of the disruption to the old-world order in financial services. Open source platforms such as Apache Spark, TensorFlow and other machine learning platforms are attracting a growing number of developers who want to turn OSS big data innovations into new services. These OSS platforms are designed to scale to handle the huge data volumes that make machine learning and big data analytics come to life. The scale of these projects puts significant pressure on IT operations and DevOps to maximize the efficiencies and performance of their computing resources.
The topic of risk continues to be of critical importance across financial services segments. While there are many forms of risk, the most common across all financial segments involves cybercrime and fraud. There is also a post-financial-crisis regulatory aspect of risk management that forces lenders to know precisely how much capital they need in reserve. Keep too much and you tie up capital unnecessarily, lowering profit. Keep too little and you run afoul of Basel III regulations.
There is great promise in new big data and machine learning technologies to enable lenders to tap into an ever-deepening pool of new data to analyze all aspects of risk and fraud. Identifying risk and fraud can require huge data volumes and large compute clusters which are typical of modern big data systems.
Fraud detection is a classic example of predictive analytics at work. For data scientists, fraud can be determined precisely by building the right scoring model and associating the scoring model with actual business costs. These fraud models identify the rules of what constitutes fraud, and then those models crunch through the relevant data sets to identify the cost of the fraud versus the cost to detect the fraud. Because these models must crunch through large volumes of data, cluster performance of the big data platform is becoming increasingly critical.
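The link between a scoring model and business costs can be sketched as a threshold choice. The following is a minimal illustration, not a production fraud model: the scores, fraud labels, transaction amounts, review cost, and the function names `total_cost` and `best_threshold` are all hypothetical. It assumes that every flagged transaction costs a fixed manual review fee and that every missed fraud costs the full transaction amount.

```python
from typing import List, Sequence


def total_cost(scores: Sequence[float], is_fraud: Sequence[bool],
               amounts: Sequence[float], threshold: float,
               review_cost: float) -> float:
    """Cost of reviewing flagged transactions plus the value of missed fraud."""
    cost = 0.0
    for score, fraud, amount in zip(scores, is_fraud, amounts):
        if score >= threshold:
            cost += review_cost   # flagged: pay for a manual review
        elif fraud:
            cost += amount        # not flagged and fraudulent: lose the amount
    return cost


def best_threshold(scores: Sequence[float], is_fraud: Sequence[bool],
                   amounts: Sequence[float], review_cost: float,
                   candidates: List[float]) -> float:
    """Pick the candidate threshold with the lowest total cost."""
    return min(candidates,
               key=lambda t: total_cost(scores, is_fraud, amounts, t, review_cost))


# Hypothetical data: model scores, ground truth, and transaction amounts.
scores = [0.1, 0.4, 0.8, 0.9, 0.3]
is_fraud = [False, False, True, True, False]
amounts = [50.0, 200.0, 500.0, 1200.0, 75.0]

chosen = best_threshold(scores, is_fraud, amounts,
                        review_cost=25.0, candidates=[0.2, 0.5, 0.95])
print("chosen threshold:", chosen)
```

On real data the candidate thresholds would be swept over millions of scored transactions, which is where cluster performance starts to matter.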
While using advanced algorithms to make informed trades is not new, its widespread adoption and applicability to a broader profile of traders is noteworthy. Today, asset managers, fintech companies and even retail banks are looking to provide richer analytics, daily forecasts, market advisories and recommendations to both industry and consumers. With large scale analytics offerings from TransUnion, S3 Partners, and hundreds of others, financial firms are gearing up to support the explosive growth of automated system trading. As you can see from the diagram, automated trades (red and blue lines) now surpass manual trades (green lines) by a wide margin. This is putting increased stress on cloud and big data infrastructure.
The growth of machine learning and big data platforms has created a compute bottleneck. The main contributing factors to this bottleneck are higher data volumes, larger memory systems, solid state disks (SSDs), flash arrays, and faster networks. Traditional processors have simply not kept pace with the growing scale of I/O.
While no one is predicting the end of many-core CPUs, there is general agreement that the long-term reliance on Moore’s Law for application scaling is coming to an end. The doubling of transistors is slowing dramatically as chips reach physical limits, so compute-hungry firms are looking elsewhere to increase compute density in the data center.
Those charged with running data centers – either your own IT department or a cloud provider – are tasked with delivering high performance BI and analytics at scale. This has resulted in an increased adoption of specialized hardware accelerators such as GPUs, FPGAs and ASICs (like Google’s TPUs) to improve compute density. With public cloud leaders Amazon AWS (F1 Instances) and Microsoft Azure (Configurable Cloud) employing FPGAs, many financial services firms are following suit in their own data centers.
While FPGAs have played a specialized role in financial services for use cases like high frequency trading, the growing emphasis on predictive analytics in fraud detection and machine learning in trading and pricing systems makes the flexibility and reprogrammability of FPGAs very interesting to financial firms. The traditional objections to using FPGAs – difficult to program, skills shortage – are addressed head-on by the Bigstream Hyper-acceleration Layer. Bigstream combines software acceleration technology with hardware based accelerators in a seamless solution that provides 3x-30x acceleration of fast data workloads.
Moving from Big Data to Fast Data
Financial firms are experiencing the same exponential increase in data volumes as everyone else. They are also the first to understand that bringing the computing power to process fast data at scale is proving costly and unpredictable. This is one of the key reasons for the growing adoption of hardware accelerators like GPUs and FPGAs. Hardware accelerators increase processing density and can tackle the increased emphasis on distributed processing that is characteristic of big data and machine learning. Clearly, doing nothing is not an option: if inaction can lead to disruption, then taking action can bring about some exciting business outcomes.
Faster Trades and Insights
In areas like algorithmic trading, complex derivative or option pricing, back testing and other compute-intensive big data workloads, bringing more processing power to bear can materially affect returns and fund
Empowered Quants and Data Scientists
Simply put, quants and data scientists want to get there faster. They are often looking for any performance edge that can enable them to iterate faster on their machine learning models, and to lower latencies for processing tick data and other incoming fast data streams. Being able to move daily model testing to an intra-day schedule means more accurate models in less time.
Maximize IT Infrastructure ROI
The other side of the performance equation is a straight cost argument. Whether you are talking public cloud costs or data center costs like real estate, power, and HVAC, a compute power efficiency gain of even 20% can add up to millions in savings, or a significant reduction in project backlog. Then, consider that Bigstream can deliver 200% to 1000% acceleration and the ROI numbers become really interesting.
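The cost argument can be made concrete with a little arithmetic. The sketch below rests on two assumptions that are not in the source: a hypothetical annual compute budget of $10 million, and a fixed workload whose cost scales directly with compute time, so that running `speedup` times faster costs 1/`speedup` as much. The function name `annual_savings` is illustrative only.

```python
def annual_savings(annual_spend: float, speedup: float) -> float:
    """Savings for a fixed workload that runs `speedup` times faster,
    assuming cost is proportional to compute time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return annual_spend * (1.0 - 1.0 / speedup)


spend = 10_000_000.0  # hypothetical annual compute budget in dollars

# 1.2x corresponds to a 20% throughput gain; 2x and 10x are larger speedups.
for s in (1.2, 2.0, 10.0):
    print(f"{s:>4}x speedup -> ${annual_savings(spend, s):,.0f} saved per year")
```

Under these assumptions the savings flatten as the speedup grows: going from 1x to 2x halves the bill, while going from 2x to 10x removes only a further 40% of the original spend. The alternative is to hold spend constant and apply the extra throughput to the project backlog.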
Simplifying Cloud Scaling
Different workloads have different scaling characteristics, but all workloads have this in common: if each individual cluster node is not performing optimally, then the entire cluster is not scaling to its full potential. Data driven financial firms routinely fire up 100-500+ node clusters to power market simulations and customer behavior analysis in both private and public clouds. IT Ops teams would like to have a predictable scaling model so that they get the maximum value from their processing time.
Scaling the Benefits of Data Science and Machine Learning
Because of the heightened need for data science, data engineering and big data development talent, financial services companies must think of scaling in terms of the amount of data, the amount of processing, and the number of end users that can gain benefit from these large data sets. Given the widely reported shortage of data scientists and machine learning specialists, the open source community has repeatedly turned to SQL as a key enabler to scaling the value of data and analytics.
According to the Spark Survey 2016, Spark SQL and DataFrames are the fastest growing components used in production, and there are myriad SQL and SQL-like access methods. Virtually every data scientist uses SQL in their work because it is a foundational tool to access data found in databases, and virtually every application we use is a database application.
SQL access to data – wherever it lives – is a key part of scaling the value of data, the value of your data scientists, and the monetization of your business.
Key SQL Operations for Data Scientists and Quants
Every data scientist should know how to model one-to-one, one-to-many, and many-to-many relationships.
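As a small illustration of the three relationship types, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The schema, table names, and sample rows are hypothetical and chosen only for the example; a production schema would differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- one-to-one: the UNIQUE foreign key allows at most one profile per customer
CREATE TABLE risk_profiles (
    customer_id INTEGER UNIQUE REFERENCES customers(id),
    rating TEXT NOT NULL
);

-- one-to-many: one customer places many trades
CREATE TABLE trades (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    symbol TEXT NOT NULL
);

-- many-to-many: customers share accounts through a junction table
CREATE TABLE accounts (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE account_holders (
    account_id INTEGER REFERENCES accounts(id),
    customer_id INTEGER REFERENCES customers(id),
    PRIMARY KEY (account_id, customer_id)
);

INSERT INTO customers VALUES (1, 'Ada'), (2, 'Ben');
INSERT INTO risk_profiles VALUES (1, 'low'), (2, 'high');
INSERT INTO trades VALUES (1, 1, 'AAA'), (2, 1, 'BBB'), (3, 2, 'AAA');
INSERT INTO accounts VALUES (10, 'joint'), (11, 'solo');
INSERT INTO account_holders VALUES (10, 1), (10, 2), (11, 1);
""")

# Resolve the many-to-many relationship with a join through the junction table.
holders_per_account = conn.execute("""
    SELECT a.label, COUNT(h.customer_id)
    FROM accounts AS a
    JOIN account_holders AS h ON h.account_id = a.id
    GROUP BY a.label
    ORDER BY a.label
""").fetchall()
print(holders_per_account)
```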
Data analysis is all about aggregations. Aggregation functions are very useful for understanding the data, and to present its summarized picture.
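A minimal aggregation example, again using Python's built-in `sqlite3` module: the `trades` table and its rows are hypothetical. The query summarizes each symbol with a trade count, total quantity, and volume-weighted average price.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, qty INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [("AAA", 100, 10.0), ("AAA", 50, 12.0), ("BBB", 200, 5.0)],  # made-up rows
)

# One summary row per symbol: count, total quantity, volume-weighted price.
summary = conn.execute("""
    SELECT symbol,
           COUNT(*)                    AS n_trades,
           SUM(qty)                    AS total_qty,
           SUM(qty * price) / SUM(qty) AS vwap
    FROM trades
    GROUP BY symbol
    ORDER BY symbol
""").fetchall()

for row in summary:
    print(row)
```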
Window functions are some of the most powerful functions within SQL. They unlock the ability to calculate moving averages, cumulative sums, and much more.
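A cumulative sum is one of the simplest uses of a SQL window function. The sketch below runs one through Python's built-in `sqlite3` module; it assumes the bundled SQLite library is version 3.25 or later, which is when window functions were added. The `daily_pnl` table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_pnl (day TEXT, pnl REAL)")
conn.executemany(
    "INSERT INTO daily_pnl VALUES (?, ?)",
    [("2024-01-02", 100.0), ("2024-01-03", -40.0), ("2024-01-04", 25.0)],
)

# SUM(...) OVER (ORDER BY day) accumulates pnl from the first day to the current row.
running = conn.execute("""
    SELECT day,
           pnl,
           SUM(pnl) OVER (ORDER BY day) AS cumulative_pnl
    FROM daily_pnl
    ORDER BY day
""").fetchall()

for row in running:
    print(row)
```

Unlike a `GROUP BY` aggregate, the window form keeps every input row and adds the running total alongside it.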
While many turn to scripting languages for text mining, SQL has powerful built-in capabilities that can benefit from acceleration technologies.
Developing 1-day, 5-day, 30-day moving averages based on daily close data is a common machine learning task for investment houses and hedge funds.
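A trailing moving average over daily closes can be sketched in a few lines of plain Python. The prices below are made up, and the function name `moving_average` is illustrative. This version emits `None` until a full window of data is available, which is a design choice rather than a standard.

```python
from collections import deque
from typing import Deque, List, Optional


def moving_average(closes: List[float], window: int) -> List[Optional[float]]:
    """Trailing moving average; None until `window` closes have been seen."""
    if window < 1:
        raise ValueError("window must be at least 1")
    buf: Deque[float] = deque()
    total = 0.0
    out: List[Optional[float]] = []
    for close in closes:
        buf.append(close)
        total += close
        if len(buf) > window:
            total -= buf.popleft()   # drop the close that left the window
        out.append(total / window if len(buf) == window else None)
    return out


closes = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]  # hypothetical daily closes
ma1 = moving_average(closes, 1)    # 1-day average is the close itself
ma5 = moving_average(closes, 5)
ma30 = moving_average(closes, 30)  # not enough history yet: all None
print(ma5)
```

In SQL the same calculation is usually written as a window function, for example `AVG(close) OVER (ORDER BY day ROWS BETWEEN 4 PRECEDING AND CURRENT ROW)` for a 5-day average. That form returns partial averages for the first few rows instead of `None`.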
In addition to SQL, many financial firms need to incorporate algorithmic constructs in their analytics workloads through open source or proprietary libraries. Bigstream Hyper-acceleration is designed to accelerate a wide variety of workloads, including Spark SQL, Spark DataFrames/Datasets, and User Defined Functions (UDFs).
Bigstream Hyper-acceleration Layer is a software solution that resides in the big data infrastructure where Apache Spark and other big data platforms live. Bigstream uses advanced compiler technology to provide native scaling of Spark workloads. Bigstream also offers automatic programming for FPGAs to provide frictionless acceleration of big data workloads. This is done without impacting software developers because no application code changes are required, and no FPGA programming skills are necessary.
In a typical big data analytics pipeline, Bigstream can accelerate data ingest, data discovery, data parsing, ETL transformations, SQL analytics, data compression and decompression, User Defined Functions, and numerous other processes where Spark SQL is used.
Unlike other hardware-specific acceleration products, Bigstream Hyper-acceleration provides platform-level acceleration, requiring no special APIs, code rewrites, or application redesign. This is accomplished by using an intelligent and adaptive combination of acceleration techniques such as zero-copy, in-line code optimizations, locality tuning, vectorization, and native compilation of Spark functions and UDFs.
Bigstream Hyper-acceleration can help you accomplish the following:
Improved fraud detection is achieved through more precise models – enabling data science teams to use data to understand customer behaviors and to predict future behavior. How are more precise models achieved? By continuously tweaking and iterating on these models until they display optimum performance. If one iteration of model development can be accelerated by 2X, 5X or 10X, then fraud models improve and more fraud is detected or prevented.
Greater analytical throughput from being able to mine data from diverse sources and get that data organized into a structure that is meaningful to business users. The ability to bring the full power of many-core CPUs and FPGAs together without burdening the developer saves computing time, streamlines DevOps, and reduces uncertainty at the customer site. Questions like “how can I take full advantage of FPGAs, CPUs and Spark to advance my business goals?” now have an answer: deploy Bigstream Hyper-acceleration.
Bigstream TPC-DS benchmark results show Bigstream Hyper-accelerated Spark performing, on average, almost 3X faster than Apache Spark on the same hardware. Early benchmark results using FPGAs suggest that 10X acceleration and beyond is possible.
Modernize Data Warehousing and BI. Ivory tower systems that are the playground of a chosen few are rapidly fading. The race is on to get more intelligence to more end users faster. Boosting performance of everything that makes up a modern big data warehouse is critical to achieving that goal.
Curb out of control big data infrastructure costs. While public and private clouds have undoubtedly made deploying and scaling new applications easier, it is a foregone conclusion that this level of abstraction doesn’t usually yield the best performance, which translates quite directly into OPEX costs, whether paid to Amazon, Microsoft, or internal IT.
These are just a few examples of how big data is changing the financial services landscape. Entrepreneurs in FinTech and data driven development teams are increasingly relying on new generation open source platforms like Spark to redefine financial analytics. The Bigstream Hyper-acceleration layer provides a frictionless method for these professionals to achieve super-computing performance on the new generation of big data infrastructure.
The Economist on Moore’s Law
For more information on how Bigstream Hyper-acceleration works, read the Bigstream whitepaper