Open Data Science Conference (ODSC West)
Virtual and In-Person | October 27th – 30th, 2020
Natural Language Processing Track
Learn the latest models, advancements, and trends from the top practitioners and researchers behind NLP
Conference Website: https://live.odsc.com/
AGENDA
Thursday – 10/29/2020
09:00 AM – 10:30 AM – ODSC Keynotes
10:30 AM – 5:30 PM – ODSC Hands-on Trainings and Workshops
10:00 AM – 4:30 PM – Partner Demo Talks
10:30 AM – 5:00 PM – Breakout Talk Sessions
09:30 AM – 4:30 PM – Applied AI Free Virtual Event
12:00 PM – 2:00 PM – Women Ignite Session
1:00 PM – 1:45 PM – Virtual Networking Event
4:00 PM – 5:30 PM – AI Investors Reverse Pitch
3:30 PM – 4:30 PM – Meet the Expert
Friday – 10/30/2020
09:00 AM – 10:30 AM – ODSC Keynotes
10:30 AM – 5:30 PM – ODSC Hands-on Trainings and Workshops
10:30 AM – 5:00 PM – Breakout Talk Sessions
10:30 AM – 5:00 PM – Career Mentor Talks
11:30 AM – 12:00 PM – Meet the Speaker
4:00 PM – 5:30 PM – Learning from Failure
Are We Ready for the Era of Analytics Heterogeneity? Maybe… but the Data Says No
(PDT)
Marinela Profi | Global Strategist AI & Model Management | Data Science Evangelist | SAS | WOMEN TECH NETWORK
Type: Keynote
Keynote Session – Suchi Saria
(PDT)
Suchi Saria, PhD | Director, Machine Learning & Healthcare Lab | Johns Hopkins University
Type: Keynote
A Secure Collaborative Learning Platform
Raluca Ada Popa, PhD | Assistant Professor | Co-Founder | Berkeley | PreVeil
Type: Keynote
OCTOBER 29TH
Data for Good: Ensuring the Responsible Use of Data to Benefit Society
(PDT)
Jeannette M. Wing, PhD | Avanessians Director of the Data Science Institute and Professor of Computer Science | Columbia University
- Causal inference – estimating effects
- Over- and under-estimation with instrumental variables
- Confounders: model assigned causes – over- and under-estimation
- De-confounder: estimate substitute confounders – over- and under-estimation
- Convolutional neural network models
- Economics: monopsony, robo-advising
- History: topic modeling with NLP
- Trustworthy Computing vs Trustworthy AI: safety, fairness, robustness
- Classifiers: fair/unfair – make them more robust to a class of distributions
- Image recognition systems: DeepXplore – semantic perturbation
- DP and ML: PixelDP – STOP sign vs Yield sign
- Healthcare @ Columbia University: 600 million EHR records
- The Medical Deconfounder: treatment effects on A1c (type 2 diabetes)
Type: Keynote, Level: All Levels, Focus Area: AI for Good, Machine Learning
Keynote Session – Ben Taylor
(PDT)
Ben Taylor, PhD | Chief AI Evangelist | DataRobot
- Convolutional NN – clustering of countries: Latin America, Asia
- Storytelling
- Acceleration:
- GPT-3 from OpenAI – Q&A, Translation, grammar
- Image GPT
- Can AI Predict
Type: Keynote, Level: All Levels, Focus Area: Data Science Track
Applying AI to Real World Use Cases
(PDT)
John Montgomery | Corporate Vice President, Program Management, AI Platform | Microsoft
Type: Keynote
- Machine comprehension
- Massive ML models: vision model – ResNet
- Alternative to Azure: OpenAI (a Microsoft partner) released GPT-3 (175B parameters)
- Azure ML: create models, operationalize models, build models responsibly
- Model interpretability – data science, government regulation: feature-importance dashboard
- USE CASES
- Building accurate models
- Little Caesars Pizza: “Hot-N-Ready” – demand forecasting of pizza supply by combination of ingredients
Predict quantity X with AutoML
- Deploy and manage many models: MMM Accelerator: ten models at AGL – Australian renewable energy
Model for Responsible ML: fairness & interpretability
- EY – bank denies a loan
- Mitigation of detected bias for men and women in loan applications
Loan approval
- Explanation dashboard – aggregate model: top feature in loan approval: education level
- Fairness – performance disparity for accuracy: disparity in predictions by gender
ML is part of the Azure platform
Bonsai – reinforcement learning: simulation scenarios
AutoML – for when you know the standard algorithms vs when you do not
TALKS on 10/29/2020
NLP
(PDT)
Tian Zheng, PhD | Chair, Department of Statistics | Associate Director | Columbia University | Data Science Institute
Type: Track Keynote, Level: Intermediate, Focus Area: NLP
- Stochastic variational inference
- Case-control likelihood approximation
- Sampling node system
TEXT
- LDA – Latent Dirichlet Allocation (see the topic-model sketch after these notes)
Probability distribution over the vocabulary of words: topic assignment
LINKS
- MMSB – Mixed Membership Stochastic Blockmodel
Detect communities in networks
blockmodel – profile of social interaction in different nodes
- LMV – Pairwise-Link-LDA – same topic proportions have equal % for citing
Pairwise-Link-LDA
- Draw topic
- Draw Beta
- For each document
- For each document pair
Variational inference – fully factored model
- article visibility
Stochastic variational inference
- local (specific to each node) & global (across nodes)
- At each iteration a minibatch of nodes
Sampling document pairs
- Stratified sampling scheme – shorter link
- Informative set sampling [informative vs non-informative sets]
- this scheme – mean estimation problem: inclusion probability: all links are included
- Stochastic gradient updates for global parameters
- Comparison with alternative Approaches
- LDA + Regression
- Relational topic model
- Pairwise-Link-LDA combine LDA and MMB [Same priors]
- Predictive ranks (random guessing) and runtimes (compact, distinct, no overlap)
- evaluate model fit: average predictive rank of held-out documents – Top articles
Cora dataset
LMVS – better predictive performance than
KDD Dataset
Citation trends in HEP: Relevance of Topics vs Visibility
Article recommendation by Rank Topic Proportions
Visibility as a topic-adjusted measure
More recent are more visible
Citation is not a strong indicator for visibility
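As a minimal illustration of the LDA building block referenced in these notes (not the speaker's Pairwise-Link-LDA model or its stochastic variational inference), the sketch below fits scikit-learn's LatentDirichletAllocation on a tiny invented corpus; the documents and topic count are assumptions made purely for demonstration.

```python
# Toy LDA: learn topic-word distributions and per-document topic proportions
# from a tiny invented corpus. Only the text (LDA) component is shown here;
# the link/MMSB side of the talk would need a dedicated implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "neural networks for image classification",
    "topic models for citation networks",
    "variational inference scales topic models",
    "convolutional networks classify images",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)                 # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                  # topic proportions per document

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-3:][::-1]]
    print(f"topic {k}: {top_words}")
print(doc_topics.round(2))
```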
Making Deep Learning Efficient
(PDT)
Kurt Keutzer, PhD | Professor, Co-founder, Investor | UC Berkeley, DeepScale
Type: Track Keynote
- ML – SubSets
- Deep Learning – TRAINING for classification – neural nets – LeNet vs AlexNet – 7 layers, 140x FLOPs – using parallelism
- Shallow learning – deterministic, linear classifiers used
- ML algorithms: Core ML, audio analysis (speech and audio recognition), multimedia
- NLP: translation,
- McKinsey & Co. – AI as a Service (AIaaS)
PROBLEMS to Solve
Image Classification
- Object Detection
- Semantic Segmentation
- Convolutional NN
Audio Enhancement at BabbleLabs
Video Sentiment Analysis – Recommendations to Watch or to search
Natural Language Processing & Speech
- Translation
- Document understanding
- Question answering
- general language understanding evaluation (GLUE)
BerkeleyDeepDrive (BDD)
BERT – Transformer – 7 seconds per sentence
- BERT-base
- Q-BERT
- Transformer
Computational Patterns of Deep NN (DNN) – TRAINING required for DNN
CLOUD PLATFORMS
- GRADIENT DESCENT (GD)
- Stochastic GRADIENT DESCENT (SGD)
Recommendation Models – DNN – Parallelism
- Facebook – 80% is recommendation = Advertisement
- No sharing of data by collectors: Alibaba, Facebook, Twitter
Considerations
- Latency – NETWORK WIFI
- Energy
- Computation power
- Privacy
- Quantization: fewer memory accesses (see the quantization sketch after these notes)
- Lower precision implies higher
- Flat loss landscape – precision layer by layer
- Move computation to the EDGE
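The quantization bullet above (fewer memory accesses, precision layer by layer) can be illustrated with PyTorch's post-training dynamic quantization; the tiny two-layer model below is a stand-in, not the BERT/Q-BERT setup from the talk, and the int8 choice is simply the library default.

```python
# Post-training dynamic quantization: Linear weights are stored as int8,
# shrinking the model and cutting memory traffic at inference time.
import os
import torch
import torch.nn as nn

model = nn.Sequential(            # stand-in model, not BERT
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(model(x).shape, quantized(x).shape)   # same interface, lower-precision weights

def size_mb(m, path="tmp_model.pt"):
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(f"fp32: {size_mb(model):.2f} MB, int8: {size_mb(quantized):.2f} MB")
```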
Language Complexity and Volatility in Financial Markets: Using NLP to Further our Understanding of Information Processing
(PDT)
Type: Track Keynote, Level: All Levels, Focus Area: NLP
Intelligibility Throughout the Machine Learning Life Cycle
(PDT)
Jenn Wortman Vaughan, PhD | Senior Principal Researcher | Microsoft Research
Type: Talk, Level: Beginner-Intermediate, Focus Area: Machine Learning
- A Human-centered Agenda for Intelligibility
- Beyond the model: Data, objectives, performance metrics
- context of relevant stakeholders
- Properties of system design vs Properties of Human behavior
Learning with Limited Labels
(PDT)
Type: Talk, Level: Intermediate-Advanced, Focus Area: Deep Learning, Research frontiers
How AI is Changing the Shopping Experience
(PDT)
Sveta Kostinsky | Director of Sales Engineering | Samasource
Marcelo Benedetti | Senior Account Executive | Samasource
Type: Talk, Level: Intermediate, Focus Area: Machine Learning, Deep Learning
- quality rubric
- Internal QA Sampling
- Client QA Sampling
- Auto QA
Transfer Learning in NLP
(PDT)
Joan Xiao, PhD | Principal Data Scientist | Linc Global
Type: Talk, Level: Intermediate, Focus Area: NLP, Deep Learning
Transfer learning enables leveraging knowledge acquired from related data to improve performance on a target task. The advancement of deep learning and large amounts of labelled data such as ImageNet have made high-performing pre-trained computer vision models possible. Transfer learning, in particular fine-tuning a pre-trained model on a target task, has become far more common practice than training from scratch in computer vision.
In NLP, starting from 2018, thanks to the various large language models (ULMFiT, OpenAI GPT, the BERT family, etc.) pre-trained on large corpora, transfer learning has become a new paradigm, and new state-of-the-art results have been achieved on many NLP tasks.
In this session we’ll learn the different types of transfer learning, the architecture of these pre-trained language models, and how different transfer learning techniques can be used to solve various NLP tasks. We’ll also show a variety of problems that can be solved using these language models and transfer learning.
- Transfer learning: computer vision – ImageNet classification
- ResNet, GoogLeNet, ILSVRC – VGG, ILSVRC’12 – AlexNet
- Feature extractor vs fine-tune (see the sketch after these notes)
- Transfer learning: NLP
- Text-to-Text Transfer Transformer (T5)
- Word embeddings: no context is taken into account – Word2vec, GloVe
- ELMo – embeddings from language models: contextual
- BERT – Bidirectional Encoder Representations from Transformers
- MLM – Masked Language Model: forward, backward, masked
- Next Sentence Prediction
- Achieved SOTA on 11 tasks: GLUE, SQuAD 1.1
- Prediction models:
- Input
- Label – IsNext vs NotNext
GLUE Test score
BERT BASE vs BERT LARGE
- Featured-based approach
BERT variants – TinyBERT, ALBERT, RoBERTa, DistilBERT
Multi-lingual BERT, BERT other languages
A Primer in BERTology: How BERT Works
OpenAI built a text generator – too dangerous to release
OpenAI GPT-3 – trained on 300B tokens – THREE settings:
- Zero-shot – English to French – no training
- One-shot
- Few-shot – the GOAL – GPT-3
- GPT-3 is large-scale NLP
Examples – feature extraction
- English to SQL
- English to CSS
- English to LaTeX
Semantic textual similarity
NL inference
ULMFiT – Fine tuning – the larger the # of Training examples – the better the performance
- LM pre-training – start from scratch: BART, Big Bird, ELECTRA, Longformer
- LM fine-tuning
- Classifier fine-tuning
Data augmentation
Contextual Augmentation
- Original sentence
- masked
- augmented
Text generation
- boolean questions
- from structured data, i.e., RDF – Resource Description Framework
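A minimal sketch of the "feature extractor" flavour of transfer learning contrasted in these notes, using the Hugging Face Transformers library listed under TOOLS; the checkpoint name and sentences are illustrative assumptions, and fine-tuning would additionally unfreeze the encoder and train it on labelled target-task data.

```python
# Feature extraction with a frozen pre-trained BERT: sentences become fixed-size
# vectors that a small downstream classifier can consume. Fine-tuning, by
# contrast, would update the encoder weights end to end on the target task.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()                                   # weights stay frozen here

sentences = [
    "Transfer learning reuses knowledge from pre-training.",
    "Fine-tuning adapts the whole network to the target task.",
]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = encoder(**batch)

# Mean-pool token embeddings into one 768-d vector per sentence (BERT-base).
features = out.last_hidden_state.mean(dim=1)
print(features.shape)
```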
OCTOBER 30TH
Generalized Deep Reinforcement Learning for Solving Combinatorial Optimization Problems
(PDT)
Azalia Mirhoseini, PhD | Senior Research Scientist | Google Brain
Type: Keynote
- Learning Based Approaches vs branch & Bound, Hill climbing, ILP
- scale on distributed platforms
- Device Placement – too big to fit – PARTITION among multiple devices – evaluate run time per alternative placements
- Learn Placement on NMT – Profiling Placement on NMT
- CPU + layers encoder and decoders – overhead tradeoffs – parallelization for work balancing
- RL-based placement vs Expert placement
- Memory copying task
- Generalization to be achieved for device placement architecture
- Embeddings that transfer knowledge across graphs
- Graph partitioning: normalized cuts objective: volume, cuts (see the normalized-cut sketch after these notes)
- Learning-based approach: train a NN on the nodes of the graph to assign the probability of a node belonging to a given partition
- Continuous relaxation of normalized cuts
- Optimize expected normalized cuts
- Generalized Graph Partitioning Framework
- Placement optimization using AGENTS to place the nodes
- Train policy to be used for placement of ALL chips
- Compiling a Dataset of Chip Placements
- Policy/Value Model Architecture to save wire length used
- RISC-V: Placement Visualization: Training from Scratch (Human) 6-8 weeks vs Pre-Trained 24 hours
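As a hand-rolled illustration of the normalized-cuts objective these notes mention (not the learned partitioner or its continuous relaxation from the talk), the value for a fixed two-way partition can be computed directly from the cut size and the volumes of the two parts; the example graph and partition are invented.

```python
# Normalized cut of a two-way partition: cut(S, T)/vol(S) + cut(S, T)/vol(T).
# A learning-based partitioner would instead train a network to output, per
# node, the probability of each part and minimize a relaxation of this value.
import networkx as nx

G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (2, 0),      # one tight cluster
                  (3, 4), (4, 5), (5, 3),      # another tight cluster
                  (2, 3)])                     # a single bridging edge

S, T = {0, 1, 2}, {3, 4, 5}

cut = sum(1 for u, v in G.edges if (u in S) != (v in S))
vol_S = sum(deg for _, deg in G.degree(S))
vol_T = sum(deg for _, deg in G.degree(T))

ncut = cut / vol_S + cut / vol_T
print(f"cut={cut}, vol(S)={vol_S}, vol(T)={vol_T}, normalized cut={ncut:.3f}")
```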
Keynote Session – Zoubin Ghahramani
(PDT)
Zoubin Ghahramani, PhD | Distinguished Scientist and Sr Research Director | Professor of Information Engineering | ex-Chief Scientist and VP of AI | Google | University of Cambridge | Uber
Type: Keynote
- Data – models – predictions – decisions – understanding
- AI & Games
- AI + ML
- Deep Learning! (DL)
- NN – tunable nonlinear functions with many parameters
- Parameters are weights of NN
- Optimization + Statistics
- DL – re-branding of NN
- Many layers – ReLUs, attention
- Cloud resources
- SW – TensorFlow, JAX
- Industry investment in DL
DL – very successful
- non-parametric statistics
- use huge data – simulated data
- automatic differentiation
- stay close to identity – keeps models deep: ReLUs, LSTMs, GRUs, ResNets
- Symmetry, parameter tying
Limitations of DL
- data hungry
- adversarial examples
- black-boxes – difficult to trust
- uncertainty – not easily incorporated
Beyond DL
- ML as Probabilistic Modeling: Data observed from a system
- uncertainty
- inverse probability
- Bayes rule: priors from measured quantities, inference for the posterior (see the Bayes-rule sketch after these notes)
- learning and predicting can be seen as forms of inference – likelihood
- approximations from estimation of Likelihoods
- Learning
- Prediction
- Model Comparison
- Sum rule: Product rule
Why do probabilities matter in AI and DS?
- Complexity control and structure learning
- exploration-exploitation trade-offs
- Building in prior knowledge, algorithms for small and large data sets
- BDL – Bayesian DL
- Gaussian Processes – linear and logistic regression, SVMs
- BDL – Bayesian NN / GP hybrids
- Deep Sum-Product Networks – discriminative programming
Probabilistic Programming Languages
Languages: Tensors, Turing,
Automatic Statistician –
- model discovery from data and explain the results
Probabilistic ML
- Learn from data, decision theory, probabilistic AI, BDL, probabilistic programming
Zoubin Ghahramani (2015), “Probabilistic machine learning and artificial intelligence,” Nature 521: 452–459
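As a toy numeric illustration of the "learning and prediction as inference" point above (Bayes rule: prior plus likelihood gives the posterior), a conjugate Beta-Bernoulli update fits in a few lines; the observations and prior are invented for the example.

```python
# Bayes rule on a coin: Beta(a, b) prior + Bernoulli likelihood -> Beta posterior.
# Learning = updating the posterior; prediction = the posterior predictive.
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])      # invented observations (1 = heads)
a, b = 1.0, 1.0                                 # uniform Beta(1, 1) prior

heads = data.sum()
tails = len(data) - heads
a_post, b_post = a + heads, b + tails           # conjugate update

p_next_heads = a_post / (a_post + b_post)       # posterior predictive P(next = 1)
print(f"posterior Beta({a_post:.0f}, {b_post:.0f}); P(next toss = heads) = {p_next_heads:.3f}")
```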
The Future of Computing is Distributed
Fri, October 30, 10:00 AM
Type: Keynote
- 1970 – ARPAnet – distributed
- 1980 – High-performance computing (HPC)
- 1990 – WEB – Amazon
- 2000 – Big data – Google
Distributed computing – Few courses at universities
- Rise of deep learning (DL)
- Application becomes AI centered: Healthcare, FIN, Manufacturing
- Moore’s law is dead: memory and processors
- Specialized hardware: CPU, GPU, TPU
- Memory dwarfed by demand
- Memory: Turing project 17B
- GPT-2 8.3B
- GPT-1
- Micro-services: Clusters of clouds – integrating with distributed workloads
- AI is overlapping with HPC
- AI and Big Data
AI Applications
- MPI,
- Stitching several existing systems
RAY, RISELab @ Berkeley – universal framework for distributed computing (Python and Java) across different libraries
- Asynchronous execution enables parallelism
- Function -> Task (API) (see the Ray sketch after these notes)
- Object ID – every task scheduled
- Library Ecosystem – Native Libraries 3rd Party Libraries
- Amazon and AZURE SPARK, MARS (Tensor)
ADOPTIONS
- Number of contributors increasing fast (N≈300)
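The "function -> task" note above is Ray's core API; a minimal sketch (assuming Ray is installed and run on a single local machine) looks like the following.

```python
# Minimal Ray example: a decorated function becomes a remote task; .remote()
# returns an object reference immediately, and tasks execute in parallel.
import ray

ray.init()                                     # start a local Ray runtime

@ray.remote
def square(x):
    return x * x

refs = [square.remote(i) for i in range(8)]    # scheduled asynchronously
print(ray.get(refs))                           # block until all results are ready

ray.shutdown()
```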
TALKS on 10/30/2020
Advances and Frontiers in Auto AI & Machine Learning – Lisa Amini
- Fri, Oct 30, 2020 1:30 PM – 2:15 PM EDT
- Auto AI – holistic approach
- Auto ML – Models: Feature creation, modeling, training & testing
AI AUTOmation for Enterprise
- Feature preprocessor -> feature transformer -> feature selector -> estimator (see the pipeline sketch after these notes)
- Joint optimization problem
- Method selection
- Hyperparameter optimization
- Black-box constraints
- Bias Mitigation Algorithms
- Pre-processing algo
- In-processing Algo
- Post-processing algo
- Automation for Data – READINESS for ML
- relational data –
- knowledge augmentation
- Data readiness reporting
- Labeling Automation: Enhance
Knowledge augmentation – Federated Learning
- External data sources
- existing data
- documents containing domain knowledge
- Automatically augmenting data with knowledge: feature-concept mapping
Modeling
- Time Series Forecasting
AI to decision Optimization
- Demand forecasting from standard AutoAI by ADDING historical decisions and historical business impact -> reinforcement learning – model created automatically from the past and AutoAI
Validation
- Meta-learning for performance prediction
- Train the META data
- Score production data with AI
Deployment
- staged deployment with contextual bandits
Monitoring
- Performance prediction meta model applied over windows of production traffic
INNOVATIONS:
- End-to-end AI life cycle
- Expanding scope of automation: domain knowledge and decision optimization
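A scaled-down sketch of the "preprocessor -> transformer -> selector -> estimator" pipeline and its joint hyperparameter optimization, using scikit-learn rather than the enterprise AutoAI tooling covered in the talk; the dataset and search grid are illustrative assumptions.

```python
# Joint optimization over a small pipeline: preprocessing, feature selection,
# and estimator hyperparameters are searched together, AutoML-style.
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),                     # feature preprocessor
    ("select", SelectKBest(score_func=f_classif)),   # feature selector
    ("clf", LogisticRegression(max_iter=5000)),      # estimator
])

param_grid = {
    "select__k": [5, 10, 20],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```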
The State of Serverless and Applications to AI
(PDT)
Joe Hellerstein, PhD | Chief Strategy Officer, Professor of Computer Science | Trifacta, Berkeley
The Cloud and practical AI have evolved hand-in-hand over the last decade. Looking forward to the next decade, both of these technologies are moving toward increased democratization, enabling the broad majority of developers to gain access to the technology.
Serverless computing is a relatively new abstraction for democratizing the task of programming the cloud at scale. In this talk I will discuss the limitations of first-generation serverless computing from the major cloud vendors, and ongoing research at Berkeley’s RISELab to push forward toward “stateful” serverless computing. In addition to system infrastructure, I will discuss and demonstrate applications including data science, model serving for machine learning, and cloud-bursted computing for robotics.
Bio:
Joseph M. Hellerstein is the Jim Gray Professor of Computer Science at the University of California, Berkeley, whose work focuses on data-centric systems and the way they drive computing. He is an ACM Fellow, an Alfred P. Sloan Research Fellow and the recipient of three ACM-SIGMOD “Test of Time” awards for his research. Fortune Magazine has included him in their list of 50 smartest people in technology, and MIT’s Technology Review magazine included his work on their TR10 list of the 10 technologies “most likely to change our world”. Hellerstein is the co-founder and Chief Strategy Officer of Trifacta, a software vendor providing intelligent interactive solutions to the messy problem of wrangling data. He has served on the technical advisory boards of a number of computing and Internet companies including Dell EMC, SurveyMonkey, Captricity, and Datometry, and previously served as the Director of Intel Research, Berkeley.
Type: Talk, Level: Intermediate, Focus Area: AI for Good, Machine Learning
- What happened with the Cloud – no app
- Parallelism – distributed computers – scale up or down, consistency and partial failure
- Serverless Computing: Functions-as-a-Service (FaaS)
- Developers outside AWS, Azure, Google can program the Cloud
- Python for the Cloud
- AutoScaling – yes
- Limitations of FaaS (AWS Lambda): I/O bottlenecks, 15-minute lifetime, no inbound network communication
- Program State: local data – managed across invocations
- Data Gravity – expensive to move
Distributed consistency – data replication: agree on the value of a mutable variable x [update took place]
- Two-phase commit [consensus – Paxos]
- Coordination avoidance: waiting for control – tail latency – distribution of performance
- Slowdown cascades: I/O
- Application semantics: programs require coordination
- Program must have the property of monotonicity
- MONOTONICITY: input grows / output grows – wait on information, not on coordination (see the monotonicity sketch after these notes)
CALM – infinitely-scalable systems – no coordination ->> parallelism and smooth scalability
Monotonicity syntactically in a logic language
Hydro: a Platform for Programming the Cloud
Anna Serverless KVS – Hydro Project
- shared-nothing at all scales (even across Threads)
- Fast under contention: 90% request handling
Cloudburst: A stateful Serverless Platform: CACHE close to compute: Cache consistency
Latency Python, Cloudburst, AWS, AWS Lambda:
- AWS Lambda is SLOW for AI vs Python, Cloudburst
Scalable AWS Lambda simultaneously
- Motion planning compute
- Cloudburst + Anna requirement
@joe_hellerstein
Bloom Lab
RiseLab
Hydro
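A tiny illustration of the monotonicity/CALM point in the notes above: a grow-only set computed by union gives the same answer under every message arrival order, which is why such programs need no coordination. The "replica updates" here are invented.

```python
# Monotone program: the output (a union) only grows as inputs arrive, and the
# final result is identical for every arrival order -- no coordination needed.
from itertools import permutations

updates = [{"a"}, {"b"}, {"a", "c"}]       # invented updates from different replicas

outcomes = set()
for order in permutations(updates):
    state = set()
    for u in order:                        # merge = set union, which is monotone
        state |= u
    outcomes.add(frozenset(state))

print(outcomes)                            # a single outcome: {'a', 'b', 'c'}
```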
Just Machine Learning
(PDT)
Tina Eliassi-Rad, PhD | Professor | Core Faculty | Northeastern University | Network Science Institute
Type: Talk, Level: All Levels, Focus Area: Machine Learning
In 1997, Tom Mitchell defined the well-posed learning problem as follows: “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.” In this talk, I will discuss current tasks, experiences, and performance measures as they pertain to fairness in machine learning. The most popular task thus far has been risk assessment. We know this task comes with impossibility results (e.g., see Kleinberg et al. 2016, Chouldechova 2016). I will highlight new findings in terms of these impossibility results. In addition, most human decision-makers seem to use risk estimates for efficiency purposes and not to make fairer decisions. I will present an alternative task definition whose goal is to provide more context to the human decision-maker. The problems surrounding experience have received the most attention. Joy Buolamwini (MIT Media Lab) refers to these as the “under-sampled majority” problem. The majority of the population is non-white, non-male; however, white males are overrepresented in the training data. Not being properly represented in the training data comes at a cost to the under-sampled majority when machine learning algorithms are used to aid human decision-makers. In terms of performance measures, a variety of definitions exist from group- to individual- to procedural-fairness. I will discuss our null model for fairness and demonstrate how to use deviations from this null model to measure favoritism and prejudice in the data.
Tasks:
- Assessing risk
- Ranking
- Statistical parity among classifiers (see the parity sketch after these notes)
PARITY vs an imperfect classifier – can’t satisfy all three conditions
- Precision
- True positive
- False parity
All classifiers fail to consider context or allow for uncertainty
- Learning to Place within existing cases
- Incentives/values of Human decision maker which incorporate in the decision external factors
- Game-theoretical framework
- How human exemplars make decisions
- Are algorithms value free?
Computational Ethics
- Logically consistent principle
- Camouflage – machine did not learn on the task but on the cloudiness of the sky
- Model Cards for Model Reporting
- The “undersampled majority”
- Experience: Demonstration: Should we learn from demonstrations or from simulations?
- Complex networks: guilt by association vs privilege and prejudice, individual fairness
- Datasheets for Datasets
- Algorithms are like prescription drugs: adverse events
Human vs Machine judgement
- Performance measure – FAIRNESS: Group, individual
- Normativity throughout the entire well-posed learning problem
- Incentive/values
- Human or machines to make decisions?
- Laws are needed if algorithms are used as expert witness
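As a small numeric illustration of the statistical-parity notion in these notes (not the speaker's null-model method), the parity gap is simply the difference in positive-prediction rates between groups; the predictions and group labels below are invented.

```python
# Statistical parity difference: P(yhat = 1 | group A) - P(yhat = 1 | group B).
# A value near zero means both groups receive positive predictions at similar rates.
import numpy as np

y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])             # invented predictions
group = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

rate_a = y_pred[group == "A"].mean()
rate_b = y_pred[group == "B"].mean()
print(f"positive rate A={rate_a:.2f}, B={rate_b:.2f}, parity gap={rate_a - rate_b:+.2f}")
```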
Machine Learning for Biology and Medicine
Sriram Sankararaman, PhD | Professor, Computer Science | University of California – Los Angeles
Type: Talk, Focus Area: Machine Learning
Abstract:
Biology and medicine are deluged with data so that techniques from machine learning and statistics will increasingly play a key role in extracting insights from the vast quantities of data being generated. I will provide an overview of the modeling and inferential challenges that arise in these domains.
In the first part of my talk, I will focus on machine learning problems arising in the field of genomics. The cost of genome sequencing has decreased by over 100,000-fold over the last decade. The availability of genetic variation data from millions of individuals has opened up the possibility of using genetic information to identify the causes of diseases, develop effective drugs, predict disease risk and personalize treatment. While genome-wide association studies offer a powerful paradigm for discovering disease-causing genes, the hidden genetic structure of human populations can confound these studies. I will describe statistical models that can infer this hidden structure and show how these inferences lead to novel insights into the genetic basis of diseases.
In the second part of my talk, I will discuss how the availability of large-scale electronic medical records is opening up the possibility of using machine learning in clinical settings. These electronic medical records are designed to capture a wide range of data associated with a patient including demographic information, laboratory tests, images, medications and clinical notes. Using electronic records from around 60,000 surgeries over five years in the UCLA hospital, I will describe efforts to use machine learning algorithms to predict mortality after surgery. Our results reveal that these algorithms can accurately predict mortality from information available prior to surgery indicating that automated predictive systems have great potential to augment clinical care.
Bio:
Sriram Sankararaman is an assistant professor in the Departments of Computer Science, Human Genetics, and Computational Medicine at UCLA where he leads the machine learning and genomic lab. His research interests lie at the interface of computer science, statistics and biology and is interested in developing statistical machine learning algorithms to make sense of large-scale biomedical data and in using these tools to understand the interplay between evolution, our genomes and traits. He received a B.Tech. in Computer Science from the Indian Institute of Technology, Madras, a Ph.D. in Computer Science from UC Berkeley and was a post-doctoral fellow in Harvard Medical School before joining UCLA. He is a recipient of the Alfred P. Sloan Foundation fellowship (2017), Okawa Foundation grant (2017), the UCLA Hellman fellowship (2017), the NIH Pathway to Independence Award (2014), a Simons Research fellowship (2014), and a Harvard Science of the Human Past fellowship (2012) as well as the Northrop-Grumman Excellence in Teaching Award at UCLA (2019).
- ML & BioMedicine
BioMedical data: high D, heterogeneous, noisy data
- Clinical Data & DL
- Predict death after surgery – 1,000 deaths; complications: sepsis, acute kidney injury
- Mortality during and after surgery
- Collaboration: Anesthesiology, PeriOps, UCLA Health
- Data warehouse – EMR 4/2013 – 12/2018
- 60,000 patients in the data: age, height, weight, gender, ASA status – input from physician
Pre-operative mortality risk prediction – false positives, missing data: which lab data was collected, what were the values
2% of admissions are associated with mortality
SMOTE: over-sampling of cases associated with risk (see the class-imbalance sketch after these notes)
Learning setup: temporal training/testing split, hyperparameters
Models: logistic regression, random forest, gradient-boosted trees
Feature sets: ASA status, surrogate-ASA
- ASA status did not contribute – results were the same with and without it
- Lab values and the timing of labs are the most important features
- The RANDOM FOREST model was selected
- Precision/recall curve
- The model reduced the number of patients flagged by around 20x
Open problems: interoperability, learning over private data
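A compact sketch of the modeling recipe in these notes: oversample the rare positive class with SMOTE (on the training split only), fit a random forest, and inspect the precision/recall trade-off. It uses a synthetic imbalanced dataset rather than the UCLA EMR data and assumes the imbalanced-learn package is available.

```python
# Roughly 2% positives, as in the mortality-prediction setting described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, auc
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)   # oversample minority

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_res, y_res)

probs = clf.predict_proba(X_te)[:, 1]
precision, recall, _ = precision_recall_curve(y_te, probs)
print(f"PR-AUC: {auc(recall, precision):.3f}")
```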
2. Epidemiological data and ML – social distancing in the COVID-19 pandemic
- Effectiveness of social distancing
- SEIR
- Average duration of infection
- Susceptible-Exposed-Infectious-Removed (SEIR) model (see the SEIR sketch after these notes)
- R-naught applied to social distancing: the ratio of Susceptible/Exposed is compared to Infectious/Removed – the lower the better
- Social-distancing relaxation – relaxation in 2022
- COVID spread – estimate when social distancing needs to end
- UK, NY, Spain, France, Germany, Denmark
- Hierarchical Bayesian model: Shared Global parameters, Location-specific, Observations
- Hierarchical Bayesian model SEIR Model: Data generation process
- Empirical Bayes: Maximize likelihood of the global parameters
- Trajectory based on Model Fit
- Estimation of uncertainty
- End of Social distancing – time distribution around a mean
- No seasonality, no infinite immunity, No vaccine
- Quantify Uncertainty
- Work with domain knowledge experts is great
- Single Exponential Smoothing
- ARIMA – long-memory models – Autoregressive AR
- Moving Average (MA) model – short memory
- Integrated AR + MA = ARIMA
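A bare-bones SEIR integration (not the hierarchical Bayesian model from the talk) shows how the compartment dynamics referenced above are typically simulated; every parameter value below is an invented placeholder.

```python
# SEIR compartments: Susceptible -> Exposed -> Infectious -> Removed.
# beta is scaled down to mimic social distancing; all numbers are illustrative.
import numpy as np
from scipy.integrate import odeint

def seir(y, t, beta, sigma, gamma, N):
    S, E, I, R = y
    dS = -beta * S * I / N
    dE = beta * S * I / N - sigma * E
    dI = sigma * E - gamma * I
    dR = gamma * I
    return dS, dE, dI, dR

N = 1_000_000
y0 = (N - 10, 0, 10, 0)                    # almost everyone starts susceptible
t = np.linspace(0, 180, 181)               # days

beta, sigma, gamma = 0.5, 1 / 5.0, 1 / 7.0 # contact, incubation, recovery rates
no_distancing = odeint(seir, y0, t, args=(beta, sigma, gamma, N))
distancing = odeint(seir, y0, t, args=(0.5 * beta, sigma, gamma, N))

print("peak infectious, no distancing:  ", int(no_distancing[:, 2].max()))
print("peak infectious, with distancing:", int(distancing[:, 2].max()))
```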
Learning Intended Reward Functions: Extracting all the Right Information from All the Right Places
Fri, October 30, 3:45 PM (PDT)
Type: Talk, Focus Area: Deep Learning


- Sequential decision making
- Defining what the robot’s goal is
- Autonomous cars
- AI = optimize intended rewards vs specified rewards
- Parametrization of the reward function
- Agents over-learn from specified rewards but under-learn from other sources
- Observing feedback and expressing the human feedback in an observation (human) model
- How can we model reward design/specification as a noisy and suboptimal process?
- Development vs deployment environment
- Robot trusts the development environment
- Good behavior incentivized by the reward
- Maximize winning, maximizing score, minimize winning, minimize score
- Model the demo as a reward-rational implicit choice
- Human feedback as a reward-rational implicit choice
- The state of the environment as a reward-rational implicit choice
- task specification –>> reward
SCHEDULE

TUESDAY, OCTOBER 27TH
Pre-conference Day
ODSC BootCamp – Bootcamp Kickoff (West Virtual)
- Morning sessions (from 10:00 am) – Fundamentals: choose from 6 foundation sessions in Programming, Mathematics for Data Science, and Statistics
- Afternoon sessions (from 2:00 pm) – Fundamentals: choose from 6 foundation sessions in Programming, Mathematics for Data Science, and Statistics

WEDNESDAY, OCTOBER 28TH
Day 1
ODSC Trainings, Workshops & AI Expo, AI x Keynotes
Tracks: Virtual Hands-on Training | Virtual AI x Expo & Demo Hall | Events
- Morning (from 10:00 am) and afternoon (from 2:00 pm) Hands-on Training and Workshops – choose from five 3.5-hour training sessions and six 90-minute workshop sessions

THURSDAY, OCTOBER 29TH
Day 2
ODSC Keynotes, Talks, Trainings, Workshops, AI Expo & Events
Tracks: Virtual Hands-on Training | Virtual AI x Expo & Demo Hall | Virtual Presentations
- Morning (from 10:00 am) and afternoon (from 2:00 pm) Hands-on Training and Workshops – choose from five 3.5-hour training sessions and six 90-minute workshop sessions
- Virtual Exhibitor Showcase & Partner Demo Talks – choose from 12 morning and 12 afternoon partner sessions & visit 25+ virtual partner booths

FRIDAY, OCTOBER 30TH
Day 3
ODSC Keynotes, Talks, Trainings, Workshops, Events, & Career Expo
Tracks: Virtual Hands-on Training | Virtual Presentations | Career Lab and Expo & Poster Sessions
- Morning (from 10:00 am) and afternoon (from 2:00 pm) Hands-on Training and Workshops – choose from five 3.5-hour training sessions and six 90-minute workshop sessions
SPEAKERS
- Nadja Herger, PhD – Data Scientist, Thomson Reuters
- Viktoriia Samatova – Head of Technology & Innovation, Thomson Reuters
- Nina Hristozova – Junior Data Scientist, Thomson Reuters
- Daniel Whitenack, PhD – Instructor, Data Scientist, Data Dan
- David Talby, PhD – CTO, Pacific AI, John Snow Labs
- Tian Zheng, PhD – Chair, Department of Statistics, Columbia University
- Phoebe Liu – Senior Data Scientist, Appen
- Frank Zhao – Senior Director, Quantamental Research, S&P Global Market Intelligence
TOPICS – trends in NLP, including pre-trained models, with use cases focusing on deep learning, speech-to-text, and semantic search.
- Natural Language Processing
- NLP Transformers
- Pre-trained Models
- Text Analytics
- Natural Language Understanding
- Sentiment Analysis
- Natural Language Generation
- Speech Recognition
- Named Entity Extraction
MODELS
- BERT
- XLNet
- GPT-2
- Transformers
- Word2Vec
- Deep Learning Models
- RNN & LSTM
- Machine Learning Models
- ULMFiT
- Transfer Learning
TOOLS
- TensorFlow 2.0
- Hugging Face Transformers
- PyTorch
- Theano
- SpaCy
- NLTK
- AllenNLP
- Stanford CoreNLP
- Keras
- FLAIR