12,399 views

OGAWA, Tadashi

@ogawa_tter

, 37 tweets, 53 min read

My Authors

=>
"Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels", PMBS 2019 conferences.computer.org/sc19w/2019/pdf…
Slides dcs.warwick.ac.uk/pmbs/pmbs19/PM…
Open Source Architecture Code Analyzer github.com/RRZE-HPC/OSACA
x86 and ARM
Best Late-Breaking Paper Award blogs.fau.de/hager/archives…

https://twitter.com/ogawa_tter/status/1196459854509985795

https://twitter.com/ogawa_tter/status/1196459854509985795

=>
"Performance Portability of a Wilson Dslash Stencil Operator Mini-App Using Kokkos and SYCL", Jefferson Lab, NVIDIA, Intel, Sandia, NERSC, P3HPC 2019
PDF conferences.computer.org/sc19w/2019/pdf…
Slides drive.google.com/file/d/1rBIzzd…
V100, K80
Skylake
KNL
Intel Gen9 GPU
oneAPI

https://twitter.com/ogawa_tter/status/1196459854509985795

https://twitter.com/ogawa_tter/status/1201714725341388800

https://twitter.com/ogawa_tter/status/1201714725341388800

=>
"Performance Portability of Multi-Material Kernels", P3HPC 2019
conferences.computer.org/sc19w/2019/pdf…
Slides drive.google.com/file/d/12asEc4…
OpenMP, OpenACC, CUDA, SYCL, KOKKOS,
ThunderX2
POWER9
Broadwell, Skylake, KNL
P100, V100
(Radeon VII)
scholar.google.co.uk/citations?hl=e…

https://twitter.com/ogawa_tter/status/1201714725341388800

oneAPI

https://twitter.com/ogawa_tter/status/1197305719432732674

https://twitter.com/ogawa_tter/status/1197305719432732674

=>
Intel oneAPI
Raja Koduri (+ Xe) Nov 17, 2019

https://twitter.com/ogawa_tter/status/1197305719432732674

https://twitter.com/ogawa_tter/status/1196459854509985795

Performance Portability, P3HPC 2019

https://twitter.com/ogawa_tter/status/1201714725341388800

https://twitter.com/ogawa_tter/status/1201717223531765760

Codeplay, SC19: supporting SYCL, Data Parallel C++ & oneAPI on NVIDIA GPUs

https://twitter.com/codeandrew/status/1196507585748307968

@simonmcs

@simonmcs

=>
"Performance Portability Across Diverse Computer Architectures", University of Bristol, P3HPC 2019
Paper sc19.supercomputing.org/proceedings/wo…
Slides drive.google.com/file/d/1VcF5YB…
@simonmcs , Opening Keynote, DoE PPP, Apr 2, 2019 drive.google.com/file/d/127YwGA…
HPC - U of Bristol uob-hpc.github.io

https://twitter.com/ogawa_tter/status/1190170841738137600

https://twitter.com/ogawa_tter/status/1190170841738137600

https://twitter.com/ogawa_tter/status/1190170841738137600

*) Loop back ...

https://twitter.com/ogawa_tter/status/1196459854509985795

https://twitter.com/ogawa_tter/status/1196459854509985795

=>
On Demand, Dec 2019
Introducing oneAPI: A Single Programming Model to Deliver Cross Architecture Performance techdecoded.intel.io/essentials/int…
PDF simplecore-ger.intel.com/techdecoded/wp…
Introducing Data Parallel C++ (DPC++ Part 1) techdecoded.intel.io/essentials/dpc…
simplecore-ger.intel.com/techdecoded/wp…

https://twitter.com/ogawa_tter/status/1196459854509985795

https://twitter.com/ogawa_tter/status/1203019804044431360

https://twitter.com/ogawa_tter/status/1203019804044431360

=>
Intel DPC++ Compatibility tool (Beta)
Minimizes code migration time
Assists developers migrating code written in CUDA to DPC++ once, generating human readable code wherever possible
~80 - 90% of code migrates automatically

Introducing oneAPI, Dec 2019

https://twitter.com/ogawa_tter/status/1203019804044431360

https://twitter.com/ogawa_tter/status/1203019804044431360

https://twitter.com/ogawa_tter/status/1203019804044431360

=>
oneAPI Level 0 Specification spec.oneapi.com/oneL0/
Objective spec.oneapi.com/oneL0/core_INT…
Level-Zero API is to provide direct-to-metal interfaces to offload accelerator devices.
GPU, Accelerator, FPGA

oneAPI, Ondemand Webinars, Dec 2019

https://twitter.com/ogawa_tter/status/1203019804044431360

https://twitter.com/ogawa_tter/status/1144172890952065024

https://twitter.com/ogawa_tter/status/1144172890952065024

=>
"Exploiting Mixed Precision Arithmetic for Fast Solution of Positive Definite Linear Systems", Nicholas Higham, Manchester, UBC Distinguished Lectur, Dec 5 2019 youtube.com/watch?v=wLr62q…
dropbox.com/s/mt004qqa7qut…
bf16, Intel

https://twitter.com/ogawa_tter/status/1144172890952065024

TPUv2/v3

https://twitter.com/ogawa_tter/status/1192329877963149312

=>
"Tools for Monitoring CPU Usage and Affinity in Multicore Supercomputers", TACC, HUST 2019 (SC19)
PDF sc19.supercomputing.org/proceedings/wo…
Slides hust-workshop.github.io/2019_presentat…
core_usage github.com/TACC/core_usage
show_affinity github.com/TACC/show_affi…
amask github.com/tacc/amask

=>
"Bringing NVIDIA GPU support to SYCL developers", Dec 16, 2019 codeplay.com/portal/12-16-1…
"experimental" support for NVIDIA with ComputeCpp using CUDA
=> In 2020, Codeplay will be releasing a fully open source project supporting for NVIDIA devices to SYCL developers
with oneAPI

https://twitter.com/ogawa_tter/status/1203019804044431360

https://twitter.com/ogawa_tter/status/1203019804044431360

=>
[ oneAPI Ondemand Webinar ]
Dec 3, 4, 2019
Introducing oneAPI
DPC++ Part 1: Intro

https://twitter.com/ogawa_tter/status/1203019804044431360

Dec 11, 2019
DPC++ Part 2: Programming Model: Best Practices techdecoded.intel.io/essentials/dpc…
Slides simplecore-ger.intel.com/techdecoded/wp…

Intel oneAPI Toolkits (Beta) software.intel.com/en-us/oneapi

=>
A European Dialect For Exascale Programming, Jan 6 2020 nextplatform.com/2020/01/06/a-e…
Euro joint Effort toward a Highly Productive Programming Environment for Heterogeneous Exascale Computing epeec-project.eu
Productivity at Exascale, PRACE booth, SC19 prace-ri.eu/IMG/pdf/EPEEC_…

=>
"On the Correct Measurement of Application Memory Bandwidth and Memory Access Latency", HPC Asia 2020 dl.acm.org/doi/abs/10.114…
github.com/helchr/perfMem…
comet: HWL, E5-2699v3, 1847Mhz
spica BWL, E7-8890v4, 1600Mhz
arcturus: BWL, E5-2699v4, 2400Mhz
rigel: SKL, Xeon 8176, 2666Mhz

https://twitter.com/ogawa_tter/status/1205218095213662208

https://twitter.com/ogawa_tter/status/1205218095213662208

=>
"Gathering Intel on Intel AVX-512 Transitions", Jan 17, 2020 travisdowns.github.io/blog/2020/01/1…
"examine the AVX-512 downclocking behavior using target tests"
Xeon W-2104
3.2, 2.8, 2.4 GHz

CPU & DRAM

https://twitter.com/ogawa_tter/status/1205218095213662208

Myth of Peak Performance (Cascade Lake)

https://twitter.com/ogawa_tter/status/1204810893952815105

https://twitter.com/ogawa_tter/status/1158578230988701697

https://twitter.com/ogawa_tter/status/1158578230988701697

=>
"AsmDB: Understanding and Mitigating Front-end Stalls in Warehouse-Scale Computers", ..., Parthasarathy Ranganathan (Google), ISCA 2019, PDF liberty.princeton.edu/Publications/i…
∗Work performed while these authors were at Google.
Memory Hierarchy for .,, HPCA 2018,

https://twitter.com/ogawa_tter/status/1158578230988701697

=>
"Avoiding Mistakes in Large Scale Performance Analysis", Kingsum Chow, Alibaba, IMPACT 2019, Feb 21, 2019 slideslive.com/38913881/avoid…
"Performance Analysis of Alibaba Large-Scale Data Center", Jianmei Guo, Senior technical expert, Alibaba Blog, Apr 11, 2019 alibabacloud.com/blog/performan…

https://twitter.com/ogawa_tter/status/743952669778161668

https://twitter.com/ogawa_tter/status/743952669778161668

=>
"Scalability of performance monitoring", Google, Linux Plumbers Conference 2019, PDF linuxplumbersconf.org/event/4/contri…
Borg, EuroSys'15

https://twitter.com/ogawa_tter/status/743952669778161668

"Google Flex", John Wilkes, May 2018

https://twitter.com/ogawa_tter/status/1002940131500699653

Google-Wide Profiling, IEEE Micro 2010 PDF storage.googleapis.com/pub-tools-publ…

https://twitter.com/ogawa_tter/status/1218971459818643456

https://twitter.com/ogawa_tter/status/1218971459818643456

=>
"Hardware Performance Monitoring Landscape", Stephane Eranian, Google, Keynote, ProTools19 protools19.github.io/slides/Eranian…
Scalability of performance monitoring

https://twitter.com/ogawa_tter/status/1218971459818643456

Google-Wide Profiling, Borg
AsmDB, ISCA 2019

https://twitter.com/ogawa_tter/status/1218965628976893952

Alibaba

https://twitter.com/ogawa_tter/status/1218965891037057025

https://twitter.com/ogawa_tter/status/1219305558273998848

https://twitter.com/ogawa_tter/status/1219305558273998848

=>
"How TMA (Top-down Microarchitecture Analysis) Addresses Challenges in Modern Servers and Enhancements Coming in IceLake", Ahmad Yasin, Intel, Scalable Tools WS, Jul 10, 2018 dyninst.github.io/scalable_tools…
Reversim Summit, Jun 17, 2019 youtube.com/watch?v=NabAgE…

https://twitter.com/ogawa_tter/status/1219305558273998848

https://twitter.com/ogawa_tter/status/1219373557387448320

https://twitter.com/ogawa_tter/status/1219373557387448320

=>
"A Metric-Guided Method (MGM) for Discovering Impactful Features and Architectural Insights for Skylake-Based Processors", A. Yasin (U of Haifa and Intel), et al., ACM TACO, Dec 2019 dl.acm.org/doi/abs/10.114…
Top-down Microarchitecture Analysis, A. Yasin

https://twitter.com/ogawa_tter/status/1219373557387448320

https://twitter.com/ogawa_tter/status/1206730245618880512

https://twitter.com/ogawa_tter/status/1206730245618880512

=>
Intel oenAPI, CERN openlab Technical Workshop, Jan 23, 2020 indico.cern.ch/event/853334/c…
Single Programming Model to Deliver Cross-Architecture Performance
Ondemand Webinars, Dec 2029

https://twitter.com/ogawa_tter/status/1206730245618880512

"Exascale for Everyone", @Rajaontheedge Nov 2019

https://twitter.com/ogawa_tter/status/1207633224962166791

@Rajaontheedge

@Rajaontheedge

@Rajaontheedge =>
"First experience of CMS reconstruction with OneAPI", CERN, CERN openlab Technical Workshop, Jan 23, 2020 indico.cern.ch/event/853334/c…
Host Emulation device: 5888.52 us
Core i9-9900K, 3.60GHz: 5728.53 us
Gen9 HD Graphics NEO: 1653.66 us
oneAPI, Jan 23, 2020

https://twitter.com/ogawa_tter/status/1220329157793501190

https://twitter.com/ogawa_tter/status/1204151158622711808

https://twitter.com/ogawa_tter/status/1204151158622711808

=>
"Empirical study of Amdahl's law on multicore processors", Research Report, Dec 11, 2019 hal.inria.fr/hal-02404346
2x [Xeon Gold 6130 + (2x 16 GiB DDR4 (2666 MT/s)]
Peak: 85:33GB/s
GCC, ICC & Clang, OpenMP
Kairos team.inria.fr/kairos/
Energy Efficiency

https://twitter.com/ogawa_tter/status/1204151158622711808

https://twitter.com/ogawa_tter/status/1103890090047074304

https://twitter.com/ogawa_tter/status/1103890090047074304

=>
"Slurmのジョブスケジューリングと実装"、東京大学　坂本龍一、Slurm User Group Meetup Tokyo #SUGMT 、2020年1月22日 slideshare.net/RyuichiSakamot…
Presentations from Slurm UG Meeting slurm.schedmd.com/publications.h…
Slurm on GCP, Mar 2019

https://twitter.com/ogawa_tter/status/1103890090047074304

IPDPS 2017

https://twitter.com/ogawa_tter/status/1220026063171932160

https://twitter.com/ogawa_tter/status/1220147202913210368

https://twitter.com/ogawa_tter/status/1220147202913210368

=>
"MLIR Tutorial: Building a Compiler with MLIR", Google, WS on MLIR for HPC, Oct 21, 2019, PDF cs.utah.edu/~mhall/mlir4hp…
github.com/tensorflow/mlir
=> mlir.llvm.org
github.com/llvm/llvm-proj…
High-Performance Compilation Opportunities in MLIR, Jan 2020

https://twitter.com/ogawa_tter/status/1220147202913210368

https://twitter.com/ogawa_tter/status/1224290431023730689

https://twitter.com/ogawa_tter/status/1224290431023730689

=>
"Flang: The Fortran Frontend of LLVM", FOSDEM, Feb 1, 2020 fosdem.org/2020/schedule/…
PDF fosdem.org/2020/schedule/…
New Flang/F18 github.com/flang-compiler…

"An MLIR Dialect for High-Level Optimization of Fortran", NVIDIA, Oct 2019 youtube.com/watch?v=ff3ngd…
MLIR

https://twitter.com/ogawa_tter/status/1224290431023730689

https://twitter.com/ogawa_tter/status/1220329157793501190

https://twitter.com/ogawa_tter/status/1220329157793501190

=>
"Codeplay contribution to DPC++ brings SYCL support for NVIDIA GPUs", Feb 3, 2020 codeplay.com/portal/02-03-2…
Adding to the DPC++ compiler is based directly on NVIDIA's CUDA, rather than OpenCL
oneAPI, Jan 2020

https://twitter.com/ogawa_tter/status/1220329157793501190

Ondemand Webinars, Dec 2019
Intro, DPC++ 1 & 2

https://twitter.com/ogawa_tter/status/1108251335533592576

https://twitter.com/ogawa_tter/status/1108251335533592576

=>
Performance Analysis of the AMD EPYC Rome, J. Munoz, Cardiff U, CIUK 2019, Dec 5, 2019, (91 pp) scd.stfc.ac.uk/SiteAssets/Pag…
AMD EPYC 7452 / 7502 / (7742)
Skylake Gold 6148
Molecular Dynamics: 5s
Quantum Espressoand, VASP
GAMESS-UK/US
OpenFOAM
NEMO
CIUK 2018

https://twitter.com/ogawa_tter/status/1108251335533592576

=>
"To OpenACC 3.0, and Beyond!", Jeff Larkin, NVIDIA and OpenACC Technical Committee Chair, UTK ICL Lunch Seminar, Jan 24, 2020, PDF icl.cs.utk.edu/newsletter/pre…
OpenACC, Ver 3.0, Dec 18, 2019 openacc.org/blog/openacc-30
Spec, Nov 2019, PDF openacc.org/sites/default/…
devblogs.nvidia.com/author/jlarkin/

https://twitter.com/ogawa_tter/status/1214603923782135808

https://twitter.com/ogawa_tter/status/1214603923782135808

=>
"KUtrace 2020", Richard L. Sites, Stanford EE380, Feb 5 2020 youtube.com/watch?v=2HE7tS…
web.stanford.edu/class/ee380/Ab…
github.com/dicksites/KUtr…
Low-overhead tracing of all Linux kernel-user transitions
Alpha, Apr 1992

https://twitter.com/ogawa_tter/status/1214603923782135808

DC Computers, Feb 2015

https://twitter.com/ogawa_tter/status/583793250739167232

=>
SIAM Conference on Parallel Processing for Scientific Computing (PP20), Feb 12-15, 2020 siam.org/conferences/cm…
Advances in Algorithms Exploiting Low Precision Floating-Point Arithmetic
I of II, Feb 12 meetings.siam.org/sess/dsp_progr…
II of II, Feb 13 meetings.siam.org/sess/dsp_progr…

@codeplaysoft

@codeplaysoft

=>
"SYCL in HPC", @codeplaysoft , WS on Efficient Computing for HEP. Feb 18, 2020, PDF indico.ph.ed.ac.uk/event/66/contr…
Acoran

https://twitter.com/ogawa_tter/status/1196480002176405508

codeplay.com/products/acora…
Codeplay contribution to DPC++ (oneAPI) brings SYCL support for NVIDIA GPUs, Feb 3, 2020

https://twitter.com/ogawa_tter/status/1224315006054830080

https://twitter.com/ogawa_tter/status/1220329157793501190

https://twitter.com/ogawa_tter/status/1220329157793501190

=>
[ Webinar ] Intel oneAPI, Analyzer Training, IXPUB, Feb 20, 2020 ixpug.org/resources/perf…
Slides ixpug.org/resources/down…
Intel Advisor
Offload Advisor
GPU Roofline Analysis
Intel VTune Profiler
GPU Analysis
Analyzing GPU
oenAPI, Jan 23, 2020

https://twitter.com/ogawa_tter/status/1220329157793501190

Webinars

=>
"AVX-512 Auto-Vectorization in MSVC", Feb 27, 2020 devblogs.microsoft.com/cppblog/avx-51…
"In Visual Studio 2019 version 16.3 we added AVX-512 support to the auto-vectorizer of the MSVC compiler."
What is the auto vectorizer?
What is AVX-512?
How to enable AVX-512 vectorization?

https://twitter.com/ogawa_tter/status/1220147202913210368

https://twitter.com/ogawa_tter/status/1220147202913210368

=>
"High Performance Code Generation in MLIR: An Early Case Study with GEMM", Uday Bondhugula, arXiv, Mar 1, 2020 arxiv.org/abs/2003.00532
polymagelabs.com/index.html
IMPACT 2020

https://twitter.com/ogawa_tter/status/1220147202913210368

Polyhedral

https://twitter.com/ogawa_tter/status/1219569529778360322

https://twitter.com/ogawa_tter/status/1235967046032080896

A New Golden Age for .

Enjoying this thread?

Try unrolling a thread yourself!

Trending hashtags

Enjoying this thread?

Try unrolling a thread yourself!

More from @ogawa_tter see all

Related threads

Trending hashtags

Embed code for your website

Did Thread Reader help you today?