Writing Distributed Applications with PyTorch
2017-09-02T00:00:00+00:00
https://shalab.usc.edu/writing-distributed-applications-with-pytorch
<p><strong>Abstract</strong>
In this short tutorial, we will be going over the distributed package of PyTorch. We’ll see how to set up the distributed setting, use the different communication strategies, and go over part of the internals of the package.</p>
<h1 id="setup">Setup</h1>
<!--
* Processes & machines
* variables and init_process_group
-->
<p>The distributed package included in PyTorch (i.e., <code class="highlighter-rouge">torch.distributed</code>) enables researchers and practitioners to easily distribute their computations across processes and clusters of machines. To do so, it leverages message passing semantics, allowing each process to communicate data to any of the other processes. As opposed to the multiprocessing (<code class="highlighter-rouge">torch.multiprocessing</code>) package, processes can use different communication backends and are not restricted to being executed on the same machine.</p>
<p>In order to get started, we should thus be able to run multiple processes simultaneously. If you have access to a compute cluster, you should check with your local sysadmin or use your favorite coordination tool (e.g., <a href="https://linux.die.net/man/1/pdsh">pdsh</a>, <a href="http://cea-hpc.github.io/clustershell/">clustershell</a>, or <a href="https://slurm.schedmd.com/">others</a>). For the purpose of this tutorial, we will use a single machine and fork multiple processes using the following template.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"""run.py:"""</span>
<span class="c">#!/usr/bin/env python</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.distributed</span> <span class="k">as</span> <span class="n">dist</span>
<span class="kn">from</span> <span class="nn">torch.multiprocessing</span> <span class="kn">import</span> <span class="n">Process</span>
<span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="n">rank</span><span class="p">,</span> <span class="n">size</span><span class="p">):</span>
<span class="s">""" Distributed function to be implemented later. """</span>
<span class="k">pass</span>
<span class="k">def</span> <span class="nf">init_processes</span><span class="p">(</span><span class="n">rank</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">fn</span><span class="p">,</span> <span class="n">backend</span><span class="o">=</span><span class="s">'tcp'</span><span class="p">):</span>
<span class="s">""" Initialize the distributed environment. """</span>
<span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'MASTER_ADDR'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'127.0.0.1'</span>
<span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'MASTER_PORT'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'29500'</span>
<span class="n">dist</span><span class="o">.</span><span class="n">init_process_group</span><span class="p">(</span><span class="n">backend</span><span class="p">,</span> <span class="n">rank</span><span class="o">=</span><span class="n">rank</span><span class="p">,</span> <span class="n">world_size</span><span class="o">=</span><span class="n">size</span><span class="p">)</span>
<span class="n">fn</span><span class="p">(</span><span class="n">rank</span><span class="p">,</span> <span class="n">size</span><span class="p">)</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
<span class="n">size</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">processes</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">rank</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">size</span><span class="p">):</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">Process</span><span class="p">(</span><span class="n">target</span><span class="o">=</span><span class="n">init_processes</span><span class="p">,</span> <span class="n">args</span><span class="o">=</span><span class="p">(</span><span class="n">rank</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">run</span><span class="p">))</span>
<span class="n">p</span><span class="o">.</span><span class="n">start</span><span class="p">()</span>
<span class="n">processes</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
<span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">processes</span><span class="p">:</span>
<span class="n">p</span><span class="o">.</span><span class="n">join</span><span class="p">()</span>
</code></pre></div></div>
<p>In the above, the script spawns two processes that will each set up the distributed environment, initialize the process group (<code class="highlighter-rouge">dist.init_process_group</code>), and finally execute the given function.</p>
<p>The <code class="highlighter-rouge">init_processes</code> function is what interests us for now. It ensures that every process will be able to coordinate through a master, using the same IP address and port. Note that we used the TCP backend, but we could have used <a href="https://en.wikipedia.org/wiki/Message_Passing_Interface">MPI</a> or <a href="http://github.com/facebookincubator/gloo">Gloo</a> instead, provided they are installed. We will go over the magic happening in <code class="highlighter-rouge">dist.init_process_group</code> at the end of this tutorial, but it essentially allows processes to communicate with each other by sharing their locations.</p>
<h1 id="point-to-point-communication">Point-to-Point Communication</h1>
<!--
* send/recv
* isend/irecv
-->
<table>
<tbody>
<tr>
<td align="center">
<img src='http://seba-1511.github.io/dist_tuto.pth/figs/send_recv.png' width=100% /><br />
<b>Send and Recv</b>
</td>
</tr>
</tbody>
</table>
<p>A transfer of data from one process to another is called a point-to-point communication. These are achieved through the <code class="highlighter-rouge">send</code> and <code class="highlighter-rouge">recv</code> functions or their <em>immediate</em> counterparts, <code class="highlighter-rouge">isend</code> and <code class="highlighter-rouge">irecv</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"""Blocking point-to-point communication."""</span>
<span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="n">rank</span><span class="p">,</span> <span class="n">size</span><span class="p">):</span>
<span class="n">tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="k">if</span> <span class="n">rank</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">tensor</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="c"># Send the tensor to process 1</span>
<span class="n">dist</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="n">tensor</span><span class="o">=</span><span class="n">tensor</span><span class="p">,</span> <span class="n">dst</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="c"># Receive tensor from process 0</span>
<span class="n">dist</span><span class="o">.</span><span class="n">recv</span><span class="p">(</span><span class="n">tensor</span><span class="o">=</span><span class="n">tensor</span><span class="p">,</span> <span class="n">src</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Rank '</span><span class="p">,</span> <span class="n">rank</span><span class="p">,</span> <span class="s">' has data '</span><span class="p">,</span> <span class="n">tensor</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
</code></pre></div></div>
<p>In the above example, both processes start with a zero tensor, then process 0 increments the tensor and sends it to process 1 so that they both end up with 1.0. Notice that process 1 needs to allocate memory in order to store the data it will receive.</p>
<p>Also notice that <code class="highlighter-rouge">send</code>/<code class="highlighter-rouge">recv</code> are <strong>blocking</strong>: both processes stop until the communication is completed. Immediates, on the other hand, are <strong>non-blocking</strong>: the script continues its execution and the methods return a <code class="highlighter-rouge">DistributedRequest</code> object upon which we can choose to <code class="highlighter-rouge">wait()</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"""Non-blocking point-to-point communication."""</span>
<span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="n">rank</span><span class="p">,</span> <span class="n">size</span><span class="p">):</span>
<span class="n">tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">req</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">if</span> <span class="n">rank</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">tensor</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="c"># Send the tensor to process 1</span>
<span class="n">req</span> <span class="o">=</span> <span class="n">dist</span><span class="o">.</span><span class="n">isend</span><span class="p">(</span><span class="n">tensor</span><span class="o">=</span><span class="n">tensor</span><span class="p">,</span> <span class="n">dst</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Rank 0 started sending'</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="c"># Receive tensor from process 0</span>
<span class="n">req</span> <span class="o">=</span> <span class="n">dist</span><span class="o">.</span><span class="n">irecv</span><span class="p">(</span><span class="n">tensor</span><span class="o">=</span><span class="n">tensor</span><span class="p">,</span> <span class="n">src</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Rank 1 started receiving'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Rank 1 has data '</span><span class="p">,</span> <span class="n">tensor</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">req</span><span class="o">.</span><span class="n">wait</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Rank '</span><span class="p">,</span> <span class="n">rank</span><span class="p">,</span> <span class="s">' has data '</span><span class="p">,</span> <span class="n">tensor</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
</code></pre></div></div>
<p>Running the above function a couple of times will sometimes show process 1 still holding 0.0 even though it has already started receiving. However, after <code class="highlighter-rouge">req.wait()</code> has returned, we are guaranteed that the communication took place.</p>
<p>Point-to-point communication is useful when we want fine-grained control over the communication of our processes. It can be used to implement fancy algorithms, such as the one used in <a href="https://github.com/baidu-research/baidu-allreduce">Baidu’s DeepSpeech</a> or <a href="https://research.fb.com/publications/imagenet1kin1h/">Facebook’s large-scale experiments</a>.</p>
<h1 id="collective-communication">Collective Communication</h1>
<!--
* gather
* reduce
* broadcast
* scatter
* all_reduce
-->
<table>
<tbody>
<tr>
<td align="center">
<img src='http://seba-1511.github.io/dist_tuto.pth/figs/scatter.png' width=100% /><br />
<b>Broadcast</b>
</td>
<td align="center">
<img src='http://seba-1511.github.io/dist_tuto.pth/figs/all_gather.png' width=100% /><br />
<b>AllGather</b>
</td>
</tr><tr>
<td align="center">
<img src='http://seba-1511.github.io/dist_tuto.pth/figs/reduce.png' width=100% /><br />
<b>Reduce</b>
</td>
<td align="center">
<img src='http://seba-1511.github.io/dist_tuto.pth/figs/all_reduce.png' width=100% /><br />
<b>AllReduce</b>
</td>
</tr>
</tbody>
</table>
<p>As opposed to point-to-point communication, collectives allow for communication patterns across all processes in a <strong>group</strong>. A group is a subset of all your processes. To create a group, we can pass a list of ranks to <code class="highlighter-rouge">dist.new_group(group)</code>. By default, collectives are executed on all processes, also known as the <strong>world</strong>. For example, in order to obtain the sum of all tensors on all processes, we can use the <code class="highlighter-rouge">dist.all_reduce(tensor, op, group)</code> collective.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">""" All-Reduce example."""</span>
<span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="n">rank</span><span class="p">,</span> <span class="n">size</span><span class="p">):</span>
<span class="s">""" Simple collective communication. """</span>
<span class="n">group</span> <span class="o">=</span> <span class="n">dist</span><span class="o">.</span><span class="n">new_group</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>
<span class="n">tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">dist</span><span class="o">.</span><span class="n">all_reduce</span><span class="p">(</span><span class="n">tensor</span><span class="p">,</span> <span class="n">op</span><span class="o">=</span><span class="n">dist</span><span class="o">.</span><span class="n">reduce_op</span><span class="o">.</span><span class="n">SUM</span><span class="p">,</span> <span class="n">group</span><span class="o">=</span><span class="n">group</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Rank '</span><span class="p">,</span> <span class="n">rank</span><span class="p">,</span> <span class="s">' has data '</span><span class="p">,</span> <span class="n">tensor</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
</code></pre></div></div>
<p>Since we wanted the sum of all tensors in the group, we used <code class="highlighter-rouge">dist.reduce_op.SUM</code> as the reduce operator. Generally speaking, any commutative mathematical operation can be used as an operator. PyTorch comes with 4 out of the box, all working at the element-wise level:</p>
<ul>
<li><code class="highlighter-rouge">dist.reduce_op.SUM</code>,</li>
<li><code class="highlighter-rouge">dist.reduce_op.PRODUCT</code>,</li>
<li><code class="highlighter-rouge">dist.reduce_op.MAX</code>,</li>
<li><code class="highlighter-rouge">dist.reduce_op.MIN</code>.</li>
</ul>
<p>Including <code class="highlighter-rouge">dist.all_reduce(tensor, op, group)</code>, there are a total of 4 collectives currently implemented in PyTorch.</p>
<ul>
<li><code class="highlighter-rouge">dist.broadcast(tensor, src, group)</code>: Copies tensor from src to all other processes.</li>
<li><code class="highlighter-rouge">dist.reduce(tensor, dst, op, group)</code>: Applies op to every tensor and stores the result at dst.</li>
<li><code class="highlighter-rouge">dist.all_reduce(tensor, op, group)</code>: Same as reduce, but the result is stored at all processes.</li>
<li><code class="highlighter-rouge">dist.all_gather(tensor_list, tensor, group)</code>: Copies tensor from all processes to tensor_list, on all processes.</li>
</ul>
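<p>To make the semantics of these collectives concrete, here is a minimal plain-Python simulation (not part of the original tutorial) in which position <code>i</code> of a list stands for the tensor held by rank <code>i</code>. It only illustrates what each collective leaves on each process; real code would call the <code>dist.*</code> functions above.</p>

```python
def broadcast(tensors, src):
    """After a broadcast, every rank holds the tensor from rank `src`."""
    return [tensors[src] for _ in tensors]

def reduce(tensors, dst, op=sum):
    """Only rank `dst` holds the reduced value; other ranks keep their own tensor."""
    result = op(tensors)
    return [result if rank == dst else t for rank, t in enumerate(tensors)]

def all_reduce(tensors, op=sum):
    """Every rank holds the reduced value."""
    result = op(tensors)
    return [result for _ in tensors]

def all_gather(tensors):
    """Every rank holds the full list of tensors."""
    return [list(tensors) for _ in tensors]
```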
<h3 id="what-about-scatter-and-gather-">What about scatter and gather?</h3>
<table>
<tbody>
<tr>
<td align="center">
<img src='http://seba-1511.github.io/dist_tuto.pth/figs/scatter.png' width=100% /><br />
<b>Scatter</b>
</td>
<td align="center">
<img src='http://seba-1511.github.io/dist_tuto.pth/figs/gather.png' width=100% /><br />
<b>Gather</b>
</td>
</tr>
</tbody>
</table>
<p>Those familiar with MPI will have noticed that the gather and scatter methods are absent from the current API. However, PyTorch exposes</p>
<ul>
<li><code class="highlighter-rouge">dist.scatter_send(tensor_list, tensor, group)</code>,</li>
<li><code class="highlighter-rouge">dist.scatter_recv(tensor, dst, group)</code>,</li>
<li><code class="highlighter-rouge">dist.gather_send(tensor_list, tensor, group)</code>, and</li>
<li><code class="highlighter-rouge">dist.gather_recv(tensor, dst, group)</code></li>
</ul>
<p>which can be used to implement the standard scatter and gather behaviours.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">""" Custom scatter and gather implementation. """</span>
<span class="k">def</span> <span class="nf">scatter</span><span class="p">(</span><span class="n">tensor</span><span class="p">,</span> <span class="n">tensor_list</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">root</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">group</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
<span class="s">"""
Sends the ith tensor in tensor_list on root to the ith process.
"""</span>
<span class="n">rank</span> <span class="o">=</span> <span class="n">dist</span><span class="o">.</span><span class="n">get_rank</span><span class="p">()</span>
<span class="k">if</span> <span class="n">group</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
<span class="n">group</span> <span class="o">=</span> <span class="n">dist</span><span class="o">.</span><span class="n">group</span><span class="o">.</span><span class="n">WORLD</span>
<span class="k">if</span> <span class="n">rank</span> <span class="o">==</span> <span class="n">root</span><span class="p">:</span>
<span class="k">assert</span><span class="p">(</span><span class="n">tensor_list</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">)</span>
<span class="n">dist</span><span class="o">.</span><span class="n">scatter_send</span><span class="p">(</span><span class="n">tensor_list</span><span class="p">,</span> <span class="n">tensor</span><span class="p">,</span> <span class="n">group</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">dist</span><span class="o">.</span><span class="n">scatter_recv</span><span class="p">(</span><span class="n">tensor</span><span class="p">,</span> <span class="n">root</span><span class="p">,</span> <span class="n">group</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">gather</span><span class="p">(</span><span class="n">tensor</span><span class="p">,</span> <span class="n">tensor_list</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">root</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">group</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
<span class="s">"""
Sends tensor to root process, which stores it in tensor_list.
"""</span>
<span class="n">rank</span> <span class="o">=</span> <span class="n">dist</span><span class="o">.</span><span class="n">get_rank</span><span class="p">()</span>
<span class="k">if</span> <span class="n">group</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
<span class="n">group</span> <span class="o">=</span> <span class="n">dist</span><span class="o">.</span><span class="n">group</span><span class="o">.</span><span class="n">WORLD</span>
<span class="k">if</span> <span class="n">rank</span> <span class="o">==</span> <span class="n">root</span><span class="p">:</span>
<span class="k">assert</span><span class="p">(</span><span class="n">tensor_list</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">)</span>
<span class="n">dist</span><span class="o">.</span><span class="n">gather_recv</span><span class="p">(</span><span class="n">tensor_list</span><span class="p">,</span> <span class="n">tensor</span><span class="p">,</span> <span class="n">group</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">dist</span><span class="o">.</span><span class="n">gather_send</span><span class="p">(</span><span class="n">tensor</span><span class="p">,</span> <span class="n">root</span><span class="p">,</span> <span class="n">group</span><span class="p">)</span>
</code></pre></div></div>
<h1 id="distributed-training">Distributed Training</h1>
<ul>
<li>Gloo Backend</li>
<li>Simple all_reduce on the gradients</li>
<li>Point to optimized DistributedDataParallel</li>
</ul>
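<p>The “simple all_reduce on the gradients” bullet can be sketched arithmetically: each process computes its own gradients, the summed gradients are all-reduced, and every process divides by the world size. Below is a hypothetical <code>average_gradients</code> helper in which the all-reduce is simulated with an explicit sum so the arithmetic is easy to check; in a real script, the inner sum would be a call to <code>dist.all_reduce</code> with <code>dist.reduce_op.SUM</code> on each gradient tensor.</p>

```python
def average_gradients(grads_per_rank):
    """Simulate all-reduce gradient averaging.

    grads_per_rank[i] is the list of gradients computed by rank i.
    Returns the averaged gradients that every rank would end up with.
    """
    world_size = len(grads_per_rank)
    # all_reduce(SUM) leaves the element-wise sum on every rank ...
    summed = [sum(gs) for gs in zip(*grads_per_rank)]
    # ... and each rank then divides by the world size.
    return [g / world_size for g in summed]
```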
<h1 id="internals">Internals</h1>
<ul>
<li>The magic behind init_process_group:</li>
</ul>
<ol>
<li>validate and parse the arguments</li>
<li>resolve the backend: name2channel.at()</li>
<li>Drop GIL & THDProcessGroupInit: instantiate the channel and add address of master from config</li>
<li>rank 0 initializes the master, the other ranks initialize workers</li>
<li>master: create sockets for all workers -> wait for all workers to connect -> send them each the info about location of other processes</li>
<li>worker: create socket to master, send own info, receive info about each worker, and then handshake with each of them</li>
<li>By this point, every process has completed a handshake with every other process.</li>
</ol>
<h3 id="acknowledgements">Acknowledgements</h3>
<ul>
<li>PyTorch docs + well written tests.</li>
</ul>
<h3 id="questions">Questions</h3>
<ul>
<li>Why scatter_send/recv and gather_send/recv? And why no gather() / scatter()?</li>
<li>How to get started with Gloo? Does it support point-to-point?</li>
</ul>
Sebastien Arnold
An Introduction to Distributed Deep Learning
2017-09-01T00:00:00+00:00
https://shalab.usc.edu/introduction-to-distributed-deep-learning
<p><strong>Tip</strong>: This article is also <a href="http://seba1511.com/dist_blog/article.pdf">available in PDF</a> (without animations).</p>
<h1 id="introduction">Introduction</h1>
<p>This blog post introduces the fundamentals of distributed deep learning and presents some real-world applications. With the democratisation of deep learning methods in the last decade, large - and small! - companies have invested a lot of effort into distributing the training procedure of neural networks. Their hope: drastically reduce the time to train large models on even larger datasets. Unfortunately, while every commercial product takes advantage of these techniques, it is still difficult for practitioners and researchers to use them in their everyday projects. This article aims to change that by providing a theoretical and practical overview.</p>
<h1 id="the-problem">The Problem</h1>
<!--
* Introduce formalism and SGD
* Variants of SGD
-->
<h2 id="formulation-and-stochastic-gradient-descent">Formulation and Stochastic Gradient Descent</h2>
<p>Let’s first define the problem that we would like to solve. We are trying to train a neural network to solve a supervised task. This task could be anything from classifying images to playing Atari games or predicting the next word of a sentence. To do that, we’ll rely on an algorithm - and its variants - from the mathematical optimization literature: <strong>stochastic gradient descent</strong>. Stochastic gradient descent (SGD) works by computing the gradient direction of the loss function we are trying to minimize with respect to the current parameters of the model. Once we know the gradient direction - aka the direction of greatest increase - we take a step in the opposite direction, since we are trying to minimize the final error.</p>
<p>More formally, we can represent our dataset as a distribution $\chi$ from which we sample $N$ tuples of inputs and labels $(x_i, y_i) \sim \chi$. Then, given a loss function $\mathcal{L}$ (some common choices include the <a href="https://en.wikipedia.org/wiki/Mean_squared_error">mean squared error</a>, the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL divergence</a>, or the negative log-likelihood) we want to find the optimal set of weights $W_{opt}$ of our deep model $F$. That is,</p>
<script type="math/tex; mode=display">W_{opt} = \arg \min_{W} \mathbb{E}_{(x, y) \sim \chi}[\mathcal{L}(y, F(x; W))]</script>
<p><strong>Note</strong>
In the above formulation we are not separating the dataset into train, validation, and test sets. However, you do need to do it in practice!</p>
<p>In this case, SGD will iteratively update the weights $W_t$ at timestep $t$ with $W_{t+1} = W_t - \alpha \cdot \nabla_{W_t} \mathcal{L}(y_i, F(x_i; W_t))$. Here, $\alpha$ is the learning rate and can be interpreted as the size of the step we are taking in the direction of the negative gradient. As we will see later, there are algorithms that try to adaptively set the learning rate, but generally speaking it needs to be chosen by the human experimenter.</p>
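<p>The update rule can be written out directly. A minimal plain-Python sketch (treating the weights and the gradient as flat lists of numbers, with a hypothetical <code>sgd_step</code> helper):</p>

```python
def sgd_step(weights, grads, alpha=0.1):
    # W_{t+1} = W_t - alpha * grad
    return [w - alpha * g for w, g in zip(weights, grads)]
```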
<p>One important thing to note is that in practice the gradient is evaluated over a set of samples called the minibatch. This is done by averaging the gradient of the loss for each sample in the minibatch. Taking the gradient over the minibatch helps in two ways.</p>
<ol>
<li>It can be efficiently computed by <a href="https://goparallel.sourceforge.net/vectorization-feeds-need-speed/">vectorizing</a> the computations.</li>
<li>It allows us to obtain a better approximation of the <em>true</em> gradient of $\mathcal{L}(y, F(x; W))$ over $\chi$, and thus makes us converge faster.</li>
</ol>
<p>However, a very large batch size will simply result in computational overhead, since the gradient estimate will not significantly improve. Therefore, it is usual to keep it between 32 and 1024 samples, even when the dataset contains millions of examples.</p>
<h2 id="variants-of-sgd">Variants of SGD</h2>
<p>As we will now see, several variants of the gradient descent algorithm exist. They all try to improve the quality of the gradient by including more or less sophisticated heuristics. For a more in-depth treatment, I would recommend <a href="http://sebastianruder.com/optimizing-gradient-descent/">Sebastian Ruder’s excellent blog post</a> and the <a href="http://cs231n.github.io/neural-networks-3/">CS231n web page</a> on optimization.</p>
<h3 id="adding-momentum">Adding Momentum</h3>
<p>Momentum techniques simply keep track of a weighted average of previous updates, and apply it to the current one. This is akin to the momentum gained by a ball rolling downhill. In the following formulas, $\mu$ is the momentum parameter - how much of the previous updates we want to include in the current one.</p>
<table>
<thead>
<tr>
<th>Momentum</th>
<th>Nesterov Momentum or Accelerated Gradient @nesterov</th>
</tr>
</thead>
<tbody>
<tr>
<td><script type="math/tex">v_{t+1} = \mu \cdot v_t + \alpha \cdot \nabla \mathcal{L}</script> <script type="math/tex">W_{t+1} = W_t - v_{t+1}</script></td>
<td><script type="math/tex">v_{t+1} = \mu \cdot (\mu \cdot v_t + \alpha \cdot \nabla \mathcal{L}) + \alpha \cdot \nabla \mathcal{L}</script> <script type="math/tex">W_{t+1} = W_t - v_{t+1}</script></td>
</tr>
</tbody>
</table>
<p>Table: Momentum Flavors of SGD</p>
<p>Nesterov’s accelerated gradient adds <em>momentum to the momentum</em> in an attempt to look ahead for what is coming.</p>
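<p>The plain momentum column of the table can be sketched in a few lines of Python (a hypothetical <code>momentum_step</code> helper, again treating tensors as flat lists):</p>

```python
def momentum_step(weights, velocity, grads, alpha=0.1, mu=0.9):
    # v_{t+1} = mu * v_t + alpha * grad
    v_next = [mu * v + alpha * g for v, g in zip(velocity, grads)]
    # W_{t+1} = W_t - v_{t+1}
    w_next = [w - v for w, v in zip(weights, v_next)]
    return w_next, v_next
```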
<h3 id="adaptive-learning-rates">Adaptive Learning Rates</h3>
<p>Finding good learning rates can be an expensive process, and a skill often deemed closer to art or dark magic. The following techniques try to alleviate this problem by automatically setting the learning rate, sometimes on a per-parameter basis. The following descriptions are inspired by <a href="http://neon.nervanasys.com/index.html/optimizers.html">Nervana’s implementation</a>.</p>
<p><strong>Note</strong>
In the following formulas, $\epsilon$ is a constant to ensure numerical stability, and $\mu$ is the decay constant of the algorithm - how fast we decrease the learning rate as we converge.</p>
<table>
<thead>
<tr>
<th>Adagrad @adagrad</th>
<th>RMSProp @rmsprop</th>
</tr>
</thead>
<tbody>
<tr>
<td><script type="math/tex">s_{t+1} = s_t + (\nabla \mathcal{L})^2</script> <script type="math/tex">W_{t+1} = W_t - \frac{\alpha \cdot \nabla \mathcal{L}}{\sqrt{s_{t+1} + \epsilon}}</script></td>
<td><script type="math/tex">s_{t+1} = \mu \cdot s_t + (1 - \mu) \cdot (\nabla \mathcal{L})^2</script> <script type="math/tex">W_{t+1} = W_t - \frac{\alpha \cdot \nabla \mathcal{L}}{\sqrt{s_{t+1} + \epsilon} + \epsilon}</script></td>
</tr>
<tr>
<td><strong>Adadelta @adadelta</strong></td>
<td><strong>Adam @adam</strong></td>
</tr>
<tr>
<td><script type="math/tex">\lambda_{t+1} = \lambda_t \cdot \mu + (1 - \mu) \cdot (\nabla \mathcal{L})^2</script> <script type="math/tex">\Delta W_{t+1} = \nabla \mathcal{L} \cdot \sqrt{\frac{\delta_{t} + \epsilon}{\lambda_{t+1} + \epsilon}}</script> <script type="math/tex">\delta_{t+1} = \delta_t \cdot \mu + (1 - \mu) \cdot (\Delta W_{t+1})^2</script> <script type="math/tex">W_{t+1} = W_t - \Delta W_{t+1}</script></td>
<td><script type="math/tex">m_{t+1} = m_t \cdot \beta_m + (1 - \beta_m) \cdot \nabla \mathcal{L}</script> <script type="math/tex">v_{t+1} = v_t \cdot \beta_v + (1 - \beta_v) \cdot (\nabla \mathcal{L})^2</script> <script type="math/tex">l_{t+1} = \alpha \cdot \frac{\sqrt{1 - \beta_v^p}}{1 - \beta_m^p}</script> <script type="math/tex">W_{t+1} = W_t - l_{t+1} \frac{m_{t+1}}{\sqrt{v_{t+1}} + \epsilon}</script></td>
</tr>
</tbody>
</table>
<p>Table: Adaptively Scaling the Learning Rate</p>
<p>Where $p$ is the current epoch, that is 1 + the number of passes through the dataset.</p>
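<p>As an illustration, the RMSProp column of the table translates to the following plain-Python sketch (a hypothetical <code>rmsprop_step</code> helper, keeping the same placement of $\epsilon$ as in the formula above; for real training you would reach for an optimizer such as <code>torch.optim.RMSprop</code>):</p>

```python
import math

def rmsprop_step(weights, s, grads, alpha=0.01, mu=0.9, eps=1e-6):
    # s_{t+1} = mu * s_t + (1 - mu) * grad^2
    s_next = [mu * si + (1.0 - mu) * g * g for si, g in zip(s, grads)]
    # W_{t+1} = W_t - alpha * grad / (sqrt(s_{t+1} + eps) + eps)
    w_next = [w - alpha * g / (math.sqrt(si + eps) + eps)
              for w, g, si in zip(weights, grads, s_next)]
    return w_next, s_next
```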
<h3 id="conjugate-gradients">Conjugate Gradients</h3>
<p>The following method tries to estimate the second order derivative of the loss function. This second order derivative - the Hessian $H$ - is most notably used in Newton’s algorithm ($W_{t+1} = W_t - \alpha \cdot H^{-1}\nabla \mathcal{L}$) and gives extremely useful information about the curvature of the loss function. Properly estimating the Hessian (and its inverse) has long been a challenging task, since the Hessian is composed of $\lvert W \rvert^2$ terms. For more information I’d recommend these papers [@dauphin;@choromanska;@martens] and chapter 8.2 of the deep learning book @dlbook. The following description was inspired by Wright and Nocedal @optibook.</p>
<p><script type="math/tex">p_{t+1} = \beta_{t+1} \cdot p_t - \nabla \mathcal{L}</script>
<script type="math/tex">W_{t+1} = W_t + \alpha \cdot p_{t+1}</script></p>
<p>Where $\beta_{t+1}$ can be computed by the Fletcher-Reeves or Hestenes-Stiefel methods. (Notice the subscripts of the gradients.)</p>
<table>
<thead>
<tr>
<th>Fletcher-Reeves</th>
<th>Hestenes-Stiefel</th>
</tr>
</thead>
<tbody>
<tr>
<td><script type="math/tex">\beta_{t+1} = \frac{\nabla_{W_{t}}\mathcal{L}^T \cdot \nabla_{W_{t}}\mathcal{L}}{\nabla_{W_{t-1}}\mathcal{L}^T \cdot \nabla_{W_{t-1}}\mathcal{L}}</script></td>
<td><script type="math/tex">\beta_{t+1} = \frac{\nabla_{W_{t}}\mathcal{L}^T \cdot (\nabla_{W_{t}}\mathcal{L} - \nabla_{W_{t-1}}\mathcal{L})}{(\nabla_{W_{t}}\mathcal{L} - \nabla_{W_{t-1}}\mathcal{L})^T \cdot p_t}</script></td>
</tr>
</tbody>
</table>
<p>Table: Compute the Non-linear Conjugate Direction</p>
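As a toy illustration, the recursion with the Fletcher-Reeves $\beta$ can be sketched in plain Python (a fixed step size $\alpha$ stands in for the line search a real implementation would use; all names here are mine):

```python
def fletcher_reeves_beta(grad_new, grad_old):
    # beta = (g_t . g_t) / (g_{t-1} . g_{t-1}), as in the table above
    return sum(g * g for g in grad_new) / sum(g * g for g in grad_old)

def cg_minimize(grad_fn, w, alpha=0.1, steps=40):
    g = grad_fn(w)
    p = [-gi for gi in g]  # first direction: steepest descent
    for _ in range(steps):
        # W_{t+1} = W_t + alpha * p_{t+1}
        w = [wi + alpha * pi for wi, pi in zip(w, p)]
        g_new = grad_fn(w)
        beta = fletcher_reeves_beta(g_new, g)
        # p_{t+1} = beta_{t+1} * p_t - grad
        p = [beta * pi - gi for pi, gi in zip(p, g_new)]
        g = g_new
    return w
```

On a simple quadratic loss such as $\sum_i w_i^2$ this converges to the minimum even without a line search.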
<h1 id="beyond-sequentiallity">Beyond Sequentiality</h1>
<!--
* Introduce sync and async, nsync
* Introduce architectures and tricks to make it faster (quantization, residuals, ...) (parameter server, mpi, etc...)
* Tricky points
* Implementation
* FC, Convs, and RNNs
* Benchmarks
* Introduce Hogwild + async begets momentum
* Distributed Synthetic Gradients
* The case of RL: Naive, Gorila, A3C, HPC Policy Gradients
-->
<p>Let’s now delve into the core of this article: distributing deep learning. As mentioned above, when training <a href="https://openreview.net/forum?id=B1ckMDqlg">really deep models</a> on <a href="https://github.com/openimages/dataset">really large datasets</a> we need to add more parallelism to our computations. Distributing linear algebra operations on GPUs is not enough anymore, and researchers have begun to explore how to use multiple machines. That’s when deep learning met <em>High-Performance Computing</em> (HPC).</p>
<h2 id="synchronous-vs-asynchronous">Synchronous vs Asynchronous</h2>
<p>There are two approaches to parallelizing the training of neural networks: model parallelism and data parallelism. Model parallelism consists of “breaking” the learning model and placing those “parts” on different computational nodes. For example, we could put the first half of the layers on one GPU, and the other half on a second one. Alternatively, we could split layers down the middle and assign each half to a separate GPU. While appealing, this approach is rarely used in practice because of the high communication latency between devices. Since I am not very familiar with model parallelism, I’ll focus the rest of the blog post on data parallelism.</p>
<p>Data parallelism is rather intuitive; the data is partitioned across computational devices, and each device holds a copy of the learning model - called a replica or sometimes worker. Each replica computes gradients on its shard of the data, and the gradients are combined to update the model parameters. Different ways of combining gradients lead to different algorithms and results, so let’s have a closer look.</p>
<h2 id="synchronous-distributed-sgd">Synchronous Distributed SGD</h2>
<p>In the synchronous setting, all replicas average all of their gradients at every timestep (minibatch). Doing so, we’re effectively multiplying the batch size $M$ by the number of replicas $R$, so that our <strong>overall minibatch</strong> size is $B_G = R \cdot M$. This has several advantages.</p>
<ol>
<li>The computation is completely deterministic.</li>
<li>We can work with fairly large models and large batch sizes even on memory-limited GPUs.</li>
<li>It’s very simple to implement, and easy to debug and analyze.</li>
</ol>
<p><img src="http://seba-1511.github.io/dist_blog/figs/sync.gif" alt="" /></p>
<p>This path to parallelism puts a strong emphasis on HPC and the hardware in use. In fact, it will be challenging to obtain a decent speedup unless you are using industrial-grade hardware. And even then, the choice of communication library, reduction algorithm, and other implementation details (e.g., data loading and transformation, model size, …) will have a strong effect on the kind of performance gain you will see.</p>
<p>The following pseudo-code describes synchronous distributed SGD at the replica level, for $R$ replicas, $T$ timesteps, and a global batch size of $M$.</p>
<pre><code class="language-algo">\begin{algorithm}
\caption{Synchronous SGD}
\begin{algorithmic}
\While{$t < T$}
\State Get: a minibatch $(x, y) \sim \chi$ of size $M/R$.
\State Compute: $\nabla \mathcal{L}(y, F(x; W_t))$ on local $(x, y)$.
\State AllReduce: sum all $\nabla \mathcal{L}(y, F(x; W_t))$ across replicas into $\Delta W_t$
\State Update: $W_{t+1} = W_t - \alpha \frac{\Delta W_t}{R}$
\State $t = t + 1$
\State (Optional) Synchronize: $W_{t+1}$ to avoid numerical errors
\EndWhile
\end{algorithmic}
\end{algorithm}
</code></pre>
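To see what a single iteration of the loop above does, here is an in-process simulation in plain Python: the explicit sum over replicas stands in for the allreduce, which in a real setup would be a collective call from, e.g., MPI or <code>torch.distributed</code> (the function name and list-of-lists layout are my own):

```python
def sync_sgd_step(weights, local_grads, lr):
    """One synchronous SGD step. `local_grads` holds one gradient
    list per replica, computed on that replica's shard of the batch."""
    R = len(local_grads)
    # AllReduce: sum the per-replica gradients coordinate-wise
    summed = [sum(g[i] for g in local_grads) for i in range(len(weights))]
    # Update with the gradient averaged over the R replicas
    return [w - lr * s / R for w, s in zip(weights, summed)]

# Two replicas, two parameters
new_w = sync_sgd_step([1.0, 2.0], [[0.2, 0.4], [0.6, 0.0]], lr=0.5)
```

Because every replica applies the exact same averaged gradient, all copies of the model stay identical at every timestep (up to floating-point reduction order).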
<h2 id="asynchronous-distributed-sgd">Asynchronous Distributed SGD</h2>
<p>The asynchronous setting is slightly more interesting from a mathematical perspective, and slightly trickier to implement in practice. Each replica will now access a shared-memory space, where the global parameters $W_t^G$ are stored. After copying the parameters into its local memory $W_t^L$, it will compute the gradients $\nabla \mathcal{L}$ and the update $\Delta W_t^L$ with respect to its local copy $W_t^L$. The final step is to apply $\Delta W_t^L$ to the global parameters in shared memory.</p>
<pre><code class="language-algo">\begin{algorithm}
\caption{Asynchronous SGD}
\begin{algorithmic}
\While{$t < T$}
\State Get: a minibatch $(x, y) \sim \chi$ of size $M/R$.
\State Copy: Global $W_t^G$ into local $W_t^L$.
\State Compute: $\nabla \mathcal{L}(y, F(x; W_t^L))$ on $(x, y)$.
\State Set: $\Delta W_t^L = \alpha \cdot \nabla \mathcal{L}(y, F(x; W_t^L))$
\State Update: $W_{t+1}^G = W_t^G - \Delta W_t^L$
\State $t = t + 1$
\EndWhile
\end{algorithmic}
\end{algorithm}
</code></pre>
<p>The advantage of adding asynchrony to our training is that replicas can work at their own pace, without waiting for others to finish computing their gradients. However, this is also where the trickiness resides; we have no guarantee that while one replica is computing the gradients with respect to a set of parameters, the global parameters will not have been updated by another one. If this happens, the global parameters will be updated with <strong>stale</strong> gradients - gradients computed with old versions of the parameters.</p>
<p><img src="http://seba-1511.github.io/dist_blog/figs/async.gif" alt="" /></p>
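The staleness problem is easy to reproduce in a tiny single-process simulation (the event-schedule formulation below is my own toy, not how a real implementation is structured): two replicas read the global value before either has updated it, so the second update is computed on parameters that are already out of date.

```python
def async_updates(w_global, replica_grad_fns, lr, schedule):
    """Replay a schedule of (replica_id, action) events on a scalar
    parameter; 'read' copies the global value into the replica's local
    memory, 'update' applies a gradient computed on that local copy."""
    local = {}
    for rid, action in schedule:
        if action == "read":
            local[rid] = w_global                     # W^L <- W^G
        else:
            grad = replica_grad_fns[rid](local[rid])  # possibly stale!
            w_global = w_global - lr * grad           # W^G <- W^G - lr * grad
    return w_global

grads = {0: lambda w: 2 * w, 1: lambda w: 2 * w}      # loss w^2 on both replicas
racy = async_updates(1.0, grads, 0.1,
                     [(0, "read"), (1, "read"), (0, "update"), (1, "update")])
serial = async_updates(1.0, grads, 0.1,
                       [(0, "read"), (0, "update"), (1, "read"), (1, "update")])
```

The two schedules disagree (roughly 0.6 versus 0.64 here), purely because replica 1 used a stale copy of the parameters in the first one.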
<p>In order to counter the effect of staleness, Zhang et al. @staleness-aware suggested dividing the gradients by their staleness. By limiting the impact of very stale gradients, they are able to obtain convergence almost identical to that of a synchronous system. In addition, they also proposed a generalization of synchronous and asynchronous SGD named <em>$n$-softsync</em>. In this case, updates to the shared global parameters are applied in batches of $n$. Note that $n = 1$ is our asynchronous training, while $n = R$ is synchronous. A related alternative named <em>backup workers</em> was suggested by Chen et al. @backup-workers in the summer of 2016.</p>
<p>Finally, there is another view of asynchronous training that is less often explored in the literature. Each replica executes $k$ optimization steps locally, and keeps an aggregation of its updates. Once those $k$ steps are executed, all replicas synchronize their aggregated updates and apply them to the parameters as they were before the $k$ steps. This approach is best used with <a href="https://github.com/twitter/torch-distlearn/blob/master/lua/AllReduceEA.md">Elastic Averaging SGD</a> @easgd, and limits the frequency at which replicas need to communicate.</p>
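A bare caricature of this $k$-local-steps scheme in plain Python (not EASGD itself, which additionally pulls replicas toward a moving center; names and structure are my own):

```python
def k_step_sync(w0, replica_grad_fns, lr, k):
    """Each replica takes k local SGD steps from the shared value w0,
    then the aggregated updates are averaged and applied to w0."""
    deltas = []
    for grad in replica_grad_fns:
        w = w0
        for _ in range(k):
            w = w - lr * grad(w)
        deltas.append(w0 - w)          # this replica's aggregated update
    # One communication round pays for k local steps
    return w0 - sum(deltas) / len(deltas)

w_new = k_step_sync(1.0, [lambda w: 2 * w, lambda w: 2 * w], lr=0.25, k=2)
```

With $k = 1$ this reduces to the synchronous update; larger $k$ trades gradient freshness for less communication.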
<h2 id="implementation">Implementation</h2>
<p>Now that we have a decent understanding of the mechanics of distributed deep learning, let’s explore possible implementations.</p>
<h3 id="parameter-server-vs-tree-reductions">Parameter Server vs Tree-reductions</h3>
<p>The first decision to make is how to set up the architecture of the system. Here, we mainly have two options: parameter servers or tree-reductions. In the parameter server case, one machine is responsible for holding and serving the global parameters to all replicas. As presented in @downpour, there can be several servers holding different parameters of the model to avoid contention, and they can themselves be hierarchically connected (e.g., tree-shaped in @rudra). One advantage of parameter servers is that it is easy to implement different levels of asynchrony.</p>
<p><img src="http://seba-1511.github.io/dist_blog/figs/ps.png" alt="" /></p>
<p>However, as discussed in @firecaffe, parameter servers tend to be slower and don’t scale as well as tree-reduction architectures. By tree-reduction, I mean an infrastructure where collective operations are executed without a higher-level manager process. The message-passing interface (MPI) and its collective communication operations are typical examples. I particularly appreciate this setting given that it stays close to the math, and it enables a lot of engineering optimizations. For example, one could choose the reduction algorithm based on the network topology, include specialized device-to-device communication routines, and truly take advantage of fast interconnect hardware. One caveat: I haven’t (yet) come across a good asynchronous implementation based on tree-reductions.</p>
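For intuition, the parameter-server side can be caricatured as a tiny in-process object that replicas pull from and push to (real servers shard the parameters, batch requests, and handle the network transport, none of which appears in this sketch):

```python
class ParameterServer:
    """Toy fully-asynchronous parameter server: every pushed gradient
    is applied immediately (n-softsync with n = 1 in the terminology above)."""

    def __init__(self, weights, lr):
        self.weights = list(weights)
        self.lr = lr

    def pull(self):
        # A replica fetches the current global parameters
        return list(self.weights)

    def push(self, grads):
        # A replica sends its gradients; the server applies them as they arrive
        self.weights = [w - self.lr * g for w, g in zip(self.weights, grads)]
```

Batching pushed gradients before applying them would give the softer synchronization levels discussed earlier.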
<h3 id="layer-types">Layer Types</h3>
<p>In a nutshell, all layer types can be supported with a single implementation. After the forward pass, we can compute the gradients of our model and then allreduce them. In particular, nothing special needs to be done for recurrent networks, as long as we include the gradients of <strong>all</strong> parameters of the model (e.g., the biases, and $\gamma, \beta$ for batch normalization).</p>
<p>A few aspects should nonetheless influence the design of your distributed model. The main one is to (appropriately) consider convolutions. They parallelize particularly well given that they are quite compute-heavy relative to the number of parameters they contain. This is a desirable property, since you want to limit the time spent in communication - pure overhead - as opposed to computation. In addition to being particularly good with spatially-correlated data, convolutions achieve just that by reapplying the same small set of weights all over the input. More details on how to parallelize convolutional (and fully-connected) layers are available in @weird-trick. Another point to consider is using momentum-based optimizers with residuals and quantized weights. We will explore this trick in the next subsection.</p>
<h3 id="tricks">Tricks</h3>
<p>Over the years, a few tricks were engineered to reduce the overhead induced by communicating and synchronizing updates. I am aware of the following short and non-exhaustive list. If you know more, please let me know!</p>
<h4 id="device-to-device-communication">Device-to-Device Communication</h4>
<p>When using GPUs, one important detail is to ensure that memory transfers are done device-to-device. Avoiding the detour through host memory is not always easy, but <a href="https://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi/">more</a> and <a href="https://github.com/NVIDIA/nccl">more</a> libraries support it. Note that some GPU cards <sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> will not explicitly advertise support for GPU-GPU communication, but you can still get it to work.</p>
<h4 id="overlapping-computation">Overlapping Computation</h4>
<p>If you are using neural networks like the rest of us, you backpropagate the gradients. A good idea, then, is to start synchronizing the gradients of the current layer while computing the gradients of the next one.</p>
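The idea can be sketched with Python threads (the <code>allreduce</code> argument stands in for whatever collective your framework exposes; real implementations hook directly into the autograd engine rather than iterating over a list of gradients):

```python
import threading

def backward_with_overlap(layer_grads, allreduce):
    """As each layer's gradient becomes available (last layer first, as in
    backprop), start reducing it on a background thread so the communication
    overlaps with the computation of the next layer's gradient."""
    reduced, threads = {}, []
    for name, grad in reversed(layer_grads):
        t = threading.Thread(
            target=lambda n=name, g=grad: reduced.__setitem__(n, allreduce(g)))
        t.start()                  # reduction of this layer begins immediately
        threads.append(t)
        # ... the next layer's gradient would be computed here, in parallel ...
    for t in threads:
        t.join()                   # wait for all reductions before the update
    return reduced

# Pretend-allreduce for 2 replicas holding identical gradients: just double them
out = backward_with_overlap([("conv1", 1.0), ("fc", 2.0)], lambda g: 2 * g)
```

In practice the payoff depends on how much computation is left to hide the communication behind, which is why the last layers benefit the most.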
<h4 id="quantized-gradients">Quantized Gradients</h4>
<p>Instead of communicating the gradients at full floating-point precision, we can use reduced precision. Tim Dettmers @quant-8bit suggests an algorithm to do so, while Nikko Strom @quantized only transmits gradients whose magnitude exceeds a certain threshold. This gives him sparse gradients - which he compresses - and, in order to keep part of the information discarded at each minibatch, he builds a <em>residual</em>. This allows even small weight updates to eventually happen, but delays them a little.</p>
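Here is a simplified plain-Python rendering of that residual idea (threshold quantization to $\pm\tau$; this is my own stripped-down version, not Strom’s exact scheme, which also compresses the sparse messages):

```python
def quantize_with_residual(grad, residual, tau):
    """Send only +/-tau for entries whose (gradient + carried residual)
    magnitude exceeds tau; keep everything else in the residual so that
    small updates are delayed rather than lost."""
    to_send, new_residual = [], []
    for g, r in zip(grad, residual):
        v = g + r
        if v >= tau:
            to_send.append(tau)
            new_residual.append(v - tau)
        elif v <= -tau:
            to_send.append(-tau)
            new_residual.append(v + tau)
        else:
            to_send.append(0.0)    # nothing transmitted for this entry
            new_residual.append(v)
    return to_send, new_residual

sent, res = quantize_with_residual([0.5, 0.1, -0.7], [0.0, 0.0, 0.0], tau=0.4)
```

The second entry is not sent at all this round, but its value survives in the residual and will cross the threshold once enough small gradients accumulate.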
<h4 id="reduction-algorithm">Reduction Algorithm</h4>
<p>As mentioned above, different reduction algorithms work best with different PCIe / network topologies. (E.g., ring, butterfly, slimfly, ring-segmented.) [@deepspeech; @opti-mpich; @slimfly; @ring-segmented]</p>
<h3 id="benchmarks">Benchmarks</h3>
<p>The last implementation detail I would like to mention is how to effectively benchmark a distributed framework. There is a ferocious battle among framework developers over who is fastest, and reported results can be a bit confusing. In my opinion, since we are trying to mimic the behaviour of a sequential implementation, we should be looking at scalability <strong>with a fixed overall batch size</strong> $B_G$. That means we observe the speedup (time to convergence, time per epoch/batch, loss error) as we increase the number of computational devices, but make sure to rescale the local batch size by the number of replicas such that $B_G = R \cdot M$ stays constant across experiments.</p>
<!--# Recent Advancements-->
<!--## Hogwild! and Asynchronous Momentum -->
<!--## Distributed Synthetic Gradients-->
<!--## Distributed Reinforcement Learning-->
<!--# Benchmarks-->
<!--* toy problems-->
<!--* mnist -->
<!--* cifar10-->
<!--# A Live Example-->
<h1 id="conclusion">Conclusion</h1>
<p>Harnessing the power of distributed deep learning is not as difficult as it seems, and can lead to drastic performance increases. This power should be available to everyone, not just large industrial companies. In addition, a good understanding of how parallelized learning works might allow you to take advantage of some nice properties that would be hard to replicate in a sequential setup. Finally, I hope you learned something new through this article or, at least, have been directed to some interesting papers.</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>I’d like to thank Prof. Chunming Wang, Prof. Valero-Cuevas, and Pranav Rajpurkar for comments on the article and helpful discussions. I would also like to thank Prof. Crowley for supervising the semester that allowed me to write this work.</p>
<h2 id="citation">Citation</h2>
<p>Please cite this article as</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Arnold, Sébastien "An Introduction to Distributed Deep Learning", seba1511.com, 2016.
</code></pre></div></div>
<h4 id="bibtex">BibTeX</h4>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> @misc{arnold2016ddl,
author = {Arnold, Sébastien},
title = {An Introduction to Distributed Deep Learning},
year = {2016},
howpublished = {https://seba1511.com/dist_blog/}
}
</code></pre></div></div>
<h1 id="references">References</h1>
<p>Some of the relevant literature for this article. <br /></p>
<!--http://www.benfrederickson.com/numerical-optimization/-->
<!--http://lossfunctions.tumblr.com/-->
<!--http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html-->
<!--https://www.allinea.com/blog/201610/deep-learning-episode-3-supercomputer-vs-pong-->
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I know that it is possible with GeForce 980s, 1080s, and both Maxwell and Pascal Titan Xs. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Seb ArnoldTip: This article is also available in PDF. (without animations)First Entry, with math2017-08-12T00:00:00+00:002017-08-12T00:00:00+00:00https://shalab.usc.edu/edge%20case/first_entry<p>Proof that math works: $\sum_0^\infty x = 0$.</p>
<p>Nested and mixed lists are an interesting beast. It’s a corner case to make sure that</p>
<ul>
<li>Lists within lists do not break the ordered list numbering order</li>
<li>Your list styles go deep enough.</li>
</ul>
<h3 id="ordered--unordered--ordered">Ordered – Unordered – Ordered</h3>
<ol>
<li>ordered item</li>
<li>ordered item
<ul>
<li><strong>unordered</strong></li>
<li><strong>unordered</strong></li>
</ul>
<ol>
<li>ordered item</li>
<li>ordered item</li>
</ol>
</li>
<li>ordered item</li>
<li>ordered item</li>
</ol>
<h3 id="ordered--unordered--unordered">Ordered – Unordered – Unordered</h3>
<ol>
<li>ordered item</li>
<li>ordered item
<ul>
<li><strong>unordered</strong></li>
<li><strong>unordered</strong></li>
</ul>
<ul>
<li>unordered item</li>
<li>unordered item</li>
</ul>
</li>
<li>ordered item</li>
<li>ordered item</li>
</ol>
<h3 id="unordered--ordered--unordered">Unordered – Ordered – Unordered</h3>
<ul>
<li>unordered item</li>
<li>unordered item
<ol>
<li>ordered</li>
<li>ordered
<ul>
<li>unordered item</li>
<li>unordered item</li>
</ul>
</li>
</ol>
</li>
<li>unordered item</li>
<li>unordered item</li>
</ul>
<h3 id="unordered--unordered--ordered">Unordered – Unordered – Ordered</h3>
<ul>
<li>unordered item</li>
<li>unordered item
<ul>
<li>unordered</li>
<li>unordered
<ol>
<li><strong>ordered item</strong></li>
<li><strong>ordered item</strong></li>
</ol>
</li>
</ul>
</li>
<li>unordered item</li>
<li>unordered item</li>
</ul>ShaLabProof that math works: $\sum_0^\infty x = 0$.Post: Image (Linked with Caption)2010-08-06T00:00:00+00:002010-08-06T00:00:00+00:00https://shalab.usc.edu/post%20formats/post-image-linked-caption<figure>
<a href="https://flic.kr/p/8wzarA"><img src="https://farm5.staticflickr.com/4134/4940462712_7c28420b27_b.jpg" alt="Foo" /></a>
<figcaption>
Stairs? Where we’re going we don’t need no stairs.
</figcaption>
</figure>ShaLabPost: Image (with Link)2010-08-05T00:00:00+00:002010-08-05T00:00:00+00:00https://shalab.usc.edu/post%20formats/post-image-linked<p><a href="https://flic.kr/p/8ww3fZ"><img src="https://farm5.staticflickr.com/4073/4939853213_33ffc0290b_b.jpg" alt="foo" /></a></p>ShaLabPost: Image (Standard)2010-08-05T00:00:00+00:002010-08-05T00:00:00+00:00https://shalab.usc.edu/post%20formats/post-image-standard<p>The preferred way of using images is placing them in the <code class="highlighter-rouge">/images/</code> directory and referencing them with an absolute path. Prepending the filename with <code class="highlighter-rouge">{{ site.url }}{{ site.baseurl }}/images/</code> will make sure your images display properly in feeds and such.</p>
<p>Standard image with no width modifier classes applied.</p>
<p><strong>HTML:</strong></p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><img</span> <span class="na">src=</span><span class="s">"{{ site.url }}{{ site.baseurl }}/images/filename.jpg"</span> <span class="na">alt=</span><span class="s">""</span><span class="nt">></span>
</code></pre></div></div>
<p><strong>or Kramdown:</strong></p>
<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">![</span><span class="nv">alt</span><span class="p">](</span><span class="sx">{{</span> site.url }}{{ site.baseurl }}/images/filename.jpg)
</code></pre></div></div>
<p><img src="https://shalab.usc.edu/images/unsplash-image-9.jpg" alt="Unsplash image 9" /></p>
<p>Image that fills page content container by adding the <code class="highlighter-rouge">.full</code> class with:</p>
<p><strong>HTML:</strong></p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><img</span> <span class="na">src=</span><span class="s">"{{ site.url }}{{ site.baseurl }}/images/filename.jpg"</span> <span class="na">alt=</span><span class="s">""</span> <span class="na">class=</span><span class="s">"full"</span><span class="nt">></span>
</code></pre></div></div>
<p><strong>or Kramdown:</strong></p>
<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">![</span><span class="nv">alt</span><span class="p">](</span><span class="sx">{{</span> site.url }}{{ site.baseurl }}/images/filename.jpg)
{: .full}
</code></pre></div></div>
<p class="full"><img src="https://shalab.usc.edu/images/unsplash-image-10.jpg" alt="Unsplash image 10" /></p>ShaLabThe preferred way of using images is placing them in the /images/ directory and referencing them with an absolute path. Prepending the filename with {{ site.url }}{{ site.baseurl }}/images/ will make sure your images display properly in feeds and such.