.. _troubleshooting-salt-master:

===============================
Troubleshooting the Salt Master
===============================

Running in the Foreground
=========================

A great deal of information is available via the debug logging system. If you
are having issues with minions connecting or not starting, run the master in
the foreground:

.. code-block:: bash

    # salt-master -l debug

Anyone wanting to run Salt daemons via a process supervisor such as `monit`_,
`runit`_, or `supervisord`_ should omit the ``-d`` argument to the daemons and
run them in the foreground.

.. _`monit`: https://mmonit.com/monit/
.. _`runit`: http://smarden.org/runit/
.. _`supervisord`: http://supervisord.org/

What Ports does the Master Need Open?
=====================================

For the master, TCP ports 4505 and 4506 need to be open. If you've put both
your Salt master and minion in debug mode and don't see an acknowledgment
that your minion has connected, it could very well be a firewall interfering
with the connection. See our :ref:`firewall configuration
<firewall>` page for help opening the firewall on various
platforms.

If you've opened the correct TCP ports and still aren't seeing connections,
check that no additional access control system such as `SELinux`_ or
`AppArmor`_ is blocking Salt.

.. _`SELinux`: https://en.wikipedia.org/wiki/Security-Enhanced_Linux
.. _`AppArmor`: https://gitlab.com/apparmor/apparmor/-/wikis/home
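
One quick way to tell a blocked port from a Salt-level problem is to attempt a
plain TCP connection to ports 4505 and 4506 from the minion's host. The
following is a minimal sketch using only the Python standard library; the
function names are illustrative and are not part of Salt.

.. code-block:: python

    import socket

    SALT_MASTER_PORTS = (4505, 4506)  # the two TCP ports named above


    def port_is_reachable(host, port, timeout=3.0):
        """Return True if a TCP connection to host:port succeeds in time."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            return False


    def check_master_ports(host, ports=SALT_MASTER_PORTS, timeout=3.0):
        """Map each port to whether it accepted a TCP connection."""
        return {port: port_is_reachable(host, port, timeout) for port in ports}

Calling ``check_master_ports("salt.example.com")`` (a placeholder hostname)
from a minion returns a dictionary keyed by port. A ``False`` entry suggests a
firewall or access control system is in the way, or that nothing is listening
on that port.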

Too many open files
===================

The salt-master needs at least 2 sockets per host that connects to it, one for
the Publisher and one for the response port. Thus, large installations may, upon
scaling up the number of minions accessing a given master, encounter:

.. code-block:: console

    12:45:29,289 [salt.master ][INFO ] Starting Salt worker process 38
    Too many open files
    sock != -1 (tcp_listener.cpp:335)

The first step is to check the number of files allowed to be
opened by the user running salt-master (root by default):

.. code-block:: bash

    [root@salt-master ~]# ulimit -n
    1024

If this value is not equal to at least twice the number of minions, then it
will need to be raised. For example, in an environment with 1800 minions, the
``nofile`` limit should be set to no less than 3600. This can be done by
creating the file ``/etc/security/limits.d/99-salt.conf``, with the following
contents::

    root hard nofile 4096
    root soft nofile 4096

Replace ``root`` with the user under which the master runs, if different.

If your master does not have an ``/etc/security/limits.d`` directory, the lines
can simply be appended to ``/etc/security/limits.conf``.

As with any change to resource limits, it is best to stay logged into your
current shell and open another shell to run ``ulimit -n`` again and verify that
the changes were applied correctly. Additionally, if your master is running
upstart, it may be necessary to specify the ``nofile`` limit in
``/etc/default/salt-master`` if upstart isn't respecting your resource limits:

.. code-block:: text

    limit nofile 4096 4096

.. note::

    The above is simply an example of how to set these values, and you may
    wish to increase them even further if your Salt master is doing more than
    just running Salt.
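
The sizing rule above (two sockets per minion) can be checked with a few lines
of Python. This is a rough sketch, not a Salt utility: the helper names are
made up for illustration, and the ``resource`` module is only available on
Unix-like systems. It reads the limit of the process it runs in, so run it as
the same user, and from the same kind of session, as the master.

.. code-block:: python

    import resource  # Unix-only standard library module

    SOCKETS_PER_MINION = 2  # one for the Publisher, one for the response port


    def required_nofile(minion_count):
        """Minimum open-file limit implied by the two-sockets-per-minion rule."""
        return SOCKETS_PER_MINION * minion_count


    def nofile_is_sufficient(minion_count, soft_limit=None):
        """Compare a soft nofile limit with the minimum for minion_count.

        When soft_limit is omitted, the limit of the current process is used.
        """
        if soft_limit is None:
            soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        if soft_limit == resource.RLIM_INFINITY:
            return True
        return soft_limit >= required_nofile(minion_count)

For the 1800-minion example, ``required_nofile(1800)`` gives 3600, so a soft
limit of 1024 fails the check and 4096 passes it.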

Salt Master Stops Responding
============================

There are known bugs with ZeroMQ versions less than 2.1.11 which can cause the
Salt master to not respond properly. If you're running a ZeroMQ version greater
than or equal to 2.1.9, you can work around the bug by setting the sysctls
``net.core.rmem_max`` and ``net.core.wmem_max`` to 16777216. Next, set the third
field in ``net.ipv4.tcp_rmem`` and ``net.ipv4.tcp_wmem`` to at least 16777216.

You can do it manually with something like:

.. code-block:: bash

    # echo 16777216 > /proc/sys/net/core/rmem_max
    # echo 16777216 > /proc/sys/net/core/wmem_max
    # echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_rmem
    # echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_wmem

Or with the following Salt state:

.. code-block:: yaml
    :linenos:

    net.core.rmem_max:
      sysctl:
        - present
        - value: 16777216

    net.core.wmem_max:
      sysctl:
        - present
        - value: 16777216

    net.ipv4.tcp_rmem:
      sysctl:
        - present
        - value: 4096 87380 16777216

    net.ipv4.tcp_wmem:
      sysctl:
        - present
        - value: 4096 87380 16777216
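
To confirm that the settings took effect, the values can be read back from
``/proc/sys`` and compared with 16777216. The sketch below assumes a Linux
host with procfs mounted in the usual place; the helper names are illustrative
only.

.. code-block:: python

    from pathlib import Path

    MIN_BUFFER_BYTES = 16777216  # value recommended in the text above


    def max_field(sysctl_value):
        """Return the last whitespace-separated field of a sysctl value.

        For net.core.rmem_max and net.core.wmem_max this is the only field;
        for net.ipv4.tcp_rmem and net.ipv4.tcp_wmem it is the third field.
        """
        return int(sysctl_value.split()[-1])


    def buffer_is_large_enough(sysctl_value, minimum=MIN_BUFFER_BYTES):
        """True if the maximum buffer size meets the recommended minimum."""
        return max_field(sysctl_value) >= minimum


    def read_sysctl(name, proc_root="/proc/sys"):
        """Read a sysctl such as 'net.core.rmem_max' from procfs."""
        return Path(proc_root, *name.split(".")).read_text().strip()

For example, ``buffer_is_large_enough(read_sysctl("net.ipv4.tcp_rmem"))``
reports whether the third field of ``net.ipv4.tcp_rmem`` is large enough.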

Live Python Debug Output
========================

If the master seems to be unresponsive, a SIGUSR1 can be passed to the
salt-master threads to display what piece of code is executing. This debug
information can be invaluable in tracking down bugs.

To pass a SIGUSR1 to the master, first make sure the master is running in the
foreground. Stop the service if it is running as a daemon, and start it in the
foreground like so:

.. code-block:: bash

    # salt-master -l debug

Then pass the signal to the master when it seems to be unresponsive:

.. code-block:: bash

    # killall -SIGUSR1 salt-master

When filing an issue or sending questions to the mailing list for a problem
with an unresponsive daemon, be sure to include this information if possible.
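
For readers unfamiliar with the idea of dumping stacks on a signal, the
following generic sketch shows the mechanism using the standard library's
``faulthandler`` module. It is not Salt's own implementation, and
``enable_stack_dump`` is a hypothetical name; it only illustrates how a Python
process can print what every thread is executing when it receives SIGUSR1
(Unix only).

.. code-block:: python

    import faulthandler
    import signal
    import sys


    def enable_stack_dump(stream=sys.stderr):
        """Dump the traceback of every thread to stream on each SIGUSR1.

        stream must be a real file with a file descriptor, and it must stay
        open for as long as the handler is registered.
        """
        faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)

After ``enable_stack_dump()`` has been called, sending SIGUSR1 to that process
writes one traceback per thread to standard error.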

Live Salt-Master Profiling
==========================

When faced with performance problems, you can turn on master process profiling by
sending it SIGUSR2:

.. code-block:: bash

    # killall -SIGUSR2 salt-master

This activates the ``yappi`` profiler inside the salt-master code. After some
time, send SIGUSR2 again to stop profiling and save the results to a file. If
run in the foreground, salt-master reports the filename for the results, which are
usually located under ``/tmp`` on Unix-based OSes and ``c:\temp`` on Windows.
Make sure you have yappi installed.

Results can then be analyzed with `kcachegrind`_ or a similar tool.

.. _`kcachegrind`: http://kcachegrind.sourceforge.net/html/Home.html

On Windows, in the absence of kcachegrind, a simple file-based workflow to create
profiling graphs could use `gprof2dot`_, `graphviz`_ and this batch file:

.. _`gprof2dot`: https://pypi.org/project/gprof2dot
.. _`graphviz`: https://graphviz.gitlab.io

.. code-block:: batch

    ::
    :: Converts callgrind* profiler output to *.pdf, via *.dot
    ::
    @echo off
    del *.dot.pdf
    for /r %%f in (callgrind*) do (
        echo "%%f"
        gprof2dot.exe -f callgrind --show-samples "%%f" -o "%%f.dot"
        dot.exe "%%f.dot" -Tpdf -O
        del "%%f.dot"
    )

Commands Time Out or Do Not Return Output
=========================================

Depending on your OS (this is most common on Ubuntu due to apt-get), a
:py:func:`state.apply <salt.modules.state.apply_>` or other long-running
command may sometimes not return output.

By default the timeout is set to 5 seconds. The timeout value can be
increased by modifying the ``timeout`` line within your ``/etc/salt/master``
configuration file.

Having keys accepted for Salt minions that no longer exist or are not reachable
also increases the possibility of timeouts, since the Salt master waits for
those systems to return command results.

Passing the -c Option to Salt Returns a Permissions Error
=========================================================

Using the ``-c`` option with the Salt command modifies the configuration
directory. When the configuration file is read, it still bases its data on
the ``root_dir`` setting. This can result in unintended behavior if you are
expecting files such as ``/etc/salt/pki`` to be pulled from the location
specified with ``-c``. Modify the ``root_dir`` setting to address this
behavior.

Salt Master Doesn't Return Anything While Running jobs
======================================================

When a command being run via Salt takes a very long time to return
(package installations, certain scripts, etc.), the master may drop you back
to the shell. In most situations the job is still running, but Salt has
exceeded the set timeout before returning. Querying the job queue will
provide the data of the job but is inconvenient. This can be resolved either
by using the ``-t`` option to set a longer timeout when running
commands (by default it is 5 seconds), or by setting the ``timeout`` value in
the master configuration file, ``/etc/salt/master``, to change the default
timeout for all commands and then restarting the salt-master service.

If a ``state.apply`` run takes too long, you can find a bottleneck by adding the
:py:mod:`--out=profile <salt.output.profile>` option.

Salt Master Auth Flooding
=========================

In large installations, care must be taken not to overwhelm the master with
authentication requests. Several options can be set on the master which
reduce the chance that an authentication flood causes an interruption in
service.

.. note::

    recon_default:
        The average number of seconds to wait between reconnection attempts.

    recon_max:
        The maximum number of seconds to wait between reconnection attempts.

    recon_randomize:
        A flag to indicate whether the recon_default value should be randomized.

    acceptance_wait_time:
        The number of seconds to wait for a reply to each authentication request.

    random_reauth_delay:
        The range of seconds across which the minions should attempt to randomize
        authentication attempts.

    auth_timeout:
        The total time to wait for the authentication process to complete, regardless
        of the number of attempts.

Running states locally
======================

To debug states, you can run them locally with ``salt-call``:

.. code-block:: bash

    salt-call -l trace --local state.highstate

The ``top.sls`` file is used to map which SLS modules get loaded onto which
minions via the state system.

It is located in the directory defined by the ``file_roots`` variable of the
Salt master configuration file, which is found at ``CONFIG_DIR/master``,
normally ``/etc/salt/master``.

The default configuration for ``file_roots`` is:

.. code-block:: yaml

    file_roots:
      base:
        - /srv/salt

So the top file defaults to the location ``/srv/salt/top.sls``.
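
The mapping from ``file_roots`` to the top file location can be written out
explicitly. This is a small illustrative sketch, not Salt code: it takes a
dictionary shaped like the ``file_roots`` setting and lists where ``top.sls``
would be looked for in one environment. The function name and the
``saltenv`` parameter name are assumptions made for the example.

.. code-block:: python

    import posixpath

    DEFAULT_FILE_ROOTS = {"base": ["/srv/salt"]}  # default shown above


    def top_file_candidates(file_roots=DEFAULT_FILE_ROOTS, saltenv="base"):
        """List the top.sls path under each root directory of one environment."""
        roots = file_roots.get(saltenv, [])
        return [posixpath.join(root, "top.sls") for root in roots]

With the default configuration this yields the single path
``/srv/salt/top.sls``, which is the first place to look when a local
highstate run reports that no top file was found.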

Salt Master Umask
=================

The salt master uses a cache to track jobs as they are published and as
returns come back. The recommended umask for a salt-master is ``022``, which
is the default for most users on a system. Incorrect umasks can result in
permission-denied errors when the master tries to access files in its cache.
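
The umask a process would inherit can be inspected from Python. The sketch
below is only an illustration with made-up helper names. It reports the umask
of the Python process itself, so it is meaningful only when started from the
same environment that launches the master.

.. code-block:: python

    import os

    RECOMMENDED_UMASK = 0o022


    def current_umask():
        """Return the process umask.

        os.umask() can only read the value by setting a new one, so the
        previous value is restored immediately. Not safe to call while other
        threads are creating files.
        """
        previous = os.umask(RECOMMENDED_UMASK)
        os.umask(previous)
        return previous


    def umask_is_recommended():
        """True if the current umask equals the recommended 022."""
        return current_umask() == RECOMMENDED_UMASK

Printing ``oct(current_umask())`` shows the value in the familiar octal form.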