- .. _troubleshooting-salt-master:
- ===============================
- Troubleshooting the Salt Master
- ===============================
- Running in the Foreground
- =========================
- A great deal of information is available via the debug logging system. If you
- are having issues with minions connecting or not starting, run the master in
- the foreground:
- .. code-block:: bash
-
-     # salt-master -l debug
- Anyone wanting to run Salt daemons via a process supervisor such as `monit`_,
- `runit`_, or `supervisord`_, should omit the ``-d`` argument to the daemons and
- run them in the foreground.
- .. _`monit`: https://mmonit.com/monit/
- .. _`runit`: http://smarden.org/runit/
- .. _`supervisord`: http://supervisord.org/
- What Ports Does the Master Need Open?
- =====================================
- For the master, TCP ports 4505 and 4506 need to be open. If you've put both
- your Salt master and minion in debug mode and don't see an acknowledgment
- that your minion has connected, it could very well be a firewall interfering
- with the connection. See our :ref:`firewall configuration
- <firewall>` page for help opening the firewall on various
- platforms.
- If you've opened the correct TCP ports and still aren't seeing connections,
- check that no additional access control system such as `SELinux`_ or
- `AppArmor`_ is blocking Salt.
- .. _`SELinux`: https://en.wikipedia.org/wiki/Security-Enhanced_Linux
- .. _`AppArmor`: https://gitlab.com/apparmor/apparmor/-/wikis/home
- Too many open files
- ===================
- The salt-master needs at least 2 sockets per host that connects to it: one for
- the Publisher and one for the response port. Large installations may therefore,
- as the number of minions accessing a given master grows, encounter:
- .. code-block:: console
-
-     12:45:29,289 [salt.master ][INFO ] Starting Salt worker process 38
-     Too many open files
-     sock != -1 (tcp_listener.cpp:335)
- To resolve this, check the number of files that the user running salt-master
- (root by default) is allowed to open:
- .. code-block:: bash
-
-     [root@salt-master ~]# ulimit -n
-     1024
- If this value is not equal to at least twice the number of minions, then it
- will need to be raised. For example, in an environment with 1800 minions, the
- ``nofile`` limit should be set to no less than 3600. This can be done by
- creating the file ``/etc/security/limits.d/99-salt.conf``, with the following
- contents::
-
-     root hard nofile 4096
-     root soft nofile 4096
- Replace ``root`` with the user under which the master runs, if different.
- If your master does not have an ``/etc/security/limits.d`` directory, the lines
- can simply be appended to ``/etc/security/limits.conf``.
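- The sizing rule above (at least two open files per minion) can also be checked
- from Python. The following is a minimal sketch and not part of Salt: the helper
- names ``required_nofile`` and ``limit_is_sufficient`` are hypothetical, and the
- standard-library ``resource`` module (Unix only) reports the limit of the
- current process, so it should be run as the user that runs the master.

```python
import resource


def required_nofile(minion_count, sockets_per_minion=2):
    """Return the minimum open-file limit for the given number of minions."""
    return minion_count * sockets_per_minion


def limit_is_sufficient(minion_count, soft_limit=None):
    """Check a soft nofile limit against the two-sockets-per-minion rule.

    If soft_limit is not given, the limit of the current process is used.
    """
    if soft_limit is None:
        soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return True  # an unlimited soft limit always satisfies the rule
    return soft_limit >= required_nofile(minion_count)


if __name__ == "__main__":
    minions = 1800  # example figure taken from the text above
    print("required nofile limit:", required_nofile(minions))
    print("current limit sufficient:", limit_is_sufficient(minions))
```

- With 1800 minions the required limit is 3600, so the example value of 4096
- passes the check while a limit of 1024 does not.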
- As with any change to resource limits, it is best to stay logged into your
- current shell and open another shell to run ``ulimit -n`` again and verify that
- the changes were applied correctly. Additionally, if your master is running
- upstart, it may be necessary to specify the ``nofile`` limit in
- ``/etc/default/salt-master`` if upstart isn't respecting your resource limits:
- .. code-block:: text
-
-     limit nofile 4096 4096
- .. note::
-
-     The above is simply an example of how to set these values, and you may
-     wish to increase them even further if your Salt master is doing more than
-     just running Salt.
- Salt Master Stops Responding
- ============================
- There are known bugs with ZeroMQ versions less than 2.1.11 which can cause the
- Salt master to not respond properly. If you're running a ZeroMQ version greater
- than or equal to 2.1.9, you can work around the bug by setting the sysctls
- ``net.core.rmem_max`` and ``net.core.wmem_max`` to 16777216. Next, set the third
- field in ``net.ipv4.tcp_rmem`` and ``net.ipv4.tcp_wmem`` to at least 16777216.
- You can do it manually with something like:
- .. code-block:: bash
-
-     # echo 16777216 > /proc/sys/net/core/rmem_max
-     # echo 16777216 > /proc/sys/net/core/wmem_max
-     # echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_rmem
-     # echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_wmem
- Or with the following Salt state:
- .. code-block:: yaml
-     :linenos:
-
-     net.core.rmem_max:
-       sysctl:
-         - present
-         - value: 16777216
-     net.core.wmem_max:
-       sysctl:
-         - present
-         - value: 16777216
-     net.ipv4.tcp_rmem:
-       sysctl:
-         - present
-         - value: 4096 87380 16777216
-     net.ipv4.tcp_wmem:
-       sysctl:
-         - present
-         - value: 4096 87380 16777216
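- To confirm that the four sysctls described above meet the 16777216 minimum,
- the values can be compared in Python. This is a minimal sketch under stated
- assumptions, not part of Salt: the helper names ``read_sysctl`` and
- ``too_small`` are hypothetical, and reading from ``/proc/sys`` works only on
- Linux.

```python
from pathlib import Path

MIN_BUFFER = 16777216  # minimum value recommended in the text above

SCALAR_KEYS = ("net.core.rmem_max", "net.core.wmem_max")
TRIPLE_KEYS = ("net.ipv4.tcp_rmem", "net.ipv4.tcp_wmem")


def read_sysctl(name):
    """Read a sysctl value from /proc/sys (Linux only)."""
    return Path("/proc/sys", *name.split(".")).read_text().strip()


def too_small(settings, minimum=MIN_BUFFER):
    """Return the names of the settings that are below the minimum.

    settings maps each sysctl name to its value as a string. For the
    tcp_rmem/tcp_wmem triples only the third field is compared.
    """
    bad = []
    for key in SCALAR_KEYS:
        if int(settings[key]) < minimum:
            bad.append(key)
    for key in TRIPLE_KEYS:
        if int(settings[key].split()[2]) < minimum:
            bad.append(key)
    return bad


if __name__ == "__main__":
    current = {key: read_sysctl(key) for key in SCALAR_KEYS + TRIPLE_KEYS}
    print("settings below the minimum:", too_small(current))
```

- An empty list means all four values already satisfy the workaround.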
- Live Python Debug Output
- ========================
- If the master seems to be unresponsive, a SIGUSR1 can be passed to the
- salt-master threads to display what piece of code is executing. This debug
- information can be invaluable in tracking down bugs.
- To pass a SIGUSR1 to the master, first make sure the master is running in the
- foreground. Stop the service if it is running as a daemon, and start it in the
- foreground like so:
- .. code-block:: bash
-
-     # salt-master -l debug
- Then pass the signal to the master when it seems to be unresponsive:
- .. code-block:: bash
-
-     # killall -SIGUSR1 salt-master
- When filing an issue or sending questions to the mailing list for a problem
- with an unresponsive daemon, be sure to include this information if possible.
- Live Salt-Master Profiling
- ==========================
- When faced with performance problems, you can turn on profiling of the master
- process by sending it SIGUSR2:
- .. code-block:: bash
-
-     # killall -SIGUSR2 salt-master
- This activates the ``yappi`` profiler inside the salt-master code. After some
- time, send SIGUSR2 again to stop profiling and save the results to a file. If
- run in the foreground, salt-master reports the filename of the results, which
- are usually located under ``/tmp`` on Unix-based OSes and ``c:\temp`` on Windows.
- Make sure you have yappi installed.
- Results can then be analyzed with `kcachegrind`_ or a similar tool.
- .. _`kcachegrind`: http://kcachegrind.sourceforge.net/html/Home.html
- On Windows, in the absence of kcachegrind, a simple file-based workflow to create
- profiling graphs could use `gprof2dot`_, `graphviz`_ and this batch file:
- .. _`gprof2dot`: https://pypi.org/project/gprof2dot
- .. _`graphviz`: https://graphviz.gitlab.io
- .. code-block:: batch
-
-     ::
-     :: Converts callgrind* profiler output to *.pdf, via *.dot
-     ::
-     @echo off
-     del *.dot.pdf
-     for /r %%f in (callgrind*) do (
-         echo "%%f"
-         gprof2dot.exe -f callgrind --show-samples "%%f" -o "%%f.dot"
-         dot.exe "%%f.dot" -Tpdf -O
-         del "%%f.dot"
-     )
- Commands Time Out or Do Not Return Output
- =========================================
- Depending on your OS (this is most common on Ubuntu due to apt-get), a
- :py:func:`state.apply <salt.modules.state.apply_>` or other long-running
- command may sometimes not return output.
- By default the timeout is set to 5 seconds. The timeout value can easily be
- increased by modifying the ``timeout`` line within your ``/etc/salt/master``
- configuration file.
- Having keys accepted for Salt minions that no longer exist or are not reachable
- also increases the possibility of timeouts, since the Salt master waits for
- those systems to return command results.
- Passing the -c Option to Salt Returns a Permissions Error
- =========================================================
- Using the ``-c`` option with the Salt command changes the configuration
- directory. When the configuration file is read, Salt still bases its paths on
- the ``root_dir`` setting. This can result in unintended behavior if you are
- expecting files such as ``/etc/salt/pki`` to be pulled from the location
- specified with ``-c``. Modify the ``root_dir`` setting to address this
- behavior.
- Salt Master Doesn't Return Anything While Running Jobs
- ======================================================
- When a command run via Salt takes a very long time to return (package
- installations, certain scripts, etc.), the master may drop you back to the
- shell. In most situations the job is still running, but Salt has exceeded the
- set timeout before the job returned. Querying the job queue provides the job
- data but is inconvenient. To resolve this, either use the ``-t`` option to set
- a longer timeout when running commands (the default is 5 seconds), or set the
- ``timeout`` value in the master configuration file, ``/etc/salt/master``, to
- change the default timeout for all commands, and then restart the salt-master
- service.
- If a ``state.apply`` run takes too long, you can find a bottleneck by adding the
- :py:mod:`--out=profile <salt.output.profile>` option.
- Salt Master Auth Flooding
- =========================
- In large installations, care must be taken not to overwhelm the master with
- authentication requests. Several options, set in the minion configuration,
- reduce the chance that an authentication flood causes an interruption in
- service.
- .. note::
-
-     recon_default:
-         The average number of seconds to wait between reconnection attempts.
-     recon_max:
-         The maximum number of seconds to wait between reconnection attempts.
-     recon_randomize:
-         A flag to indicate whether the recon_default value should be randomized.
-     acceptance_wait_time:
-         The number of seconds to wait for a reply to each authentication request.
-     random_reauth_delay:
-         The range of seconds across which the minions should attempt to randomize
-         authentication attempts.
-     auth_timeout:
-         The total time to wait for the authentication process to complete, regardless
-         of the number of attempts.
- Running states locally
- ======================
- To debug states, you can run them locally with ``salt-call``:
- .. code-block:: bash
-
-     salt-call -l trace --local state.highstate
- The top.sls file maps which SLS modules are loaded onto which minions by the state system.
- It is located in the directory defined by the ``file_roots`` variable of the Salt master
- configuration file, which is found in ``CONFIG_DIR/master``, normally ``/etc/salt/master``.
- The default configuration for ``file_roots`` is:
- .. code-block:: yaml
-
-     file_roots:
-       base:
-         - /srv/salt
- So the top file defaults to the location ``/srv/salt/top.sls``.
- Salt Master Umask
- =================
- The Salt master uses a cache to track jobs as they are published and as returns come back.
- The recommended umask for a salt-master is ``022``, which is the default for most users
- on a system. Incorrect umasks can result in permission-denied errors when the master
- tries to access files in its cache.
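- The umask of a process can be inspected from Python before starting the
- master from the same environment. This is a minimal sketch, not part of Salt:
- the helper names ``current_umask`` and ``umask_is_recommended`` are
- hypothetical. ``os.umask`` can only read the mask by setting a new one, so the
- sketch restores the previous value immediately, and it is not safe to call
- while other threads are creating files.

```python
import os

RECOMMENDED_UMASK = 0o022  # value recommended in the text above


def current_umask():
    """Return the umask of this process, leaving it unchanged afterwards."""
    previous = os.umask(RECOMMENDED_UMASK)  # sets a new mask, returns the old one
    os.umask(previous)  # restore the original mask
    return previous


def umask_is_recommended(mask):
    """Return True if the given mask matches the recommended value."""
    return mask == RECOMMENDED_UMASK


if __name__ == "__main__":
    mask = current_umask()
    print("umask: {:03o}".format(mask))
    print("recommended:", umask_is_recommended(mask))
```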