Opened 4 years ago

Closed 3 years ago

#26608 closed defect (fixed)

Docbuild segfaults when pari is compiled with threading

Reported by: gh-timokau Owned by: embray
Priority: major Milestone: sage-pending
Component: documentation Keywords: docbuild, pari
Cc: arojas, jdemeyer, fbissey, dimpase, saraedum, embray Merged in:
Authors: Reviewers:
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: Stopgaps:


Description (last modified by embray)

This ticket is a follow-up to this sage-packaging discussion. To summarize:

Sage does not work with PARI's threading. Instead of relying on PARI being compiled without threading, I used the "nthreads" option in #26002 to disable threading at runtime.
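
For illustration, the rough idea of "disable threading at runtime" looks like this (hedged sketch only; whether this is exactly the mechanism #26002 uses is an assumption — the real code lives in sage/libs/pari):

from cypari2 import Pari

pari = Pari()
# Ask libpari to use a single thread at runtime, even if it was compiled
# with --mt=pthread.  This evaluates GP's default(nthreads, 1) through the
# cypari2 GP parser; the exact call Sage makes is in sage/libs/pari.
pari("default(nthreads, 1)")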

However, since #24655 (unconditionally enabling the threaded docbuild), the docbuild segfaults when PARI is compiled with threading support. Apparently Sage somehow uses PARI while ignoring the nthreads option. We get the following backtrace (provided by Antonio with an older version of Sage):

Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 113, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 65, in mapstar
    return map(*args)
  File "/build/sagemath-doc/src/sage-8.0/local-python/sage_setup/docbuild/__init__.py", line 70, in build_ref_doc
    getattr(ReferenceSubBuilder(doc, lang), format)(*args, **kwds)
  File "/build/sagemath-doc/src/sage-8.0/local-python/sage_setup/docbuild/__init__.py", line 720, in _wrapper
    getattr(DocBuilder, build_type)(self, *args, **kwds)
  File "/build/sagemath-doc/src/sage-8.0/local-python/sage_setup/docbuild/__init__.py", line 104, in f
    runsphinx()
  File "/build/sagemath-doc/src/sage-8.0/local-python/sage_setup/docbuild/sphinxbuild.py", line 207, in runsphinx
    sphinx.cmdline.main(sys.argv)
  File "/usr/lib/python2.7/site-packages/sphinx/cmdline.py", line 296, in main
    app.build(opts.force_all, filenames)
  File "/usr/lib/python2.7/site-packages/sphinx/application.py", line 333, in build
    self.builder.build_update()
  File "/usr/lib/python2.7/site-packages/sphinx/builders/__init__.py", line 251, in build_update
    'out of date' % len(to_build))
  File "/usr/lib/python2.7/site-packages/sphinx/builders/__init__.py", line 265, in build
    self.doctreedir, self.app))
  File "/usr/lib/python2.7/site-packages/sphinx/environment/__init__.py", line 549, in update
    self._read_serial(docnames, app)
  File "/usr/lib/python2.7/site-packages/sphinx/environment/__init__.py", line 569, in _read_serial
    self.read_doc(docname, app)
  File "/usr/lib/python2.7/site-packages/sphinx/environment/__init__.py", line 677, in read_doc
    pub.publish()
  File "/usr/lib/python2.7/site-packages/docutils/core.py", line 217, in publish
    self.settings)
  File "/usr/lib/python2.7/site-packages/sphinx/io.py", line 55, in read
    self.parse()
  File "/usr/lib/python2.7/site-packages/docutils/readers/__init__.py", line 78, in parse
    self.parser.parse(self.input, document)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/__init__.py", line 191, in parse
    self.statemachine.run(inputlines, document, inliner=self.inliner)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 171, in run
    input_source=document['source'])
  File "/usr/lib/python2.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/usr/lib/python2.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2753, in underline
    self.section(title, source, style, lineno - 1, messages)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 327, in section
    self.new_subsection(title, lineno, messages)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 395, in new_subsection
    node=section_node, match_titles=True)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 282, in nested_parse
    node=node, match_titles=match_titles)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 196, in run
    results = StateMachineWS.run(self, input_lines, input_offset)
  File "/usr/lib/python2.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/usr/lib/python2.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2328, in explicit_markup
    self.explicit_list(blank_finish)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2358, in explicit_list
    match_titles=self.state_machine.match_titles)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 319, in nested_list_parse
    node=node, match_titles=match_titles)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 196, in run
    results = StateMachineWS.run(self, input_lines, input_offset)
  File "/usr/lib/python2.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/usr/lib/python2.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2631, in explicit_markup
    nodelist, blank_finish = self.explicit_construct(match)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2338, in explicit_construct
    return method(self, expmatch)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2081, in directive
    directive_class, match, type_name, option_presets)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2130, in run_directive
    result = directive_instance.run()
  File "/build/sagemath-doc/src/sage-8.0/src/sage_setup/docbuild/ext/sage_autodoc.py", line 1749, in run
    nested_parse_with_titles(self.state, self.result, node)
  File "/usr/lib/python2.7/site-packages/sphinx/util/nodes.py", line 208, in nested_parse_with_titles
    return state.nested_parse(content, 0, node, match_titles=1)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 282, in nested_parse
    node=node, match_titles=match_titles)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 196, in run
    results = StateMachineWS.run(self, input_lines, input_offset)
  File "/usr/lib/python2.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/usr/lib/python2.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2326, in explicit_markup
    nodelist, blank_finish = self.explicit_construct(match)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2338, in explicit_construct
    return method(self, expmatch)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2081, in directive
    directive_class, match, type_name, option_presets)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2130, in run_directive
    result = directive_instance.run()
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/__init__.py", line 410, in run
    self.state, self.state_machine)
  File "/usr/lib/python2.7/site-packages/matplotlib/sphinxext/plot_directive.py", line 189, in plot_directive
    return run(arguments, content, options, state_machine, state, lineno)
  File "/usr/lib/python2.7/site-packages/matplotlib/sphinxext/plot_directive.py", line 779, in run
    close_figs=context_opt == 'close-figs')
  File "/usr/lib/python2.7/site-packages/matplotlib/sphinxext/plot_directive.py", line 644, in render_figures
    run_code(code_piece, code_path, ns, function_name)
  File "/usr/lib/python2.7/site-packages/matplotlib/sphinxext/plot_directive.py", line 524, in run_code
    six.exec_(code, ns)
  File "/usr/lib/python2.7/site-packages/six.py", line 709, in exec_
    exec("""exec _code_ in _globs_, _locs_""")
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <module>
  File "sage/misc/classcall_metaclass.pyx", line 329, in sage.misc.classcall_metaclass.ClasscallMetaclass.__call__ (build/cythonized/sage/misc/classcall_metaclass.c:1698)
    if cls.classcall is not None:
  File "/usr/lib/python2.7/site-packages/sage/geometry/triangulation/point_configuration.py", line 331, in __classcall__
    .__classcall__(cls, points, connected, fine, regular, star, defined_affine)
  File "sage/misc/cachefunc.pyx", line 1005, in sage.misc.cachefunc.CachedFunction.__call__ (build/cythonized/sage/misc/cachefunc.c:6065)
    ArgSpec(args=['self', 'algorithm', 'deg_bound', 'mult_bound', 'prot'],
  File "/usr/lib/python2.7/site-packages/sage/structure/unique_representation.py", line 1027, in __classcall__
    instance = typecall(cls, *args, **options)
  File "sage/misc/classcall_metaclass.pyx", line 496, in sage.misc.classcall_metaclass.typecall (build/cythonized/sage/misc/classcall_metaclass.c:2148)
    """
  File "/usr/lib/python2.7/site-packages/sage/geometry/triangulation/point_configuration.py", line 367, in __init__
    PointConfiguration_base.__init__(self, points, defined_affine)
  File "sage/geometry/triangulation/base.pyx", line 398, in sage.geometry.triangulation.base.PointConfiguration_base.__init__ (build/cythonized/sage/geometry/triangulation/base.cpp:4135)
    self._init_points(points)
  File "sage/geometry/triangulation/base.pyx", line 456, in sage.geometry.triangulation.base.PointConfiguration_base._init_points (build/cythonized/sage/geometry/triangulation/base.cpp:4982)
    red = matrix([ red.row(i) for i in red.pivot_rows()])
  File "sage/matrix/matrix2.pyx", line 517, in sage.matrix.matrix2.Matrix.pivot_rows (build/cythonized/sage/matrix/matrix2.c:8414)
    """
  File "sage/matrix/matrix_integer_dense.pyx", line 2217, in sage.matrix.matrix_integer_dense.Matrix_integer_dense.pivots (build/cythonized/sage/matrix/matrix_integer_dense.c:19162)
    sage: matrix(3, range(9)).elementary_divisors()
  File "sage/matrix/matrix_integer_dense.pyx", line 2019, in sage.matrix.matrix_integer_dense.Matrix_integer_dense.echelon_form (build/cythonized/sage/matrix/matrix_integer_dense.c:17749)
   
  File "sage/matrix/matrix_integer_dense.pyx", line 5719, in sage.matrix.matrix_integer_dense.Matrix_integer_dense._hnf_pari (build/cythonized/sage/matrix/matrix_integer_dense.c:46635)
    most `\max\mathcal{S}` where `\mathcal{S}` denotes the full
SignalError: Segmentation fault 

That shows that src/sage/matrix/matrix_integer_dense.pyx is involved. Apparently that file uses the cypari C bindings directly instead of going through the libs/pari.py interface (where the nthreads option is set). For example:

def LLL_gram(self, flag = 0):
    if self._nrows != self._ncols:
        raise ArithmeticError("self must be a square matrix")

    n = self.nrows()
    # maybe should be /unimodular/ matrices ?
    P = self.__pari__()
    try:
        U = P.qflllgram(flag)
    except (RuntimeError, ArithmeticError) as msg:
        raise ValueError("qflllgram failed, "
                         "perhaps the matrix is not positive definite")
    if U.matsize() != [n, n]:
        raise ValueError("qflllgram did not return a square matrix, "
                         "perhaps the matrix is not positive definite");
    MS = matrix_space.MatrixSpace(ZZ,n)
    U = MS(U.sage())
    # Fix last column so that det = +1
    if U.det() == -1:
        for i in range(n):
            U[i,n-1] = - U[i,n-1]
    return U

Can someone more familiar with Cython and cypari tell whether the options defined in libs/pari.py would apply here? Why isn't libs/pari.py used?

Change History (45)

comment:1 Changed 4 years ago by embray

I think there's a little bit of misinformation / misconception here.

There's nothing about Sage's docbuild program that uses multi-threading. It uses a process pool and builds each sub-document in separate processes.

(There are some cases where it does not run builds in subprocesses when it probably should, and I think that is contributing somewhat to the explosion of memory usage in the docbuild, but that's a separate issue).
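
Schematically, the docbuild does something like this (names and arguments simplified; this is only an illustration of the pattern, not the actual docbuild code):

from multiprocessing import Pool

def build_ref_doc(args):
    doc, lang, fmt = args
    # Run Sphinx for a single reference sub-document in this worker *process*.
    return doc

if __name__ == '__main__':
    jobs = [("sage/categories", "en", "html"), ("sage/matrix", "en", "html")]
    pool = Pool(processes=4)
    try:
        pool.map(build_ref_doc, jobs)   # each job runs in a forked subprocess
    finally:
        pool.close()
        pool.join()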

comment:2 Changed 4 years ago by embray

I don't know if PARI uses openblas in its multi-threaded mode but I wonder if this is related to #26585

comment:3 Changed 4 years ago by arojas

Note that the code excerpts in the last lines of the backtrace are nonsense since I was compiling an older version of the docs. Here's the "translated" version:

  File "sage/misc/cachefunc.pyx", line 1005, in sage.misc.cachefunc.CachedFunction.__call__ (build/cythonized/sage/misc/cachefunc.c:6065)
    w = self.f(*args, **kwds)
  File "/usr/lib/python2.7/site-packages/sage/structure/unique_representation.py", line 1027, in __classcall__
    instance = typecall(cls, *args, **options)
  File "sage/misc/classcall_metaclass.pyx", line 496, in sage.misc.classcall_metaclass.typecall (build/cythonized/sage/misc/classcall_metaclass.c:2148)
    return (<PyTypeObject*>type).tp_call(cls, args, kwds)
  File "/usr/lib/python2.7/site-packages/sage/geometry/triangulation/point_configuration.py", line 367, in __init__
    PointConfiguration_base.__init__(self, points, defined_affine)
  File "sage/geometry/triangulation/base.pyx", line 398, in sage.geometry.triangulation.base.PointConfiguration_base.__init__ (build/cythonized/sage/geometry/triangulation/base.cpp:4135)
    self._init_points(points)
  File "sage/geometry/triangulation/base.pyx", line 456, in sage.geometry.triangulation.base.PointConfiguration_base._init_points (build/cythonized/sage/geometry/triangulation/base.cpp:4982)
    red = matrix([ red.row(i) for i in red.pivot_rows()])
  File "sage/matrix/matrix2.pyx", line 517, in sage.matrix.matrix2.Matrix.pivot_rows (build/cythonized/sage/matrix/matrix2.c:8414)
    v = self.transpose().pivots()
  File "sage/matrix/matrix_integer_dense.pyx", line 2217, in sage.matrix.matrix_integer_dense.Matrix_integer_dense.pivots (build/cythonized/sage/matrix/matrix_integer_dense.c:19162)
    E = self.echelon_form()
  File "sage/matrix/matrix_integer_dense.pyx", line 2019, in sage.matrix.matrix_integer_dense.Matrix_integer_dense.echelon_form (build/cythonized/sage/matrix/matrix_integer_dense.c:17749)
    H_m = self._hnf_pari(flag, include_zero_rows=include_zero_rows)
  File "sage/matrix/matrix_integer_dense.pyx", line 5719, in sage.matrix.matrix_integer_dense.Matrix_integer_dense._hnf_pari (build/cythonized/sage/matrix/matrix_integer_dense.c:46635)
    sig_on()
SignalError: Segmentation fault 

comment:4 Changed 4 years ago by gh-timokau

Description: modified (diff)

comment:5 in reply to:  2 ; Changed 4 years ago by gh-timokau

Replying to embray:

I don't know if PARI uses openblas in its multi-threaded mode but I wonder if this is related to #26585

I'll test if that openblas patch fixes it. Very interesting ticket, I wonder if that also causes #26130 (I've heard darwin is somewhat more prone to threading bugs).

Last edited 4 years ago by gh-timokau (previous) (diff)

comment:6 Changed 4 years ago by embray

I'm not sure if it does, but it might. I grepped the pari/gp source and it doesn't use openblas_set_num_threads directly, but something else might be. I did see some reference to omp_set_num_threads but I don't think we compile with OpenMP by default in Sage.

comment:7 Changed 4 years ago by embray

There's also some multi-threading support in FLINT which could be problematic, but I have no idea if that's relevant in this case.

comment:8 Changed 4 years ago by embray

I read in the mailing list post: "It is called indirectly via matplotlib when rendering plots, see full backtrace below (btw, I had to downgrade to an old version of Sage to get a meaningful backtrace - I really dislike this trend of hiding build output, it makes it very hard to debug stuff)"

I had a very similar problem to this; it actually came from the BLAS library by way of a NumPy ufunc (I think for the "dot" product of a matrix and a vector, or of two matrices). I feel like I actually fixed this, but now I can't remember.

comment:9 Changed 4 years ago by embray

Do you know some direct, specific way to reproduce this so that I can try it?

comment:10 in reply to:  5 ; Changed 4 years ago by embray

Replying to gh-timokau:

Replying to embray:

I don't know if PARI uses openblas in its multi-threaded mode but I wonder if this is related to #26585

I'll test if that openblas patch fixes it. Very interesting ticket, I wonder if that also causes #26130 (I've heard darwin is somewhat more prone to threading bugs).

I don't think it's related, because this only showed up when we patched fflas-ffpack to allow configuring the number of threads to use with openblas (by default it just sets it to 1). But conceivably there's a similar bug elsewhere. Possibly related to fork(). I have found many bugs in different projects related to threads/fork interaction.
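
The classic shape of such a thread/fork bug, as a minimal sketch (Python 3, POSIX only; not PARI-specific):

import os
import threading
import time

lock = threading.Lock()

def hold_lock():
    with lock:
        time.sleep(2)       # lock is held while the main thread forks

threading.Thread(target=hold_lock).start()
time.sleep(0.5)             # make sure the worker thread owns the lock
pid = os.fork()
if pid == 0:
    # Child: it inherits the locked lock, but not the thread that would
    # release it, so acquiring deadlocks (here bounded by a timeout).
    if not lock.acquire(timeout=3):
        print("child: deadlocked on a lock held at fork() time")
    os._exit(0)
os.waitpid(pid, 0)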

comment:11 Changed 4 years ago by embray

You know what, though--I'm looking at the relevant code in openblas, and openblas_set_num_threads(1) might actually cause it to spin up a single thread (beyond the main thread), which is enough to invoke the bug I fixed. I'm going to try to confirm that.

comment:12 in reply to:  6 Changed 4 years ago by jdemeyer

Replying to embray:

I'm not sure if it does, but it might. I grepped the pari/gp source and it doesn't use openblas_set_num_threads directly, but something else might be.

I don't think that PARI uses BLAS in any way.

comment:13 Changed 4 years ago by jdemeyer

Description: modified (diff)

comment:14 in reply to:  10 ; Changed 4 years ago by jdemeyer

Replying to embray:

I have found many bugs in different projects related to threads/fork interaction.

I'm also betting on this. PARI might set up some data structures related to threading (when compiled with threading support) which are invalid when running in a forked child process.

comment:15 Changed 4 years ago by embray

Description: modified (diff)

Never mind; that does not appear to be the case after all.

comment:16 in reply to:  14 Changed 4 years ago by embray

Replying to jdemeyer:

Replying to embray:

I have found many bugs in different projects related to threads/fork interaction.

I'm also betting on this. PARI might set up some data structures related to threading (when compiled with threading support) which are invalid when running in a forked child process.

Yes, I think you must be right. PARI has its own thread management, and it does not implement any pthread_atfork handler that I can find, which is strong cause to suspect it...

comment:17 Changed 4 years ago by embray

ISTM PARI/GP is not even built with multi-threading enabled unless you run its Configure with either --mt=pthread or --mt=mpi. On my system anyways this is not done by default...

Last edited 4 years ago by embray (previous) (diff)

comment:18 Changed 4 years ago by embray

Now if I build PARI with --mt=pthread I can make a child process segfault if it tries to do some multi-threaded work. Granted, this is partly with my own bad code which does some things improperly. Now I'd be curious if

a) anyone having this problem is building PARI with --mt=pthread, and

b) Exactly what code is being run in Sage that invokes multi-threading in PARI.

Last edited 4 years ago by embray (previous) (diff)

comment:19 in reply to:  17 Changed 4 years ago by gh-timokau

Replying to embray:

ISTM PARI/GP is not even built with multi-threading enabled unless you run its Configure with either --mt=pthread or --mt=mpi. On my system anyways this is not done by default...

Yes, that is why this issue came up on sage-packaging. Some distros ship pari with threading enabled. Sage does not. My effort in #26002 was not to change that, but to make sage compatible with a system pari.

comment:20 Changed 4 years ago by embray

I can also make it deadlock with the right combination of evil calls.

comment:21 Changed 4 years ago by gh-timokau

Even when you explicitly set nthreads to 1 before going forth with that evilness?

comment:22 Changed 4 years ago by embray

I don't know. I've been sick for the last week, so I completely forget exactly where I left this. Nevertheless, now I know that unless I compile PARI with --mt=pthread the problem won't occur at all. So I'll try doing that again, and then see if I can figure out exactly what code in Sage exhibits the bug. That will help me pinpoint it.

comment:23 Changed 4 years ago by gh-timokau

Okay, glad you're feeling better :)

Let me know if I can help.

comment:24 Changed 4 years ago by embray

Milestone: sage-8.5 → sage-8.7
Owner: set to embray

Totally forgot about this...

comment:25 Changed 4 years ago by jdemeyer

Too bad, I also forgot about this. I'm literally right now returning from a week-long PARI/GP workshop where I could have discussed this.

comment:26 Changed 4 years ago by arojas

The alternative multiprocess doc build introduced in #27490 works as a temporary workaround. I just replaced the revert with a one-line patch to enable it unconditionally for 8.7.

comment:27 Changed 4 years ago by embray

Milestone: sage-8.7 → sage-pending

Removing most of the rest of my open tickets out of the 8.7 milestone, which should be closed.

comment:28 Changed 4 years ago by dimpase

See also #28242, where we enable system Pari in vanilla Sage.

comment:29 in reply to:  18 Changed 4 years ago by embray

Replying to embray:

Now if I build PARI with --mt=pthread I can make a child process segfault if it tries to do some multi-threaded work. Granted, this is partly with my own bad code which does some things improperly. Now I'd be curious if

I wish I knew what I meant by this, because I want to investigate this again but I don't have a clear way yet to even reproduce the issue, and I can't find whatever example code this might be referring to... :(

comment:30 Changed 4 years ago by embray

Some PARI notes regarding threading:

  • Setting --mt=pthread in PARI's Configure script results in it compiling a test file called config/pthread.c, and if that succeeds it sets a variable thread_engine=pthread and enable_tls=yes, and should print "Using mt engine pthread" (similar result if you use mpi, but for now I'm just looking at pthread).
  • This also outputs the macro #define PARI_MT_ENGINE "pthread" in the generated paricfg.h.
    • In src/language/paricfg.c it sets a global constant const char *paricfg_mt_engine = PARI_MT_ENGINE;. This is only used when printing the version info like
      $ gp --version
                                              GP/PARI CALCULATOR Version 2.11.1 (released)
                                      amd64 running linux (x86-64/GMP-6.0.0 kernel) 64-bit version
                                compiled: Aug  7 2019, gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.4)
                                                        threading engine: pthread
                                             (readline v6.3 disabled, extended help enabled)
      
  • In the generated Makefile this also adds some bits, such as MT_LIBS=-lpthread; the generated parimt.h contains the contents of src/mt/pthread.h; and the module src/mt/pthread.c, which implements a standard interface (including the pari_mt_init function), is built alongside some generic code in src/mt/mt.c.
    • If instead we'd had --mt=single (the default) it would compile src/mt/single.c which implements the same interfaces, mostly as no-ops (its pari_mt_init just sets the global variable pari_mt_nbthreads = 1).
  • In src/mt/pthread.c there is a global pointer declared static struct mt_pstate *pari_mt. This is a pointer to a struct mt_pstate which contains several variables related to the state of pari's thread pool, including pointers to the threads themselves.
    • When functions like mt_queue_start() or mt_queue_start_lim() (for which mt_queue_start() is just a wrapper) are called, they initialize one of these mt_pstate structs and set the global pari_mt to point to it.
    • The global pari_mt is also referenced in mt_queue_reset().
    • pari_mt also contains a reference to one mutex, which is used when calling mt_queue_get() to synchronize access to per-thread results in such a way that it blocks until a result is available (but still allows interruption by SIGINT).
Last edited 4 years ago by embray (previous) (diff)

comment:31 Changed 4 years ago by dimpase

and Sage installs paricfg.h with

#define PARI_MT_ENGINE "single"

Well, I can use this on #28242 to test whether we got single-threaded libpari.
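
For example, a check along these lines (the include path is an assumption; adjust for the actual install prefix):

import pathlib
import re

# Read the installed PARI configuration header and report which threading
# engine it was built with ("single" vs "pthread"/"mpi").
cfg = pathlib.Path("/usr/include/pari/paricfg.h").read_text()
m = re.search(r'#define\s+PARI_MT_ENGINE\s+"([^"]+)"', cfg)
print("PARI mt engine:", m.group(1) if m else "not found")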

comment:32 Changed 4 years ago by embray

Managed to reproduce the problem again by doing a parallel docbuild after building PARI with --mt=pthread, which eventually gives a segfault.

Unfortunately the build_many function doesn't give us any info on the traceback from the original exception, so I added the following patch to the docbuild code so it would at least print the traceback:

  • src/sage_setup/docbuild/__init__.py

    diff --git a/src/sage_setup/docbuild/__init__.py b/src/sage_setup/docbuild/__init__.py
    index e406bca..4912b5c 100644
    --- a/src/sage_setup/docbuild/__init__.py
    +++ b/src/sage_setup/docbuild/__init__.py
    @@ -49,6 +49,7 @@ import shutil
     import subprocess
     import sys
     import time
    +import traceback
     import warnings
     
     logger = logging.getLogger(__name__)
    @@ -136,6 +137,8 @@ def builder_helper(type):
                 if ABORT_ON_ERROR:
                     raise
             except BaseException as e:
    +            exc_type, exc_value, exc_traceback = sys.exc_info()
    +            traceback.print_tb(exc_traceback)
                 # We need to wrap a BaseException that is not an Exception in a
                 # regular Exception. Otherwise multiprocessing.Pool.get hangs, see
                 # #25161

With that, I get a long traceback (much of which is stuff in Sphinx that isn't interesting). But as previously reported--and unsurprisingly--the segfault originates from some code for a plot:

  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/matplotlib/sphinxext/plot_directive.py", line 644, in render_figures
    run_code(code_piece, code_path, ns, function_name)
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/matplotlib/sphinxext/plot_directive.py", line 524, in run_code
    six.exec_(code, ns)
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/six.py", line 709, in exec_
    exec("""exec _code_ in _globs_, _locs_""")
  File "<string>", line 1, in <module>
  File "<string>", line 2, in <module>
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/categories/finite_coxeter_groups.py", line 750, in permutahedron
    vertices = [v.change_ring(AA) for v in vertices]
  File "sage/modules/free_module_element.pyx", line 1495, in sage.modules.free_module_element.FreeModuleElement.change_ring (build/cythonized/sage/modules/free_module_element.c:11182)
    return M(self.list(), coerce=True)
  File "sage/structure/parent.pyx", line 902, in sage.structure.parent.Parent.__call__ (build/cythonized/sage/structure/parent.c:9225)
    return mor._call_with_args(x, args, kwds)
  File "sage/structure/coerce_maps.pyx", line 171, in sage.structure.coerce_maps.DefaultConvertMap_unique._call_with_args (build/cythonized/sage/structure/coerce_maps.c:4872)
    return C._element_constructor(x, **kwds)
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/modules/free_module.py", line 5601, in _element_constructor_
    return FreeModule_generic_field._element_constructor_(self, e, *args, **kwds)
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/modules/free_module.py", line 1028, in _element_constructor_
    return self.element_class(self, x, coerce, copy)
  File "sage/modules/free_module_element.pyx", line 4119, in sage.modules.free_module_element.FreeModuleElement_generic_dense.__init__ (build/cythonized/sage/modules/free_module_element.c:28564)
    entries = [coefficient_ring(x) for x in entries]
  File "sage/structure/parent.pyx", line 900, in sage.structure.parent.Parent.__call__ (build/cythonized/sage/structure/parent.c:9198)
    return mor._call_(x)
  File "sage/structure/coerce_maps.pyx", line 157, in sage.structure.coerce_maps.DefaultConvertMap_unique._call_ (build/cythonized/sage/structure/coerce_maps.c:4449)
    return C._element_constructor(x)
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/qqbar.py", line 759, in _element_constructor_
    return x._algebraic_(AA)
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/universal_cyclotomic_field.py", line 606, in _algebraic_
    return R(QQbar(self))
  File "sage/structure/parent.pyx", line 900, in sage.structure.parent.Parent.__call__ (build/cythonized/sage/structure/parent.c:9198)
    return mor._call_(x)
  File "sage/categories/map.pyx", line 789, in sage.categories.map.Map._call_ (build/cythonized/sage/categories/map.c:6953)
    cpdef Element _call_(self, x):
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/universal_cyclotomic_field.py", line 260, in _call_
    zeta = QQbar.zeta(k)
  File "sage/misc/cachefunc.pyx", line 1949, in sage.misc.cachefunc.CachedMethodCaller.__call__ (build/cythonized/sage/misc/cachefunc.c:10274)
    w = self._instance_call(*args, **kwds)
  File "sage/misc/cachefunc.pyx", line 1825, in sage.misc.cachefunc.CachedMethodCaller._instance_call (build/cythonized/sage/misc/cachefunc.c:9759)
    return self.f(self._instance, *args, **kwds)
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/qqbar.py", line 1366, in zeta
    nf = CyclotomicField(n)
  File "sage/structure/factory.pyx", line 369, in sage.structure.factory.UniqueFactory.__call__ (build/cythonized/sage/structure/factory.c:2146)
    return self.get_object(version, key, kwds)
  File "sage/structure/factory.pyx", line 406, in sage.structure.factory.UniqueFactory.get_object (build/cythonized/sage/structure/factory.c:2350)
    return self._cache[version, cache_key]
  File "sage/misc/weak_dict.pyx", line 704, in sage.misc.weak_dict.WeakValueDictionary.__getitem__ (build/cythonized/sage/misc/weak_dict.c:3653)
    cdef PyObject* wr = PyDict_GetItemWithError(self, k)
  File "sage/cpython/dict_del_by_value.pyx", line 58, in sage.cpython.dict_del_by_value.PyDict_GetItemWithError (build/cythonized/sage/cpython/dict_del_by_value.c:1261)
    ep = mp.ma_lookup(mp, <PyObject*><void*>key, PyObject_Hash(key))
  File "sage/rings/real_lazy.pyx", line 1398, in sage.rings.real_lazy.LazyNamedUnop.__hash__ (build/cythonized/sage/rings/real_lazy.c:15533)
    return hash(complex(self))
  File "sage/rings/real_lazy.pyx", line 822, in sage.rings.real_lazy.LazyFieldElement.__complex__ (build/cythonized/sage/rings/real_lazy.c:9872)
    return self.eval(complex)
  File "sage/rings/real_lazy.pyx", line 1352, in sage.rings.real_lazy.LazyNamedUnop.eval (build/cythonized/sage/rings/real_lazy.c:14982)
    arg = self._arg.eval(R)
  File "sage/rings/real_lazy.pyx", line 1129, in sage.rings.real_lazy.LazyBinop.eval (build/cythonized/sage/rings/real_lazy.c:12722)
    left = self._left.eval(R)
  File "sage/rings/real_lazy.pyx", line 1130, in sage.rings.real_lazy.LazyBinop.eval (build/cythonized/sage/rings/real_lazy.c:12734)
    right = self._right.eval(R)
  File "sage/rings/real_lazy.pyx", line 1647, in sage.rings.real_lazy.LazyAlgebraic.eval (build/cythonized/sage/rings/real_lazy.c:17915)
    self.eval(self.parent().interval_field(64)) # up the prec
  File "sage/rings/real_lazy.pyx", line 1673, in sage.rings.real_lazy.LazyAlgebraic.eval (build/cythonized/sage/rings/real_lazy.c:18066)
    roots = self._poly.roots(ring = AA if isinstance(self._parent, RealLazyField_class) else QQbar)
  File "sage/rings/polynomial/polynomial_element.pyx", line 7721, in sage.rings.polynomial.polynomial_element.Polynomial.roots (build/cythonized/sage/rings/polynomial/polynomial_element.c:61844)
    rts = complex_roots(self, retval='algebraic')
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/polynomial/complex_roots.py", line 258, in complex_roots
    rts = cfac.roots(multiplicities=False)
  File "sage/rings/polynomial/polynomial_element.pyx", line 7629, in sage.rings.polynomial.polynomial_element.Polynomial.roots (build/cythonized/sage/rings/polynomial/polynomial_element.c:59297)
    ext_rts = self.__pari__().polroots(precision=L.prec())
  File "sage/rings/polynomial/polynomial_element.pyx", line 6021, in sage.rings.polynomial.polynomial_element.Polynomial.__pari__ (build/cythonized/sage/rings/polynomial/polynomial_element.c:49003)
    return self._pari_with_name(self._parent.variable_name())
  File "sage/rings/polynomial/polynomial_element.pyx", line 6074, in sage.rings.polynomial.polynomial_element.Polynomial._pari_with_name (build/cythonized/sage/rings/polynomial/polynomial_element.c:49395)
    vals = [x.__pari__() for x in self.list()]
  File "sage/rings/complex_number.pyx", line 593, in sage.rings.complex_number.ComplexNumber.__pari__ (build/cythonized/sage/rings/complex_number.c:7107)
    return self.real().__pari__()
  File "sage/rings/real_mpfr.pyx", line 3248, in sage.rings.real_mpfr.RealNumber.__pari__ (build/cythonized/sage/rings/real_mpfr.c:22355)
    sig_on()

Here it's probably building the reference docs for sage.categories.finite_coxeter_groups, which contains a plot using CoxeterGroup.permutahedron(), which in turn, much further down the call chain, happens to use PARI to compute some complex roots of a polynomial.

It looks like it doesn't actually reach the polroots() call; it crashes earlier, somewhere in converting the polynomial coefficients to PARI values...?
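
Based on that traceback, a possible direct reproducer might look like the following (untested sketch; H3 is just an example group, and it should only crash if PARI was built with --mt=pthread):

from multiprocessing import Pool

def run_plot_code(_):
    # Roughly the code the finite_coxeter_groups plot directive executes;
    # per the traceback, permutahedron() eventually reaches PARI (polroots).
    from sage.all import CoxeterGroup
    W = CoxeterGroup(["H", 3])      # example group; the doc plot may use another
    W.permutahedron()
    return True

if __name__ == '__main__':
    # maxtasksperchild=1 forces worker respawns; per comment:37 those
    # respawned workers are forked from a pool-management thread rather
    # than the main thread, which is the situation the docbuild ends up in.
    pool = Pool(processes=1, maxtasksperchild=1)
    try:
        pool.map(run_plot_code, range(4))
    finally:
        pool.close()
        pool.join()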

Last edited 4 years ago by embray (previous) (diff)

comment:33 Changed 4 years ago by embray

The nffactor function in PARI ("Factorization of the univariate polynomial (or rational function) T over the number field nf") internally uses a parallel Chinese remainder routine, the internal function polint_chinese, which results in a thread pool being launched. nffactor is being called at some point during the build of the categories reference docs.
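
For reference, poking that code path directly from Python looks something like this (sketch; nfinit/nffactor are the GP functions cypari2 exposes, and the particular polynomials are arbitrary):

from cypari2 import Pari

pari = Pari()
nf = pari.nfinit("y^2 - 5")          # a small number field
# nffactor() internally uses the parallel Chinese remainder code
# (polint_chinese), so with --mt=pthread it may start PARI's thread pool.
print(pari.nffactor(nf, "x^4 - 1"))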

comment:34 Changed 4 years ago by dimpase

If #26002 has any effect, the pool should have just a single thread, no?

comment:35 Changed 4 years ago by embray

After further digging: yes, I just found #26002, and indeed PARI has pari_mt_nbthreads = 1, so maybe I'm barking up the wrong tree. The polint_chinese call indeed does not use any threads in this case.

comment:36 Changed 4 years ago by dimpase

Hmm, maybe it still does something with thread-local variables?

comment:37 Changed 4 years ago by embray

On a wild guess, I tried switching the docbuild to use my multiprocessing.Pool replacement from #27490 (this is on Linux), and the crashes no longer occur.

The major difference is that in multiprocessing.Pool, each docbuild subprocess is forked from a thread other than the main thread, whereas in my replacement they're forked directly from the main thread.

Somehow this alone is enough to leave some structures in PARI in a bad state, and only if it was built with multithreading support in the first place (apparently).

comment:38 Changed 4 years ago by embray

Going back to comment:30: if --mt=pthread is given, PARI's Configure sets the variable enable_tls="yes". This in turn leads to defining a macro in paricfg.h called ENABLE_TLS.

The only effect of ENABLE_TLS is in src/headers/parisys.h:

#ifdef ENABLE_TLS
#  define THREAD __thread
#else
#  define THREAD
#endif

So variables in PARI declared THREAD use TLS in this case. This alone could be enough to suspect strange behavior in PARI when it's forked from a thread...

comment:39 in reply to:  37 Changed 4 years ago by dimpase

Replying to embray:

On a wild guess, I tried switching the docbuild to use my multiprocessing.Pool replacement from #27490 (this is on Linux), and the crashes no longer occur.

This is matching what I observed on #28242 (and the workaround I added to that branch) -- so we're in the same wilderness :-)

comment:40 in reply to:  38 Changed 4 years ago by embray

Replying to embray:

So variables in PARI declared THREAD use TLS in this case. This alone could be enough to suspect strange behavior in PARI when it's forked from a thread...

Yup. That's really all it is. PARI has tons of global variables declared __thread, not least of them pari_mainstack. When multiprocessing.Pool spins up a new thread, all of those __thread variables are suddenly back at their initialization values (typically 0x0).

This isn't so unusual in PARI's case. It assumes that it's the only one that will be starting new threads that it manages. It does not assume it will ever be used in someone else's multi-threaded application.
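
A pure-Python analogue of the effect (threading.local standing in for C __thread variables; POSIX only because of os.fork):

import os
import threading

tls = threading.local()
tls.value = "set in the main thread"   # stands in for pari_init() filling __thread globals

def fork_from_worker():
    pid = os.fork()
    if pid == 0:
        # The child's only thread is (a copy of) this worker thread, whose
        # thread-local view was never populated -- analogous to PARI's
        # __thread globals being back at their zero/initial values.
        print("child sees:", getattr(tls, "value", None))   # -> None
        os._exit(0)
    os.waitpid(pid, 0)

t = threading.Thread(target=fork_from_worker)
t.start()
t.join()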

comment:41 Changed 3 years ago by embray

#28356 proposes a workaround for this issue.

It won't solve the issue in general (PARI is not safe to use in arbitrary multi-threaded code), but at least it won't crash when building the docs.

comment:42 Changed 3 years ago by jdemeyer

Did somebody check that this problem is limited to docbuilds? Does the Sage testsuite pass (apart from doc-related tests of course)?

comment:43 in reply to:  42 Changed 3 years ago by arojas

Replying to jdemeyer:

Did somebody check that this problem is limited to docbuilds? Does the Sage testsuite pass (apart from doc-related tests of course)?

Yes, there are no test suite failures related to this.

comment:44 in reply to:  42 Changed 3 years ago by embray

Replying to jdemeyer:

Did somebody check that this problem is limited to docbuilds? Does the Sage testsuite pass (apart from doc-related tests of course)?

In principle it's not just limited to docbuilds: the broader problem, for which I would like to find a better resolution, is that the cypari2 Pari instances are simply not thread-safe. A more general approach would be to make Sage's single Pari instance actually thread-local.
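
As a sketch of that idea only (it is not claimed that cypari2 currently supports independent per-thread instances; that open question is exactly the broader issue):

import threading
import cypari2

_tls = threading.local()

def local_pari():
    # Hand out one Pari handle per thread instead of one shared instance.
    # Whether cypari2/libpari can actually be initialized per thread like
    # this is unresolved; this only illustrates the interface.
    if not hasattr(_tls, "pari"):
        _tls.pari = cypari2.Pari()
    return _tls.pari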

As it is, Sage doesn't use threads for much of anything, whether in the tests, or in general, so the problem arises primarily in the docbuild.

But this can cause problems for anyone who carelessly tries to use multiprocessing.Pool, or threads in general, in Sage* :(

* If they do anything in those threads that happens to use PARI.

Last edited 3 years ago by embray (previous) (diff)

comment:45 Changed 3 years ago by embray

Resolution: fixed
Status: new → closed

Closing this ticket for now, since the immediate problem (Sage's docbuild) has been worked around by #28356. The broader issue still should have a ticket to track it, even if it won't be resolved any time soon, so I have opened #28800.
