The Wayback Machine - https://web.archive.org//web/20211011205426/https://github.com/erthink/libmdbx/issues/217
Crashes and excessive memory + CPU consumption when dealing with corrupted databases #217

Closed
debrouxl opened this issue Jul 7, 2021 · 19 comments

@debrouxl debrouxl commented Jul 7, 2021

I've now spent a bit of time fuzzing libmdbx, like I fuzzed Berkeley DB, LMDB, GDBM, TDB and other databases inspired by BDB and/or DBM in the past. Sorry, I didn't find out about libmdbx until fairly recently...

In the README and Makefile, I read that you worked on fixing some crashes, and that you have asan & ubsan test targets - so clearly, you paid at least some level of attention to memory safety and UB, which is a good thing.
However, with a Time To First Crash around 2 minutes, '2021 libmdbx tolerates corrupted databases better than '2018 and '2021 LMDB do (TTFC << 1s on mdb_dump), and marginally better than Berkeley DB 18.1.40 (yes, the latest version, despite dozens of fixes for CVE-numbered issues over the years... I basically gave up reporting issues) does, but libmdbx is not quite fool-proof yet :)

Building and starting a first, simple fuzzing job is straightforward, along the lines of:

git clone https://github.com/erthink/libmdbx
cd libmdbx
AFL_USE_ASAN=1 CC=$HOME/AFLplusplus/afl-clang-fast DESTDIR=$HOME/libmdbx_prefix_asan make install V=1
cd ..
mkdir libmdbx_fuzz
cd libmdbx_fuzz/
mkdir input
cd input
echo "" | $HOME/libmdbx_prefix_asan/usr/local/bin/mdbx_load -T -n empty
echo -e "key1\nvalue1" | $HOME/libmdbx_prefix_asan/usr/local/bin/mdbx_load -T -n one
cd ..
mkdir /dev/shm/libmdbx_tmpdir
AFL_TMPDIR=/dev/shm/libmdbx_tmpdir $HOME/AFLplusplus/afl-fuzz -i input -o output -M master -- $HOME/libmdbx_prefix_asan/usr/local/bin/mdbx_chk @@

(the AFL++ setup, which basically reduces to git clone and make when the build dependencies are installed, is not described here, for brevity)

I stopped the mdbx_chk fuzzing process a bit after reaching 1M execs. Triaging the crashes already showed 5 unique code locations and SIGBUS, SIGSEGV, weirdness when unpoisoning memory, use after poison through wild pointers: that's enough to warrant creating this issue and provide the information which can enable you to perform your own fuzzing jobs.
The final afl-fuzz output was:

                american fuzzy lop ++3.14a (master) [fast] {0}
+- process timing ------------------------------------+- overall results ----+
|        run time : 0 days, 0 hrs, 42 min, 20 sec     |  cycles done : 12    |
|   last new path : 0 days, 0 hrs, 0 min, 54 sec      |  total paths : 240   |
| last uniq crash : 0 days, 0 hrs, 0 min, 18 sec      | uniq crashes : 48    |
|  last uniq hang : 0 days, 0 hrs, 6 min, 3 sec       |   uniq hangs : 15    |
+- cycle progress ---------------------+- map coverage+----------------------+
|  now processing : 3*2 (1.2%)         |    map density : 6.16% / 10.04%     |
| paths timed out : 0 (0.00%)          | count coverage : 1.65 bits/tuple    |
+- stage progress ---------------------+- findings in depth -----------------+
|  now trying : splice 3               | favored paths : 81 (33.75%)         |
| stage execs : 62/110 (56.36%)        |  new edges on : 120 (50.00%)        |
| total execs : 1.05M                  | total crashes : 8155 (48 unique)    |
|  exec speed : 447.0/sec              |  total tmouts : 78 (29 unique)      |
+- fuzzing strategy yields ------------+-------------+- path geometry -------+
|   bit flips : disabled (default, enable with -D)   |    levels : 11        |
|  byte flips : disabled (default, enable with -D)   |   pending : 55        |
| arithmetics : disabled (default, enable with -D)   |  pend fav : 0         |
|  known ints : disabled (default, enable with -D)   | own finds : 235       |
|  dictionary : n/a                                  |  imported : 0         |
|havoc/splice : 202/531k, 81/509k                    | stability : 99.54%    |
|py/custom/rq : unused, unused, unused, unused       +-----------------------+
|    trim/eff : disabled, disabled                   |          [cpu000:100%]
+----------------------------------------------------+

The crash triage output is part of the tarball.

NOTE: in order to reproduce crashes, the best practice is to start from fresh copies of the files. The output of AddressSanitizer killing mdbx_chk on the attached files seems to be stable from one run to the next one (apart from randomized addresses, of course), but for instance, starting from fresh files is definitely necessary for reproducing a subset of the endless stream of crashes in Berkeley DB.

Ideas for improving the next stages of the fuzzing process:

  • first and foremost, using a much wider input corpus. The 2-file corpus I used is enough to show that libmdbx's tolerance to offline data corruption / specially crafted files needs improvements, but there should be tests with > 1 keys / values, sub-databases, different page sizes, both endiannesses, databases after some creations and deletions of items, etc.
  • fuzzing the other CLI front-ends as well;
  • using ubsan and msan instrumentation for fuzzing (unless using UMRs is an integral part of the way libmdbx works, but I doubt it);
  • using AFL persistent mode, which often speeds up the fuzzing process;
  • using Honggfuzz (and also its persistent mode): I use it less often than AFL++, but in some of my past runs, it found some interesting testcases that AFL didn't.
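
As a starting point for the wider input corpus suggested in the first item, here is a minimal C sketch that generates several key/value pairs in the same paired-line text form as the `echo -e "key1\nvalue1"` command used earlier. The function name `format_pairs` and the `keyN`/`valueN` naming are illustrative assumptions, not part of libmdbx or its tools.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical corpus helper (not part of libmdbx): writes `count` key/value
 * pairs as paired lines ("key1\nvalue1\nkey2\nvalue2\n...") into `buf`.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or 0 if nothing was written or the buffer is too small. */
static size_t format_pairs(char *buf, size_t cap, unsigned count) {
  size_t used = 0;
  if (cap == 0)
    return 0;
  buf[0] = '\0';
  for (unsigned i = 1; i <= count; ++i) {
    int n = snprintf(buf + used, cap - used, "key%u\nvalue%u\n", i, i);
    if (n < 0 || (size_t)n >= cap - used)
      return 0; /* output would have been truncated */
    used += (size_t)n;
  }
  return used;
}
```

The resulting text could be written to stdout and piped into `mdbx_load -T -n` as in the commands above. Sub-databases, other page sizes and databases with deletions would need additional steps that are not shown here.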

Looking forward to the fixes which will make libmdbx even more production ready ;)

mdbx_chk_asan_crashes_20210707.tar.gz

@debrouxl debrouxl commented Jul 7, 2021

A couple hours of fuzzing the msan-instrumented version, which runs about an order of magnitude slower than the asan-instrumented version, produced 42 "unique" crashes which all reduce to the same location as most other testcases which also trip the asan-instrumented version: mdbx_page_get_ex at .../libmdbx/src/core.c:13043:7.

             american fuzzy lop ++3.14a (master_msan) [fast] {0}
┌─ process timing ────────────────────────────────────┬─ overall results ────┐
│        run time : 0 days, 2 hrs, 0 min, 36 sec      │  cycles done : 9     │
│   last new path : 0 days, 0 hrs, 16 min, 54 sec     │  total paths : 243   │
│ last uniq crash : 0 days, 0 hrs, 0 min, 56 sec      │ uniq crashes : 42    │
│  last uniq hang : none seen yet                     │   uniq hangs : 0     │
├─ cycle progress ─────────────────────┬─ map coverage┴──────────────────────┤
│  now processing : 3*1 (1.2%)         │    map density : 5.97% / 9.88%      │
│ paths timed out : 0 (0.00%)          │ count coverage : 1.65 bits/tuple    │
├─ stage progress ─────────────────────┼─ findings in depth ─────────────────┤
│  now trying : havoc                  │ favored paths : 81 (33.33%)         │
│ stage execs : 38/176 (21.59%)        │  new edges on : 122 (50.21%)        │
│ total execs : 221k                   │ total crashes : 2778 (42 unique)    │
│  exec speed : 29.22/sec (slow!)      │  total tmouts : 0 (0 unique)        │
├─ fuzzing strategy yields ────────────┴─────────────┬─ path geometry ───────┤
│   bit flips : disabled (default, enable with -D)   │    levels : 3         │
│  byte flips : disabled (default, enable with -D)   │   pending : 32        │
│ arithmetics : disabled (default, enable with -D)   │  pend fav : 0         │
│  known ints : disabled (default, enable with -D)   │ own finds : 9         │
│  dictionary : n/a                                  │  imported : 229       │
│havoc/splice : 5/62.5k, 46/154k                     │ stability : 99.76%    │
│py/custom/rq : unused, unused, unused, unused       ├───────────────────────┘
│    trim/eff : disabled, disabled                   │          [cpu000: 83%]
└────────────────────────────────────────────────────┘

@debrouxl debrouxl commented Jul 7, 2021

Triaging the testcases flagged as hangs by afl-fuzz (significant CPU time outliers) shows runaway CPU and memory consumption. mdbx_chk commits allocated memory at a rate of gigabytes per second, until the Linux OOM killer kicks in to save the system. libmdbx is not the first code base displaying such behaviour on my computer when dealing with corrupted input.

Jul  7 22:55:34 hostname kernel: [801454.621961] Out of memory: Killed process 537258 (mdbx_chk) total-vm:24830357632kB, anon-rss:12189300kB, file-rss:2012kB, shmem-rss:0kB, UID:1000 pgtables:24356kB oom_score_adj:0
Jul  7 22:55:39 hostname kernel: [801459.841971] Out of memory: Killed process 537262 (mdbx_chk) total-vm:22548656256kB, anon-rss:12195952kB, file-rss:2016kB, shmem-rss:4kB, UID:1000 pgtables:24384kB oom_score_adj:0
Jul  7 22:55:49 hostname kernel: [801469.479816] Out of memory: Killed process 537271 (mdbx_chk) total-vm:21609371776kB, anon-rss:12184332kB, file-rss:2028kB, shmem-rss:4kB, UID:1000 pgtables:24336kB oom_score_adj:0
Jul  7 22:55:54 hostname kernel: [801474.487720] Out of memory: Killed process 537274 (mdbx_chk) total-vm:21474916012kB, anon-rss:12190908kB, file-rss:1960kB, shmem-rss:16kB, UID:1000 pgtables:24408kB oom_score_adj:0
Jul  7 22:56:03 hostname kernel: [801483.723544] Out of memory: Killed process 537282 (mdbx_chk) total-vm:25769881728kB, anon-rss:12194924kB, file-rss:2120kB, shmem-rss:4kB, UID:1000 pgtables:24392kB oom_score_adj:0
Jul  7 22:56:09 hostname kernel: [801489.754089] Out of memory: Killed process 537285 (mdbx_chk) total-vm:21474916012kB, anon-rss:12187120kB, file-rss:2080kB, shmem-rss:16kB, UID:1000 pgtables:24376kB oom_score_adj:0
Jul  7 22:56:15 hostname kernel: [801496.026316] Out of memory: Killed process 537291 (mdbx_chk) total-vm:21474916012kB, anon-rss:12190812kB, file-rss:2080kB, shmem-rss:16kB, UID:1000 pgtables:24376kB oom_score_adj:0
Jul  7 22:56:24 hostname kernel: [801504.593456] Out of memory: Killed process 537305 (mdbx_chk) total-vm:29997740160kB, anon-rss:12192060kB, file-rss:1976kB, shmem-rss:4kB, UID:1000 pgtables:24360kB oom_score_adj:0
Jul  7 22:56:29 hostname kernel: [801509.835272] Out of memory: Killed process 537308 (mdbx_chk) total-vm:24439282992kB, anon-rss:12191236kB, file-rss:2036kB, shmem-rss:4kB, UID:1000 pgtables:24360kB oom_score_adj:0
Jul  7 22:56:35 hostname kernel: [801515.190677] Out of memory: Killed process 537311 (mdbx_chk) total-vm:22011785344kB, anon-rss:12196192kB, file-rss:2036kB, shmem-rss:0kB, UID:1000 pgtables:24352kB oom_score_adj:0

The 20+ TB total-vm figures are simply (though partially) a consequence of the way AddressSanitizer works, but the anon-rss figures are caused by mdbx_chk's operation.

mdbx_chk_asan_hangs_20210707.tar.gz - these files can perform partial DoS on most computers, you have been warned ;)

@debrouxl debrouxl changed the title Crashes when dealing with corrupted databases Crashes and excessive memory + CPU consumption when dealing with corrupted databases Jul 7, 2021

@erthink erthink commented Jul 7, 2021

Thank you for your attention and for reporting this.

From the first part of the traces I found that a check was missing in at least one place, which is why a damaged meta-page is not skipped and, as a result, subsequent failures occur.

I hope to provide a fix tomorrow during the day.

@erthink erthink self-assigned this Jul 7, 2021
@erthink erthink added the bug label Jul 7, 2021
erthink added a commit that referenced this issue Jul 10, 2021
erthink added a commit that referenced this issue Jul 10, 2021
erthink added a commit that referenced this issue Jul 10, 2021
…is 100 larger than RAM.

More for second case of #217
erthink added a commit that referenced this issue Jul 10, 2021

@debrouxl debrouxl commented Jul 13, 2021

I've taken the current contents of the devel branch, fb78c5f , for another short test drive, restarting from the previous fuzzing output directory - which means that my input corpus still sucks. I only fuzzed the ASAN-instrumented code.
You're on the right track: your changes have decreased the crash rate by over an order of magnitude: less than 1K crashes in over 3.5M execs :)
mdbx_chk_asan_crashes_20210713.tar.gz contains samples and stack traces. The restart phase shows a nullptr deref of write type.

Note that some of these samples trigger the production of 100+ lines of error output by mdbx_chk, and I've seen some other malformed samples produce several thousand lines of error output, which is a weak form of DoS. You may want to add some automatic trimming of messages beyond some count, possibly with a final summary, and I guess with a way to disable said trimming.
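
The trimming suggested above could look roughly like the following C sketch. The names `msg_limiter`, `limited_error` and `limiter_summary` are hypothetical and do not exist in mdbx_chk; a limit of 0 stands for "trimming disabled".

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper, not part of mdbx_chk: caps the number of error
 * messages printed. A limit of 0 disables trimming. */
struct msg_limiter {
  unsigned limit;      /* maximum number of messages to print, 0 = unlimited */
  unsigned printed;    /* messages actually printed */
  unsigned suppressed; /* messages dropped after the limit was reached */
};

/* Prints the message unless the limit has been reached.
 * Returns 1 if the message was printed and 0 if it was suppressed. */
static int limited_error(struct msg_limiter *lim, FILE *out,
                         const char *fmt, ...) {
  if (lim->limit != 0 && lim->printed >= lim->limit) {
    lim->suppressed++;
    return 0;
  }
  va_list ap;
  va_start(ap, fmt);
  vfprintf(out, fmt, ap);
  va_end(ap);
  lim->printed++;
  return 1;
}

/* Prints a final one-line summary if any message was suppressed. */
static void limiter_summary(const struct msg_limiter *lim, FILE *out) {
  if (lim->suppressed != 0)
    fprintf(out, "... %u more error message(s) suppressed\n",
            lim->suppressed);
}
```

Counting suppressed messages instead of silently dropping them keeps the total visible in the final summary, so a user can tell a mildly damaged file from a heavily damaged one.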

erthink added a commit that referenced this issue Jul 14, 2021
erthink added a commit that referenced this issue Jul 14, 2021
…s corrupted.

Hope final for #217

@erthink erthink commented Jul 14, 2021

Once again, thank you for your attention and for reporting this.

I have not finished the work yet, which is why there has been no new information here until now.

@erthink erthink commented Jul 14, 2021

@debrouxl, briefly, there were four problems/drawbacks/bugs:

  1. BUG: When the header is read while opening a database, all three meta-pages are read and verified, and bad ones are skipped.
    However, invalid meta-pages were not then cleared/purged, so one could still be used if it contained a greater transaction number (i.e. information about a more recent transaction).
    This was the main problem and the reason for the vast majority of subsequent failures.
  2. DRAWBACK: A fuzzed meta-page can still be valid/correct, but contain a very large upper limit for the database size.
    MDBX maps the whole DB into memory, so such a huge database requires a large address space region (and a correspondingly large number of PTEs) regardless of the current amount of data. This is enough to cause large memory consumption and trigger the OOM killer.
    In ASAN builds the situation was aggravated, since memory is also required for the bitmask of the memory state, plus the CPU cost of handling it. This was the cause of the stalls and hangs.
  3. DRAWBACK: Historically, the mdbx_chk utility checks database data in two stages: it traverses the page tree to check the correctness of its structure, and then iterates over the records.
    Thus, any corruption is detected at the first stage, and at the second stage you can see how this damage affects operation (for example, whether there will be a SIGSEGV or other problems). However, such crashes with SIGSEGV or ASAN errors are not the best behavior for a database validation utility.
    So for now I have simply disabled the iteration of records if problems were found in the page tree.
    Next I will add command-line options to enable the previous behavior.
    Nonetheless, I should warn that with a corresponding corruption the library will still crash, as mdbx_chk did before, but it is not rational to fix this, since:
    • it would lead to a lot of overhead;
    • such checks do not guarantee the integrity of the data (but MithrilDB will have Merkle-tree-based verification);
    • this is not considered a (new or additional) attack vector, since (in general) it means that the attacker has the ability to rewrite data in memory, and not just attack the database.
  4. REGRESSION/BUG: While fixing the first problem, an error was made: the buffer was accessed before it was allocated.
    This new code has not yet been fully tested, and you have helped with this.
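
The rule behind the first item (never let a meta-page that failed verification win just because it carries a greater transaction number) can be illustrated with a short C sketch. `struct meta_candidate` and `pick_meta` are hypothetical names for illustration only; they do not reflect libmdbx's actual meta-page layout or code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a meta-page after verification. */
struct meta_candidate {
  uint64_t txnid; /* transaction number recorded in the meta-page */
  int valid;      /* non-zero if the page passed verification */
};

/* Returns the index of the valid candidate with the greatest txnid,
 * or -1 if no candidate is valid. Invalid candidates are ignored even
 * when their txnid is greater than that of every valid one. */
static int pick_meta(const struct meta_candidate *metas, size_t count) {
  int best = -1;
  for (size_t i = 0; i < count; ++i) {
    if (!metas[i].valid)
      continue;
    if (best < 0 || metas[i].txnid > metas[(size_t)best].txnid)
      best = (int)i;
  }
  return best;
}
```

With candidates `{5, valid}`, `{9, invalid}` and `{7, valid}`, the sketch selects the third one: the corrupted page with transaction number 9 is never considered.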

@debrouxl debrouxl commented Jul 14, 2021

You're right, full iteration of records makes less sense when the page tree is broken. Although high overhead would be tolerable for default-disabled checks, I tend to agree that attempting to make full traversal of a broken database memory-safe is lower priority :)

Earlier this morning, I rebuilt the code and started another instance with HEAD @ 4fc6d67 . The crash rate is still a bit over 1 per 2K execs, with two different stack traces so far, both use after poison. I'll let the fuzzer run for a longer time before posting samples.
Before that, I managed to produce databases with different page size by editing the output of mdbx_dump before feeding it to mdbx_load, which makes the input corpus slightly less bad. My first suggestion from #218 remains valid.
Next time, before starting a new fuzzer instance, I guess I'll throw some DBs with sub-databases in the mix.

@debrouxl debrouxl commented Jul 14, 2021

On the newest mdbx_chk fuzzing job, barely more than three quarters of the paths are considered examined, but it's been over 3h since the latest so-called 'unique' crash was found, so it's time to post something :)
The crash rate has remained approximately constant throughout the past nearly 14h, the 10K crashes mark is about to be crossed while the 20M execs mark was just crossed.

I have packed up all of the files which used to crash, or still crash, mdbx_chk @ 4fc6d67 : mdbx_chk_asan_crashes_20210714.tar.gz
In the output, there are three different source locations for crashes:

ERROR: AddressSanitizer: use-after-poison on address 0x7fc849f5e0bc at pc 0x000000586821 bp 0x7ffe10260d50 sp 0x7ffe10260d48
WRITE of size 4 at 0x7fc849f5e0bc thread T0
    #0 0x586820 in atomic_store32 .../libmdbx/src/internals.h:287:3
    #1 0x586820 in mdbx_setup_dxb .../libmdbx/src/core.c:11834:3
    #2 0x57100d in mdbx_env_open .../libmdbx/src/core.c:12595:22
    #3 0x4ccc78 in main .../libmdbx/src/mdbx_chk.c:1244:10
    #4 0x7fc849c1ad09 in __libc_start_main csu/../csu/libc-start.c:308:16
    #5 0x420669 in _start (.../libmdbx/mdbx_chk+0x420669)

Address 0x7fc849f5e0bc is a wild pointer.
SUMMARY: AddressSanitizer: use-after-poison .../libmdbx/src/internals.h:287:3 in atomic_store32

ERROR: AddressSanitizer: use-after-poison on address 0x7f6a113f20b8 at pc 0x00000064bc10 bp 0x7ffde3aaf2b0 sp 0x7ffde3aaf2a8
READ of size 4 at 0x7f6a113f20b8 thread T0
    #0 0x64bc0f in mdbx_lck_destroy .../libmdbx/src/lck-posix.c:519:51
    #1 0x58f884 in lcklist_detach_locked .../libmdbx/src/core.c:1617:10
    #2 0x58f884 in mdbx_env_close0 .../libmdbx/src/core.c:12708:18
    #3 0x570363 in mdbx_env_open .../libmdbx/src/core.c:12682:10
    #4 0x4ccc78 in main .../libmdbx/src/mdbx_chk.c:1244:10
    #5 0x7f6a110aed09 in __libc_start_main csu/../csu/libc-start.c:308:16
    #6 0x420669 in _start (.../libmdbx/mdbx_chk+0x420669)

Address 0x7f6a113f20b8 is a wild pointer.
SUMMARY: AddressSanitizer: use-after-poison .../libmdbx/src/lck-posix.c:519:51 in mdbx_lck_destroy

ERROR: AddressSanitizer: BUS on unknown address 0x7fc95947f000 (pc 0x00000065caeb bp 0x7ffdc48d16d0 sp 0x7ffdc48d0980 T0)
    #0 0x65caeb in unaligned_peek_u32 .../libmdbx/src/core.c:159:5
    #1 0x65caeb in peek_pgno .../libmdbx/src/core.c:315:20
    #2 0x65caeb in node_largedata_pgno .../libmdbx/src/core.c:338:10
    #3 0x65caeb in mdbx_page_check .../libmdbx/src/core.c:17166:39
    #4 0x6898dc in mdbx_walk_tree .../libmdbx/src/core.c:20380:11
    #5 0x60bb62 in mdbx_walk_sdb .../libmdbx/src/core.c:20629:8
    #6 0x60a941 in mdbx_env_pgwalk .../libmdbx/src/core.c:20653:10
    #7 0x4d2772 in main .../libmdbx/src/mdbx_chk.c:1497:10
    #8 0x7fc95914ad09 in __libc_start_main csu/../csu/libc-start.c:308:16
    #9 0x420669 in _start (.../libmdbx/mdbx_chk+0x420669)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: BUS .../libmdbx/src/core.c:159:5 in unaligned_peek_u32
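
For triage, the `SUMMARY:` lines above are a convenient deduplication key. The following C sketch extracts the error kind, source location and function from such a line. `struct asan_summary` and `parse_asan_summary` are hypothetical names, and the parser assumes a single-word error kind, as in the three reports shown here.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical triage key taken from an AddressSanitizer SUMMARY line. */
struct asan_summary {
  char kind[64];      /* e.g. "use-after-poison" or "BUS" */
  char location[256]; /* e.g. ".../libmdbx/src/core.c:159:5" */
  char function[128]; /* e.g. "unaligned_peek_u32" */
};

/* Parses a line of the form
 *   SUMMARY: AddressSanitizer: <kind> <file:line:col> in <function>
 * Returns 1 on success and 0 if the line does not match this shape. */
static int parse_asan_summary(const char *line, struct asan_summary *out) {
  static const char prefix[] = "SUMMARY: AddressSanitizer: ";
  if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
    return 0;
  return sscanf(line + sizeof(prefix) - 1, "%63s %255s in %127s",
                out->kind, out->location, out->function) == 3;
}
```

Grouping crashing inputs by the `(kind, location)` pair gives a much smaller set than afl-fuzz's "unique" crashes, which are distinguished by execution path rather than by fault location.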

@debrouxl debrouxl commented Jul 16, 2021

No additional stack trace found since my previous message.

@erthink erthink commented Jul 16, 2021

Once again, thanks a lot for your attention and for reporting this.

erthink added a commit that referenced this issue Jul 16, 2021
Added a check that the data of the BIGDATA node (containing the target page number) is located within the boundaries of the page being checked.

The third case of #217.
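
The kind of check described in this commit message can be sketched in C as below. The function name and parameters are illustrative assumptions rather than libmdbx's actual code; the point is that the comparison is written so that it cannot overflow, even for a wildly corrupted offset.

```c
#include <stddef.h>

/* Hypothetical bounds check: returns 1 if the `data_len` bytes starting at
 * `data_offset` lie entirely within a page of `page_size` bytes, else 0.
 * Written without `data_offset + data_len`, which could wrap around. */
static int node_data_within_page(size_t page_size, size_t data_offset,
                                 size_t data_len) {
  return data_offset <= page_size && data_len <= page_size - data_offset;
}
```

For a 4096-byte page, a 4-byte page number stored at offset 4092 passes the check, while the same field at offset 4094 is rejected before any read takes place.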
erthink added a commit that referenced this issue Jul 16, 2021

@debrouxl debrouxl commented Jul 16, 2021

And thanks to you for providing a set of fixes which raise the TTFC above ~1h30 for ASAN-instrumented mdbx_chk (the 3 currently running new instances, which synchronize their queue from the old instances, have yet to produce their first crash) in ~9 days. This is quite a short interval in my experience fuzzing various pieces of FLOSS since late 2014 - even those maintained by people employed by corporations to work at least part-time on the code bases :)
Also, no "hangs" were found so far, even if afl-fuzz flagged several hundred "timeouts" among the ~6M execs so far; these might partially be due to the fact the computer effectively slowed down when I added more afl-fuzz instances: lower CPU frequency to remain within the TDP cap, even duty-cycling the processor to minimal frequency at times.

I threw an afl-fuzz ... mdbx_dump -a -r instance and an afl-fuzz ... mdbx_stat -p -e -f -r -a instance into the mix one hour later. Still enough to find that for these two (also ASAN-instrumented) programs, the TTFC is around 1 minute, effectively soon after the initial sync from the other cooperating fuzzer instances' queues starts.
I'll post samples later, when the instances have produced a bit more work... ~30' wall clock and ~2M mdbx_dump+mdbx_stat total execs aren't much. The crash ratio is below 1 per 1K execs, it's not that bad - I already saw a ~15% crash ratio.
I forgot to add DBs with sub-databases before the new runs, but they can be injected by adding an afl-fuzz ... mdbx_drop instance... I'll do that after posting this comment. EDIT: done, TTFC < 10s, crash ratio > 1% over the first 100K execs (despite over a third of these being spent on trimming DBs...).

@erthink erthink commented Jul 16, 2021

@debrouxl, thank you for your work.
You are providing significant assistance: with ready-made data about the problems, I can focus on fixing them.

However, I think that you should not fuzz anything other than mdbx_chk right now.
As I wrote earlier, it is unreasonable to add full-fledged checks when the engine/library is used in the usual way.
Therefore, crashes in all utilities except mdbx_chk are expected and will not provide any new information.

Nonetheless, a full check of the database pages used can be added as an option, i.e. on explicit user request.
This does not require many changes, but I will do it after the release of the next stable version (it's time).
After that, it will make sense to include the rest of the mdbx tools in fuzzing.

@debrouxl debrouxl commented Jul 16, 2021

You're welcome :)

Well, the help text for mdbx_dump states that its -r option aims at dumping data from corrupted databases :)
But your changes in core libmdbx code already made it more robust.

While working on corrupted databases is indeed not the main intended use case for mdbx_drop + mdbx_stat, the harder to fuzz mdbx_copy + mdbx_load, or a generic CLI front-end like the one I recently suggested in #218 (should you consider it a valuable suggestion for later implementation in libmdbx and/or its successor), BDB, GDBM and TDB attempt not to crash when dealing with broken DBs. BDB keeps failing at it, but recent GDBM (post https://puszcza.gnu.org.ua/bugs/?503 and earlier private reports) does quite a good job, and after several fixes, TDB did a good job in 2018.
Of course, the flip side of the coin is that some of these safety measures can only reduce raw speed, though I have no idea by how much... and one of the hallmarks of liblmdb / libmdbx is their raw speed. Therefore, a tradeoff needs to be drawn here, quite possibly along the lines of adding a safer mode, as you suggest.

I've killed the mdbx_dump, mdbx_stat and mdbx_drop (x2, a temporary instance used to inject another pair of testcases) fuzzing jobs. Here are the files for later investigation at your leisure: mdbx_dumpdropstat_010167_crashes_20210717.tar.gz . The three mdbx_chk instances keep running for now, still zero crash or hang after ~3h30 wall clock time.

@debrouxl debrouxl commented Jul 17, 2021

[Post significantly edited ~13h later.]
Overnight, all 3 mdbx_chk instances found crashes by ~11h wall clock time. The crash ratio is slowly rising towards 10 ppm: from ~200 crashes in nearly 50M execs this morning to ~740 crashes in slightly over 100M execs now.
In the output, I see a libmdbx source location causing SIGBUS, a libmdbx source location causing use after poison... and a bunch of internal AddressSanitizer sanity check errors!
mdbx_chk_asan_crashes_20210717_02.tar.gz: this tarball is a superset of the one I posted this morning.
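
The ppm figures quoted above follow from a simple ratio. As a small C sketch (the helper name `crash_ppm` is made up for illustration, and the inputs are the rounded counts from this comment):

```c
/* Crashes per million executions; returns 0 when no executions were made. */
static double crash_ppm(double crashes, double execs) {
  return execs > 0 ? crashes / execs * 1e6 : 0.0;
}
```

With the rounded counts, `crash_ppm(200, 50e6)` is about 4 ppm and `crash_ppm(740, 100e6)` is about 7.4 ppm, which matches a ratio that is rising towards 10 ppm.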

@debrouxl debrouxl commented Jul 18, 2021

Today's status update:

  • over 200M execs, and nearly 3000 crashes;
  • no new source locations, AFAICS.

Nevertheless, here's an updated tarball, again a superset of the previous one: mdbx_chk_asan_crashes_20210718.tar.gz .
I plan on letting the fuzzers run for several more days. Reaching 1G execs would be nice, but would take another ~8 days, so I might stop before then. Adding other instances on the same computer would only provide sub-linear scaling, due to thermal throttling.

FTR, the use-after-poison .../libmdbx/src/core.c:115:10 in peek_u8 source location flagged by fuzzing mdbx_chk also occurs in the output of mdbx_dump -a -r on samples which crash that program, and reciprocally, mdbx_dump -a -r also crashes on some of the files which crash mdbx_chk at that location. Therefore, you'll kill (at least) two birds with one stone by fixing both issues from this batch :)
The other source location flagged by fuzzing mdbx_chk, BUS .../libmdbx/src/core.c:13620:9 in mdbx_cursor_next, does not reproduce with the samples from mdbx_dump -a -r I posted previously; however, mdbx_dump -a -r shows another issue in the same function: BUS .../libmdbx/src/core.c:13677:7 in mdbx_cursor_next.

erthink added a commit that referenced this issue Jul 18, 2021
The fourth case of #217.
erthink added a commit that referenced this issue Jul 18, 2021
The fourth case of #217.
erthink added a commit that referenced this issue Jul 19, 2021

@debrouxl debrouxl commented Jul 20, 2021

More than 400M execs, and more than 4000 crashes. No new source location for crashes :)

@erthink erthink commented Jul 20, 2021

@debrouxl, I don't expect any mdbx_chk crashes from the current devel branch (since e7336e1).

@erthink erthink commented Jul 21, 2021

I am closing this issue, since all of the problems noticed have been fixed in the master branch, and the safe mode will be implemented within #223.

Please file a new issue if new bugs are found by fuzzing.

@debrouxl, thank you for your work!

@erthink erthink closed this Jul 21, 2021

@debrouxl debrouxl commented Jul 21, 2021

You're welcome :)
libmdbx's resistance to data corruption was already slightly better than liblmdb's when I started fuzzing it, it's now a class above after your fixes. For a work project, I once had liblmdb in mind, but its unsafety made it a non-starter; when libmdbx has a safe mode - and the project focus is on changing the area of the code where a project like liblmdb/libmdbx would fit... - libmdbx can become an option.

I can confirm that all of the problems noticed (on mdbx_chk, that is) have been fixed. Yesterday evening, I started a 4th instance of mdbx_chk built from the devel branch, version v0.10.1-73-g9a1dffc; ~15h later, it has found no crashes yet.
I'll open a new issue if new crashes are found in mdbx_chk.

erthink added a commit that referenced this issue Jul 26, 2021
Acknowledgements:
-----------------
 - [Alex Sharov](https://github.com/AskAlexSharov) for reporting and testing.
 - [Andrea Lanfranchi](https://github.com/AndreaLanfranchi) for reporting bugs.
 - [Lionel Debroux](https://github.com/debrouxl) for fuzzing tests and reporting bugs.
 - [Sergey Fedotov](https://github.com/SergeyFromHell/) for [`node-mdbx` NodeJS bindings](https://www.npmjs.com/package/node-mdbx).
 - [Kris Zyp](https://github.com/kriszyp) for [`lmdbx-store` NodeJS bindings](https://github.com/kriszyp/lmdbx-store).
 - [Noel Kuntze](https://github.com/Thermi) for [draft Python bindings](https://github.com/erthink/libmdbx/commits/python-bindings).

New features, extensions and improvements:
------------------------------------------
 - [Allow to predefine/override `MDBX_BUILD_TIMESTAMP` for builds reproducibility](#201).
 - Added options support for `long-stochastic` script.
 - Avoided `MDBX_TXN_FULL` error for large transactions when possible.
 - The `MDBX_READERS_LIMIT` increased to `32767`.
 - Raise `MDBX_TOO_LARGE` under Valgrind/ASAN if being opened DB is 100 larger than RAM (to avoid hangs and OOM).
 - Minimized the size of poisoned/unpoisoned regions to avoid Valgrind/ASAN stuck.
 - Added more workarounds for QEMU for testing builds for 32-bit platforms, Alpha and Sparc architectures.
 - `mdbx_chk` now skips iteration & checking of DB' records if corresponding page-tree is corrupted (to avoid `SIGSEGV`, ASAN failures, etc).
 - Added more checks for [rare/fuzzing corruption cases](#217).

Backward compatibility break:
-----------------------------
 - Use file `VERSION.txt` for version information instead of `VERSION` to avoid collision with `#include <version>`.
 - Rename `slice::from/to_FOO_bytes()` to `slice::envisage_from/to_FOO_length()`.
 - Rename `MDBX_TEST_EXTRA` make's variable to `MDBX_SMOKE_EXTRA`.
 - Some details of the C++ API have been changed for subsequent freezing.

Fixes:
------
 - Fixed excess meta-pages checks in case `mdbx_chk` is called to check the DB for a specific meta page and thus could prevent switching to the selected meta page, even if the check passed without errors.
 - Fixed [recursive use of SRW-lock on Windows caused by `MDBX_NOTLS` option](#203).
 - Fixed [log a warning during a new DB creation](#205).
 - Fixed [false-negative `mdbx_cursor_eof()` result](#207).
 - Fixed [`make install` with non-GNU `install` utility (OSX, BSD)](#208).
 - Fixed [installation by `CMake` in special cases by complete use `GNUInstallDirs`'s variables](#209).
 - Fixed [C++ Buffer issue with `std::string` and alignment](#191).
 - Fixed `safe64_reset()` for platforms without atomic 64-bit compare-and-swap.
 - Fixed hang/shutdown on big-endian platforms without `__cxa_thread_atexit()`.
 - Fixed [using bad meta-pages if DB was partially/recoverable corrupted](#217).
 - Fixed extra `noexcept` for `buffer::&assign_reference()`.
 - Fixed `bootid` generation on Windows for case of change system' time.
 - Fixed [test framework keygen-related issue](#127).