Need to cache *all* pages for session history purposes

RESOLVED FIXED in mozilla0.9

Status


Product: Core
Component: Networking: HTTP
Priority: P3
Severity: critical
Status: RESOLVED FIXED
Opened: 18 years ago
Closed: 17 years ago

People

(Reporter: Andreas M. "Clarence" Schneider, Assigned: Darin Fisher)

Tracking

Version: Trunk
Target Milestone: mozilla0.9
Keywords: dataloss, perf

(Reporter)

Description

18 years ago
When browsing in session history we should be able to display previous pages
without fetching them again from the network, regardless of whether they are
"cacheable" or not. Therefore *all* pages should be cached, if not in the disk
cache then in the memory cache, for re-use within the current session when
requested by browsing in session history.

One special case is the result of an HTTP POST (see bug 55055). In that case we
must be able to associate the cache entry with the POST data or a dedicated
session history entry.

Reasons are explained in RFC 2616 (HTTP/1.1):

--8<----

13.13 History Lists

   User agents often have history mechanisms, such as "Back" buttons and
   history lists, which can be used to redisplay an entity retrieved
   earlier in a session.

   History mechanisms and caches are different. In particular history
   mechanisms SHOULD NOT try to show a semantically transparent view of
   the current state of a resource. Rather, a history mechanism is meant
   to show exactly what the user saw at the time when the resource was
   retrieved.

   By default, an expiration time does not apply to history mechanisms.
   If the entity is still in storage, a history mechanism SHOULD display
   it even if the entity has expired, unless the user has specifically
   configured the agent to refresh expired history documents.

   This is not to be construed to prohibit the history mechanism from
   telling the user that a view might be stale.

      Note: if history list mechanisms unnecessarily prevent users from
      viewing stale resources, this will tend to force service authors
      to avoid using HTTP expiration controls and cache controls when
      they would otherwise like to. Service authors may consider it
      important that users not be presented with error messages or
      warning messages when they use navigation controls (such as BACK)
      to view previously fetched resources. Even though sometimes such
      resources ought not be cached, or ought to expire quickly, user
      interface considerations may force service authors to resort to
      other means of preventing caching (e.g. "once-only" URLs) in order
      not to suffer the effects of improperly functioning history
      mechanisms.

--8<----
(Reporter)

Updated

18 years ago
Blocks: 55055

Comment 1

18 years ago
Since this bug blocks bug 55055 (per Radha) which is nominated for 6.01, I'm
confirming this bug and nominating it also. Sounds like it's too late for RTM...
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: ns601

Comment 2

18 years ago
There is a related bug about Save As and such.

Updated

18 years ago
Severity: normal → critical
Keywords: dataloss
Setting severity to critical.  This can cause data loss or even monetary loss.  See bug
57880 for details.

Comment 4

18 years ago
Hmmm. This bug looks like it's associated with bug 40867, which talks about
needing a mechanism for storing pages for the purposes of Save As and View
Source, which also re-load data. (Could it be a dupe of this one?)

Comment 5

18 years ago
Yes, this is the bug I meant. Adding as blocker.
Depends on: 40867

Updated

18 years ago
Keywords: perf
Nominating for the next release like 55055 (which depends on this one)
Keywords: ns601 → highrisk, mozilla0.9
Keywords: highrisk

Comment 7

17 years ago
cc to self

Comment 8

17 years ago
a "can-do" for new cache. marking till that lands. 
Target Milestone: --- → mozilla0.9

Comment 9

17 years ago
> a "can-do" for new cache.

Please note that this bug blocks an IMO important [nsbeta1+] bug.
shouldn't we then make this a "blocker" severity? this bug is blocking 55055, a
highly visible, dataloss, nsbeta1+ bug.
Severity: critical → blocker

Comment 11

17 years ago
This bug requires major architecture changes to the current cache. This will be 
resolved when the code for the new cache lands.
After discussing with gordon and neeti, marking this a dupe of 20843. 55055, which
is dependent on this, will be re-wired to 20843.
Actually making it a dupe

*** This bug has been marked as a duplicate of 20843 ***
Status: NEW → RESOLVED
Last Resolved: 17 years ago
Resolution: --- → DUPLICATE
No longer blocks: 55055
(Reporter)

Comment 14

17 years ago
Bug 20843 covers only caching of HTTP form POSTs.
This bug is about a "history mechanism" (as defined in RFC 2616), which allows
even expired pages to be re-displayed in their original state when browsing in
session history.

Thus reopening this bug, but resetting severity, target milestone and keywords
for reevaluation (assuming 20843 will cover the most important issue with form
POSTs); adding dependency on bug 20843.
Severity: blocker → normal
Status: RESOLVED → REOPENED
Depends on: 20843
Keywords: dataloss, mozilla0.9, perf
Resolution: DUPLICATE → ---
Target Milestone: mozilla0.9 → ---
(Reporter)

Updated

17 years ago
No longer depends on: 20843

Comment 15

17 years ago
RFC 2616 13.13 doesn't REQUIRE we store all pages.  It says "If the entity is 
still in storage.." we should display it, even if it has expired.

That is a separate issue from the fact that POST data is not currently 
incorporated as part of the key used for cache entries (covered by bug 20843).

And both are separate issues from enabling the use of the memory cache for non-
cacheable items (which is covered by bug 66482, among others).

Comment 16

17 years ago
Re-adding keywords; they still apply, though with less severity.

> separate issues from enabling the use of the memory cache for non-
> cacheable items (which is covered by bug 66482, among others).

If bug 66482 really covers non-cacheable items (there's no mention of it, so I'd
assume it's only about cacheable ones), what is left for this bug?

> RFC 2616 13.13 doesn't REQUIRE we store all pages. It says "If the entity
> is still in storage.."

It also says:
> a history mechanism is meant to show exactly what the user saw at the time
> when the resource was retrieved.

I guess the if clause was added so UAs don't have to cache *everything*
(possibly hundreds of pages) in the history.
Severity: normal → critical
Keywords: dataloss, mozilla0.9, perf

Comment 18

17 years ago
Cache bugs to Gordon
Assignee: neeti → gordon
Status: REOPENED → NEW
Target Milestone: --- → mozilla0.9

Updated

17 years ago
Blocks: 55055
No longer blocks: 55055

Comment 19

17 years ago
It's really up to the cache client to decide what to store in the cache.  I'll 
let HTTP take the first crack at it.  To Darin, with love...
Assignee: gordon → darin
Component: Networking: Cache → Networking: HTTP
(Assignee)

Comment 20

17 years ago
I'm thinking that all no-cache responses can be handled using the cacheKey
attribute from nsICachingChannel.  we just need to add a load flag on the
HTTP channel to indicate that caching should be done even if the protocol
would not consider the response valid for future requests.

these responses would be given a special cache key that would not equal the
URL.  this would mean that only clients with a reference to the cacheKey,
provided by HTTP, would be able to recover these responses from the cache.

radha: this would work just like it does for POST transactions.
darin: From your comments above, this is what I understand docshell should do:

1) When a page with "no-cache" attribute is loaded for the very first time,  
docshell should verify that and  set a special load attribute on the channel, 
before handing it off to necko for loading. (I'm presuming you would tell me 
what that attibute is)

2) In OnStartRequest(), while saving the url in SH, I would also save the cache 
key for it, (just like I do for postdata results)

3) when the user goes back/forward to a page that had a "no-cache" setting, I 
would restore the cache key on the channel before handing it off to necko again.

Does this look right?
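
A minimal sketch of the flow described in the last two comments, assuming the
nsICachingChannel cacheKey attribute darin mentions above; the helper functions
and where docshell keeps the saved key are illustrative, not existing APIs.

#include "nsCOMPtr.h"
#include "nsIChannel.h"
#include "nsICachingChannel.h"

// Step 2: in OnStartRequest, remember the cache key alongside the URL that
// session history already records (just as is done for POST data results).
void SaveCacheKey(nsIRequest *aRequest, nsCOMPtr<nsISupports> &aSavedKey)
{
    nsCOMPtr<nsICachingChannel> caching(do_QueryInterface(aRequest));
    if (caching)
        caching->GetCacheKey(getter_AddRefs(aSavedKey));
}

// Step 3: before handing the channel back to necko for a Back/Forward load,
// restore the saved key so HTTP can find the original response in the cache.
void RestoreCacheKey(nsIChannel *aChannel, nsISupports *aSavedKey)
{
    nsCOMPtr<nsICachingChannel> caching(do_QueryInterface(aChannel));
    if (caching && aSavedKey)
        caching->SetCacheKey(aSavedKey);
}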
(Assignee)

Comment 22

17 years ago
that's almost exactly it, except that instead of docshell having to check for
documents with the "no-cache" tag, docshell would simply set a policy on the
channel telling it to go ahead and cache such pages (that it normally would
not cache).  the only necessary thing missing from our api is this flag.
(Assignee)

Comment 23

17 years ago
radha: this bug is assigned to me so that i can add this flag... once i've added
it, i'll reassign the bug to you so you can add the necessary docshell/history
support.

Comment 24

17 years ago
Shouldn't the existing VALIDATE_NEVER on nsIChannel satisfy this? The way I 
see it, when docshell first requests a document it sends the standard set of 
cache flags (the default being LOAD_NORMAL). HTTP then caches it either in memory 
or on disk (but it ensures that it's available somewhere, even with a no-cache 
response). Then if the user hits the back button, docshell adds VALIDATE_NEVER to 
the request, which tells HTTP to simply return the document as it is without 
revalidating it. 

I think there are more benefits this way. If we were to create a new flag then it 
would be hard for us to detect when the window goes away (especially if I have 
multiple windows open and close one of them) and that it's OK for us to throw 
something out of the cache. With the suggestion I am making this would get 
handled just like the rest of the cached objects, and we won't need a special way 
to clean out these "held for the rest of the session" objects. 

The same should work for POST results without much change from either SH or 
HTTP. 
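
A minimal sketch of what that would look like at the call site. VALIDATE_NEVER
and LOAD_FROM_CACHE are existing necko load-flag constants (exposed on
nsIRequest in later trees); whether they are sufficient on their own is exactly
what is being debated here.

#include "nsIChannel.h"
#include "nsIRequest.h"

// History (Back/Forward) load: ask HTTP to return whatever copy it has cached
// without revalidating it against the server. A first visit keeps the default
// flags (LOAD_NORMAL).
void MarkHistoryLoad(nsIChannel *aChannel)
{
    nsLoadFlags flags = nsIRequest::LOAD_NORMAL;
    aChannel->GetLoadFlags(&flags);
    aChannel->SetLoadFlags(flags |
                           nsIRequest::VALIDATE_NEVER |
                           nsIRequest::LOAD_FROM_CACHE);
}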
(Assignee)

Comment 25

17 years ago
ok, i've discussed this with gagan and gordon, and we've come to the conclusion
that HTTP should just put everything in the cache.  later on, i'd like to make
this more customizable, since some clients may not wish to implement session
history.  so this means, that the flag i was talking about earlier will not be
necessary at this time.

i'll submit a patch to enable caching of *all* documents.
Status: NEW → ASSIGNED

Comment 26

17 years ago
Gagan: How would that work with two different windows that had the same page in
their histories, but from different points in time when the content was
different?  When each window goes back, it should see only the version it had
originally displayed, regardless of other cached versions or changes in the live
page...

Comment 27

17 years ago
IFF the pages (with identical URLs) were generated by standard GET hits and had 
no cache directives, the data for the older version would be replaced by the 
newer version. So going back in history on either of the windows will only show 
you the latest version of the pages. 

If this is not desirable then we'd have to figure out a way to keep these pages 
unique across other windows. A way to do that would be to associate a window-id 
with each of the cache-ids-- which if you think about it, will break basic 
caching behaviour. That is, pages being loaded in a newer window may not be able 
to reuse the cached version from an existing window. 

IMHO, for this handful of multiple-window cases, having the content replaced 
(and the older cached version lost) is a better deal than this. 

Comment 28

17 years ago
Reread that quote from RFC 2616 in this bug's description: "History mechanisms
and caches are different. In particular history mechanisms SHOULD NOT try to
show a semantically transparent view of the current state of a resource. Rather,
a history mechanism is meant to show exactly what the user saw at the time when
the resource was retrieved."

Note that this says nothing about cache directives or which method was used to
retrieve the resource.  It flatly states that it's meant to show "exactly what
the user saw", not something similar that's more convenient for the application
to display.  To go back and show a newer copy of the page than was originally
shown in that window violates the RFC guidelines AND the user's reasonable
expectations -- why would we want that to happen?

Comment 29

17 years ago
Deven: I understand the spec's recommendation-- the real problem is that our 
history implementation uses the cache. Which means that any changes happening to 
the cache are reflected in the history, unless we keep unique (in time) copies 
of pages.

It should be possible to use a window id to create an in-between solution of 
keeping unique pages unique wrt window ids. 

darin/gordon/et al what say you? 
(Assignee)

Comment 30

17 years ago
deven: while i understand your point, i don't think we should try to solve the
problem of history completely.  for MOST dynamic content, such as the results
from a form POST or the response from a '?' GET, we'll be storing separate
snapshots of the content so that history works as expected.  For all other
content, we don't plan on keeping historical snapshots... the thinking being
that this would be inefficient.  anyways, it is already the case that content
referenced by history may expire from the cache.  for pages with the same URL,
we'll just be expiring the older page sooner than what a LRU based algorithm
would prescribe.

forcibly keeping all historical content in the cache for the lifetime of the
browser sessions just seems to me like a recipe for bloat... note: we use the
MEMORY cache for HTTPS, so it's not like we are just talking about using more
or less disk space. 

Comment 31

17 years ago
I want the same thing that deven wants, but I understand how a perfect solution
would be inefficient.  I really don't think that covering POST and GET-with-? is
a broad enough middle-solution, though.  I was thinking that there might be a
way to do slightly more without sacrificing too much:  what if we treated the
history operations differently, and only stored snapshots for those locations
that are reachable by the BACK button on some window -- that set should be much
smaller than all locations in the history, and I could live with the FORWARD
button behaving differently.  
(Assignee)

Comment 32

17 years ago
actually, the default maximum number of times you can go back is 50... users
may set this to be whatever they like.  are you sure you want us to forcibly
keep around 50 full pages of content?  instead, we're aiming toward an impl
that uses the cache's eviction mechanism to control the number of full pages
"recallable" via history.  IMO this is the right solution.

Comment 33

17 years ago
It makes sense to set a tunable upper bound.
(Reporter)

Comment 34

17 years ago
I think we should try to solve the problem of history completely.
But for 1.0 the proposed solution is IMHO enough.

In the long term we could keep "uncacheable" cache entries unique per request and
limit them to a maximum (i.e. expire them more quickly than other pages).
I think a total of about 1 MB for such entries should be enough. Normally, only a
small part of that would be in memory. So it wouldn't introduce more bloat than
it is worth.

Comment 35

17 years ago
Gagan: You have to keep unique (in time) copies of form POST results; why should
GET requests be any different?  Fetching updated data from the cache is wrong,
especially for operations such as "Save As" and "View Source" when the page is
still being displayed.

Darin: I disagree -- I think we SHOULD be trying to solve the problem of history
completely.  For 1.0, not someday in the vague future.  Sacrificing correctness
for efficiency isn't acceptable, at least not on this scale.  We're not talking
about the difference between 99.99% and 100%, it's more like the difference
between 95% and 100%.  Also, I've seen no evidence that handling history
correctly would NECESSARILY be less efficient.  More complex?  Probably. 
B-Trees are much more complex than simple binary trees, yet they are
nevertheless far more efficient.  Complexity doesn't imply inefficiency.

I believe the heart of the problem lies in the mistaken assumption that the
cache is an appropriate place to store history information.  It isn't, and a
host of problems have resulted from the mistake of using the cache as a history
mechanism, including problems with form POST results, GET queries with ?, View
Source and Save As returning incorrect data, etc.  Worse yet, the loss of
correctness isn't negligible; it violates the user's expectations in a severe
way, and for that reason it isn't acceptable -- no matter how convenient it may
be or how many other implementations may do the same thing.

This is an architectural problem, and the right solution will be architectural
in nature -- continually manipulating details of how the cache works will never
provide a satisfactory solution, because a LRU cache is inappropriate for
storing history information in the first place.  I pointed this out months ago
(on 2000-12-28) in bug 40867, and even offered to try to attack the problem
myself (in what little time I have), but I was told that Gordon was working on a
solution so I left it alone.

There's no reason why a proper history mechanism must translate into "bloat". 
First, calling it "bloat" is mischaracterizing it in the first place; it's saved
data, not an inadvertent waste of memory.  (People don't call Linux "bloated"
for using all available memory for disk caching.)  Also, it's not inherently
necessary to keep history content in memory; there's no reason it couldn't be
offloaded to disk.  And limits on the total amount of memory and disk space used
would also be feasible -- as user-tunable policy mechanisms, not accidents of
implementation.  The only history content that should be _completely_ inviolate
is that associated with a page that is CURRENTLY being displayed in one window
or another.  (That guarantees that "Save As" and "View Source" will work -- the
user can always close that window if they're short of memory.)

Basically, as I first mentioned last year, I believe the history mechanism
should be independent of the cache mechanism instead of the current kludge of
pretending that the cache is an acceptable substitute for a proper history
mechanism.  This is critical to meeting the actual needs of the users, which is
what the application exists for in the first place...
I think the simplest complete solution is a weak reference scheme:

Every history item can have a "reference" to its cached page via a unique ID or 
even via a pointer.  When the user hits a page (or maybe even images within a 
page, if one is so inclined--but that would be considerably harder, involving 
the changing of the page structure), the browser, as always, first searches the 
cache.  If it finds the page, it gets another reference to it (increasing the 
reference count).  If it does not, the browser goes to the network, gets a new 
page, caches it, and gets a reference to the page in the cache.

Then when the browser wants to pull something from the cache, it goes through 
the pointer (or calls the cache using the unique ID).  Multiple places could 
reference the same cache item (two gets that both hit the cache for their data).

When a history item is cleared (i.e. the browser closes or the user goes to a 
URL in the middle of history), the item is dereferenced.

The cache, for its part, guarantees not to remove any item from the cache that 
is referenced.
Deven: I think the reason the cache is used for history is so that you only 
have one copy of the page around, and so that other browsers can search for the 
page (which will certainly use the cache).

I say just modify the cache from LRU to be LRUNRBHI (Least Recently Used page 
that is Not Referenced By a History Item).
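
A toy, self-contained model of the LRUNRBHI idea; every name here is
hypothetical, not a Necko API. Eviction walks from the least recently used end
and skips anything a history item still references, and holders of the
shared_ptr keep a removed entry alive until they release it.

#include <cstddef>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>

// Each entry carries a count of history items that still reference it.
struct Entry {
    std::string url;
    std::string body;
    int historyRefs = 0;
};

class LruNrbhiCache {
    std::list<std::shared_ptr<Entry>> lru_;   // front = most recently used
    std::unordered_map<std::string,
                       std::list<std::shared_ptr<Entry>>::iterator> byUrl_;
    std::size_t capacity_;

public:
    explicit LruNrbhiCache(std::size_t capacity) : capacity_(capacity) {}

    std::shared_ptr<Entry> put(const std::string &url, std::string body) {
        auto entry = std::make_shared<Entry>();
        entry->url = url;
        entry->body = std::move(body);
        lru_.push_front(entry);
        byUrl_[url] = lru_.begin();            // the newest copy wins URL lookups
        evictIfNeeded();
        return entry;                          // history items bump historyRefs
    }

    std::shared_ptr<Entry> lookupByUrl(const std::string &url) const {
        auto it = byUrl_.find(url);
        return it == byUrl_.end() ? nullptr : *it->second;
    }

private:
    // Evict least-recently-used entries whose historyRefs count is zero.
    void evictIfNeeded() {
        if (lru_.size() <= capacity_) return;
        for (auto it = std::prev(lru_.end()); lru_.size() > capacity_; ) {
            bool atFront = (it == lru_.begin());
            auto cur = it;
            if (!atFront) --it;                // step back before erasing cur
            if ((*cur)->historyRefs == 0) {
                auto mapIt = byUrl_.find((*cur)->url);
                if (mapIt != byUrl_.end() && mapIt->second == cur)
                    byUrl_.erase(mapIt);       // only drop the slot cur owns
                lru_.erase(cur);
            }
            if (atFront) break;
        }
    }
};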
(Assignee)

Comment 38

17 years ago
John: your solution _is_ something we considered doing, and we do have a 
mechanism for ``pinning'' entries in the cache with a reference count.  keeping
hard references to cache entries allows us to keep them ``alive'' even when 
collisions occur.  the cache (really a data store with eviction capabilities)
allows entries to exist in the cache in a doomed state and still be completely
valid from the point-of-view of those owning the hard references.  a doomed
entry is deleted once its reference count goes to zero (as expected).

but invoking this feature of the cache for all pages ever visited during a
session up to the last N pages was deemed excessive.  deven may disagree, but
we felt that this would solve the bigger problems such as printing and save as
for content recently visited.

history as we've defined it is really just a list of URLs visited, with form
POST data and some other state remembered.  as a web surfer, i've never found
this to be insufficient.  i mainly use history to go back to revisit an URL 
without having to type it in.  i suspect this is the case for most users.  they
care about the URL, and couldn't care less if the content of the page is not 
exactly as it was before.  so long as it is accurate now, i think most users
will be more than content.
Most Internet users do not care if history goes back to the original page ... 
but Intranet users, especially users of web applications that require 
complicated, multi-page editing and saving of data, will care.

How about this: since it doesn't look that hard to implement (hard references 
already implemented in cache), put the capability in, disable it by default, 
but make it a Preference (checkbox in Advanced > Cache: "Keep history pages in 
cache"--or maybe even "Keep up to n history pages in cache").  This would make 
web applications work real nice but still make normal Internet surfing viable 
in terms of memory.

This would totally avoid the window ID kludge, allow caching to work 100% the 
way everyone wants it, and still wouldn't be awful hard to do.

Comment 40

17 years ago
Enabling multi-page web apps involves a much wider array of issues, which cannot 
be solved simply by changing the caching policy.  With the new cache we have ways 
of avoiding collisions with POST results and Get queries with ?, and we have a 
mechanism (yet to be taken advantage of) to hold cache entries for the currently 
viewed page to get more accurate Printing, Save As..., and View Source.   It seems 
like this should go a long way in supporting the kind of Intranet web application 
you're talking about, John.

We're very interested in providing more support for multi-page web apps, but that 
will involve work in many areas, and if changes need to be made to http and the 
cache to support that work, they will be made.

You are correct, John, that the reason the history uses the cache is to avoid 
storing multiple copies of the same document (which would be the case for most 
GETs).  For history to store and manage a separate copy of every 
page (and associated documents) would be very wasteful.  Whether that space is in 
memory or on disk is not the issue.

On a more general note: I believe RFC 2616 is a spec for HTTP/1.1.  I don't 
believe it is a spec for how user agents MUST implement history mechanisms, but 
rather a warning to http implementers of what to expect from user agents.  That 
said, I think we are following the spirit of section 13.13: we certainly aren't 
trying "to show a semantically transparent view of the current state of a 
resource".  However, if our copy of the resource has been evicted (for whatever 
reason), we refetch it from the net, warning the user in the case of POSTs 
because of potential side effects.

Now, bugs in our current validation code are another matter... :-)
As long as we accept as a sad fact of life that we can't keep all the pages 
around (since the user may go to hundreds of pages in a single session), LRU is 
a decent approximation of "give me the last n pages"; even though it will not 
work per-window and will give priority to pages that are recently viewed but 
cleared from history.  We *could* implement a "priority" scheme in cache that 
would help, but it might not be worth it.

On to the issue of GET versus POST and '?' GET requests.  If I understand you 
correctly, you are saying that, for POST and GET w/ '?' entries, we are able to 
store multiple snapshots in time even if the '?' parameters and the POST 
parameters are the same.  But with normal GET we cannot.  I assume that we are 
currently doing something like the following:

1. For GET without '?' we always search the cache based on the URL.  If we 
don't find the entry we go to the network.
2. For POST and GET with '?' we hold some kind of pointer or unique ID to the 
original cache entry so we can get back to it.  If the unique entry is gone 
from cache, then we go back to the network rather than search by URL + 
parameters.  These entries are presumably *not* pinned in the cache, however, 
because that would be inefficient.

If I understand all this, then what prevents us from using the same mechanism 
for GET requests to get a "unique ID" to entries in the cache which are not 
pinned down?  *Then*, if the unique entry is gone from the cache, we search the 
cache based on the URL.  Then if we find nothing we go to the network.

I may be misunderstanding what we do for POST and GET '?' requests.

Comment 42

17 years ago
You are correct about POST and GET ? urls.  However, the unique ID becomes part 
of the key that they are searched for in the cache, because those responses are 
only valid when viewed by history and we never want to find them in the cache 
when visiting a "new" url (clicking a link or typing in the location field).

If we used this approach for ALL GETs, we would never be able to reuse anything 
we put in the cache for "new" visits.
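
A hypothetical illustration of "the unique ID becomes part of the key"; the
actual key format necko uses is not shown here.

#include <string>

// Responses that are only valid for history (POST results, '?' GETs) get a
// per-load ID folded into the cache key, so a plain lookup by URL for a "new"
// visit can never find them.
std::string MakeCacheKey(const std::string &uriSpec, long historyOnlyId) {
    if (historyOnlyId == 0)
        return uriSpec;                   // ordinary GET: the key is just the URL
    return "id=" + std::to_string(historyOnlyId) + "&uri=" + uriSpec;
}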
(Reporter)

Comment 43

17 years ago
We do not need this approach for *all* GETs. If a GET is cacheable, subsequent
requests will be served from cache and thus result in identical pages. But we
should use it for any uncacheable GET.
How about a weak reference scheme to deal with this?

When someone requests a document and gets it out of the cache, give them back
the cache ID of the request.  On subsequent attempts to get the document, they
do a call like GetCacheItemByID() which will attempt to find the original item,
and if they cannot find it (if it's expired), they try to find by the URL. 
Basically two different search mechanisms to get at the data.  Maybe even a weak
reference pointer class with a pointer to the nsCacheEntry (or Descriptor?), and
when the cache expires the entry, it goes out to all these weak reference
objects and sets the pointer to NULL.  This could be simpler to implement, and
would be pretty efficient too, I imagine.

When a new copy of a URL is fetched from the network, the old copy is not
destroyed in this case, but instead pushed aside, so that a search by URL will
turn up the new copy.  Perhaps they are ordered in the index by date (most
recently fetched is the one to grab).  This *does* mean that the index in the
list of cache entries will not be unique anymore.

Is this feasible?  Is the index to the cache in memory?  Can the current "key"
be non-unique as long as the search algorithm returns the most recently created
cache entry with that key?

This could be useful elsewhere, I am sure.

I think we may have a blizzard tomorrow, in which case I'd be happy to work on
this.  I've been perusing the cache code.
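
A toy model of the two-step lookup described above, using std::weak_ptr to
stand in for the proposed weak-reference pointer class; GetCacheItemByID and
the other names here are hypothetical, not existing cache APIs.

#include <memory>
#include <string>
#include <unordered_map>

struct CacheEntry {
    std::string url;
    std::string body;
};

struct HistoryItem {
    std::string url;
    std::weak_ptr<CacheEntry> exactCopy;   // the copy the user actually saw
};

class Cache {
    std::unordered_map<std::string, std::shared_ptr<CacheEntry>> byUrl_;
public:
    std::shared_ptr<CacheEntry> findByUrl(const std::string &url) const {
        auto it = byUrl_.find(url);
        return it == byUrl_.end() ? nullptr : it->second;
    }
    std::shared_ptr<CacheEntry> store(const std::string &url, std::string body) {
        auto entry = std::make_shared<CacheEntry>();
        entry->url = url;
        entry->body = std::move(body);
        byUrl_[url] = entry;               // a newer copy shadows the older one
        return entry;
    }
};

// Stub standing in for a real network fetch.
std::string FetchFromNetwork(const std::string &url) {
    return "<html>fetched " + url + "</html>";
}

std::string LoadForHistory(Cache &cache, HistoryItem &item) {
    if (auto exact = item.exactCopy.lock())      // 1. the exact copy, if still cached
        return exact->body;
    if (auto byUrl = cache.findByUrl(item.url))  // 2. fall back to a URL lookup
        return byUrl->body;
    auto fresh = cache.store(item.url, FetchFromNetwork(item.url)); // 3. refetch
    item.exactCopy = fresh;
    return fresh->body;
}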
I should point out that there is nothing inherently special about the '?' part
of a GET request -- there is no reason why a query couldn't be part of the host
name, in fact.

For example, see:
   http://$sum(42,17,-3).x42.com/
...which is a TOTALLY VALID domain name and triggers a CGI script. There is no
difference between that and
   http://www.example.com/sum?42&17&-3
...which we treat differently (?).

I don't know if this means we have to file a new bug or something...

Comment 46

17 years ago
I never suggested that space should be wasted with identical copies of data. 
Yes, a naive implementation would waste space on identical copies, but a better
solution would not.  A quote from my 2000-12-28 message under bug 40867: "When
the LRU cache Necko needs happens to point to identical data, the cache manager
could share the memory space."

The desire to avoid redundant copies of data is an excellent reason to have
Necko's LRU cache and the history mechanism depend on a common, independent
cache manager subsystem, but it's NOT such a good reason to have the history
mechanism depend on Necko's LRU cache.  Doing so has caused a number of bugs and
guaranteed that unexpected behavior (e.g. reloading from the network) is always
possible.

I think the reason why these issues have been so complicated was because
functions that should have been independent have always been merged (document
caching mechanisms to manage previously-retrieved documents vs. LRU cache
mechanisms to avoid redundant network transfers).  While they're both caching
functions, one is needed for correctness, the other for efficiency.  By merging
them, we've achieved efficiency without correctness, and caused a wide variety
of problems that could have been avoided by using a cleaner design from the
beginning.  It doesn't make sense to go through the networking layer to retrieve
an in-memory copy of a document that was previously retrieved.  I think the best
solution is to have TWO caches, the LRU cache and an independent cache, with the
LRU cache dependent on the independent cache.

Basically, what I'm suggesting is when data is retrieved (by ANY mechanism,
whether "cacheable" or not, even "file:"), the independent cache manager would
create a new "handle" and return it to Necko, which would save that handle in
the LRU cache for that URL (if it's cacheable), and return the handle to its
caller with the data. If another request comes to Necko for the same URL, and
the handle is in the LRU cache, Necko would fetch the data from the independent
cache manager and return it along with the same handle that was returned before,
since the contents haven't changed.  If the data is fetched again (forced
reload, expired LRU cache entry, etc.), then the new handle would be stored in
the LRU cache after saving the new data in the independent cache.  If Necko's
caller (e.g. a DocShell) later wants a copy of the data again (Back, View
Source, etc.), it would use the handle to request the data from the independent
cache manager, leaving Necko out of the loop.  Therefore, it would ALWAYS
receive exactly the same data for the content in question, or an error if it's
not available -- it would never receive an updated version unexpectedly, since
it would have to ask Necko for an updated copy if necessary.

Ideally, the independent cache should determine if identical content was already
in the cache (when being handed new content by Necko), and avoid saving a
redundant copy, returning the original handle instead.  The independent cache
manager would have the right to move content between memory and disk at will,
retaining the same handle to refer to it.  (This might be done with another LRU
mechanism inside the independent cache, independent of Necko's LRU cache.)  The
handle could either be a hard reference, or a weak reference allowing "locking"
to keep the content from being deleted (e.g. when currently displayed in a
window).  Using a weak reference with locking is probably the most flexible
solution, and allows user-tunable policy (e.g. size limits) to be implemented
for the independent cache manager.

It should be obvious that this two-tier solution would (1) not generally waste
memory on redundant copies of data, yet (2) keep older copies (as well as new
ones) when it's appropriate to do so, and (3) never cause the kind of behavior
that has triggered so many bug reports with the current solution.  While it may
be somewhat more complex, it's also a cleaner design, and I think it's the right
solution.  What value is there in keeping the current design, besides inertia?
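
A toy model of the two-tier proposal, with all names hypothetical rather than
existing Necko interfaces: the content store hands out opaque handles and
de-duplicates identical content, while a separate URL-keyed layer only maps
URLs to handles.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical independent cache manager: content is addressed only by opaque
// handles, which callers (docshell, Save As, View Source) hold on to.
class HandleStore {
    std::unordered_map<uint64_t, std::string> contentByHandle_;
    std::unordered_map<std::size_t, uint64_t> handleByContentHash_;  // cheap dedup
    uint64_t nextHandle_ = 1;
public:
    uint64_t store(const std::string &content) {
        std::size_t hash = std::hash<std::string>{}(content);
        auto dup = handleByContentHash_.find(hash);
        if (dup != handleByContentHash_.end() &&
            contentByHandle_[dup->second] == content)
            return dup->second;              // identical content shares a handle
        uint64_t handle = nextHandle_++;
        contentByHandle_[handle] = content;
        handleByContentHash_[hash] = handle;
        return handle;
    }
    const std::string *lookup(uint64_t handle) const {
        auto it = contentByHandle_.find(handle);
        return it == contentByHandle_.end() ? nullptr : &it->second;
    }
};

// Hypothetical Necko-side layer: maps a URL to the handle of its *current*
// copy (a real implementation would bound this with LRU eviction). History
// entries skip this map and go straight to the store with their saved handle.
class UrlToHandleCache {
    std::unordered_map<std::string, uint64_t> handleByUrl_;
public:
    void remember(const std::string &url, uint64_t handle) { handleByUrl_[url] = handle; }
    bool find(const std::string &url, uint64_t *handle) const {
        auto it = handleByUrl_.find(url);
        if (it == handleByUrl_.end()) return false;
        *handle = it->second;
        return true;
    }
};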

John, while I think your "LRUNRBHI" cache might be an improvement, I believe
that separating out the LRU cache from the reference-counted cache is much
cleaner and safer.

Darin, I think you're mistaken in your belief that the current definition of
"history" is adequate, even for average users, but especially for power users
and web developers.  Even the average user is likely to be upset if they can't
go "back" to the content they remember seeing -- if a page changes, and they saw
the old version in their history, they'll rightly expect to be able to go back
and see what the old one looked like.  But forget the average user -- even if
the masses accept this sort of limitation, web developers won't be so forgiving,
and we do WANT web developers to target Mozilla as a preferred platform, don't
we?  If we don't do it right, the web developers won't like working with Mozilla
and that would necessarily impede Mozilla's acceptance.  In itself, that should
be sufficient reason to want to get this exactly right, not just close enough.

If there's ever a preference option for this, the correct behavior should be the
default, not a hidden mode that you need to make an effort to enable.

Gordon, what issues do the multi-page web apps involve that wouldn't be solved
by the two-tier cache architecture I'm suggesting?

Also, I _strongly_ disagree that the current approach follows the spirit of
section 13.13 -- we aren't making an effort to show EXACTLY what the user saw
previously (which is explicitly stated in 13.13, not just implied), and we ARE
showing a "semantically transparent view of the current resource" by refetching
it from the network any time the content has been evicted.  In fact, the ONLY
reason that it doesn't always show the most current state of the resource is
because the cache is used to avoid a network transfer -- when this is correctly
showing exactly what the user saw before, this is by coincidence, not by design.

While RFC 2616 and section 13.13 may not be telling us how we MUST implement
history mechanisms, the guidelines laid down couldn't be more plain, and we're
certainly not following those guidelines in the current implementation, either
in the spirit or by the letter.  Saying that we aren't REQUIRED to do it doesn't
change the fact that these are GOOD guidelines, and we don't really have a good
reason to be violating them -- it's just been done for convenience and out of
some fears of inefficiency.  (And I don't think the right solution need be
inefficient at all.)

Instead of EVER refetching from the network automatically, it would probably be
better to put up a page like NN4 does saying that the previous content has
expired from the cache (regardless of whether it's a POST or not), and require
the user to hit Reload to see the content.  (Maybe for non-POST data that can be
identified as not having been changed since the original retrieval, it might
make sense to automatically retrieve the page again, but only if it's fairly
certain that the content is unchanged.  If so, it should be possible to disable
this with a preference option.)

As for an approach that can be used on *all* GETs, the two-tier cache approach
I'm suggesting would work fine when applied to all GET operations.  Any time the
cache would have returned the contents, a shared handle will be used.  If the
content is non-cacheable or has expired, it will be fetched again as expected,
but old copies will remain available if they are still in use.  This approach
doesn't require the browser to second-guess the server based on using POST or
GET with ? to guess whether content is dynamic; it would just work regardless.

Okay, so tell me.  What's wrong with my suggestion?
I think your solution is a good idea.  Having two levels of the cache separates 
out some stuff that I was combining in mine.  What I was proposing was a single 
cache that you can search both based on a unique ID or reference, and a URL.  
You propose a data store indexable by unique ID/reference, and a separate 
search mechanism; this independence would have three major benefits:

1. It makes it easier to understand: the real "cache" is just a way of storing 
data that can be expired, and all other ways of accessing it are just views 
into it.
2. It would make the cache easily searchable in other ways down the road
3. It would make it possible to make the URL search more efficient by not 
placing (for example) POSTs and '?' GETs in the search list, since they are not 
ever really searched by URL anyway.  Smaller data size means faster search.

I agree that it should be possible for two copies of the same URL to be in the 
cache.  From that point of view, the cache needs a little work, IMO.  It should 
just be a store with a bunch of entries which can be expired.

I disagree, however, that the browser should use this feature to its limit, and 
require the cache to keep a copy of every single page in history until the 
history item goes away.  The Netscape guys here have a very good point, in that 
users could *easily* visit hundreds of pages in the history of a single browser 
window.  If anything, the cache should assign priority to those pages that are 
most recent in browser history.

And while we're talking about content that should be kept around for history 
lists, if we want to be strict about the RFC, we have to keep all *images* 
around too.  What if they are graphs, for example, dependent on time--like 
perhaps an up-to-the-minute stock chart?  Now we're talking an extreme memory 
burden.

Memory has to be a factor.  However, it could be OK (for web developers' sake) 
to turn on the "keep all pages around" feature as a Pref, or (probably 
better) "keep n pages in history around".  Then the question becomes whether to 
keep the images too.
Oh, one more thing that makes a separation of the URL search index from the 
data store itself desirable: the HTTP directive that says not to cache the 
item.  This way, you can keep the non-cached item around in the data store and 
just not keep the index in the URL search.

The expiration policy could be improved, too.  The first thing you get rid of 
from the cache (you *always* get rid of) are entries that are non-searchable 
(like POST or '?' GET or no-cache entries) and which do not have any references 
to them.  Really, you could safely get rid of these as soon as the reference 
count goes to 0.

When an entry expires from the cache (the directive that says cache for x 
minutes) the entry becomes a no-cache entry, too, and is removed from the 
search list.
(Assignee)

Comment 49

17 years ago
ian: the '?' behavior comes as a recommendation from rfc 2616 on the basis
of maintaining compatibility with existing browser behavior and especially
older servers.  we would actually be ``correct'' in not handling '?' GETs
any differently from normal GETs... that is, we should be able to do the
right thing just based on the values of the headers; however, some older
servers may not send the correct headers, so we chose to heed the rfc's
recommendation on this point.
Darin: ok, that's cool. So basically what we do is simply ignore the cache
headers for GET requests with '?' in the URI? If so, then that's fair enough (I
was concerned that we might be doing the opposite, namely ignoring the spec in
all cases rather than just one). Cheers!
(Reporter)

Comment 51

17 years ago
Ian: I don't know exactly what we are doing, but RFC 2616 says in section 13.9:
"caches MUST NOT treat responses to such URIs [containing "?"] as fresh unless
the server provides an explicit expiration time".
I'm not happy with that, but it is the spec.
(Assignee)

Updated

17 years ago
Depends on: 75679

Comment 52

17 years ago
Yes, the user could easily visit hundreds of pages in a single browser window. 
So what?  That's no reason to architect the system not to save them all -- it's
a reason to include user-tunable preferences to control the policy on how much
to keep for how long.  I did mention that another LRU mechanism could exist in
the independent cache manager for this sort of policy-based expiration, which
would be unrelated to Necko's LRU cache mechanism.

Keep in mind that costs of memory and disk space keep dropping, and as a user,
I'd rather have hundreds of history pages cached than to have the space wasted.
 And if I can't afford to use the space for caching, I'll set the preferences so
that the policy will be only to keep a limited amount of data.  I'm only
recommending that the current document in each window be locked in the cache;
any history document not currently displayed would be a possible candidate for
eviction based on the policy preferences.

In general, I'd suggest evicting cacheable content FIRST, since it's more likely
to be unchanged if it must be reloaded from the network.  Content that cannot be
cached is more likely to be dynamic content, and therefore more important to
keep around for history pages.  After giving non-cacheable content priority, a
LRU mechanism could be used to evict the least recently used pages first. 
Perhaps a ranking scheme that balances the time of last access against dynamic
content and maybe size would be the best solution.  (Size because expiring a few
large documents might avoid the need to expire many small ones.)  Regardless,
all of these are expiration policy mechanisms, and (as with Usenet) there's no
single solution that everyone will find acceptable, so tunable parameters are
the most appropriate solution.

There's good reason to store EVERY page retrieved in the independent cache,
whether or not it belongs in Necko's LRU cache.  But just because it gets stored
there doesn't mean it has to stay indefinitely just because it's well back in
the user's history.  For one thing, if the history itself has a fixed limit
(hardcoded to 50 right now?), this would provide one limiting factor.  (Such a
limit should, of course, be possible to tune or disable in the user's prefs.) 
Otherwise, pages still "available" in the history could be expired early if
necessary, and then perhaps reloaded from the network if needed later, with or
without user interaction according to the prefs.

It would make sense to remember whether or not an evicted page was cacheable,
since a reasonable default would be to reload from the network transparently for
static data that was expired early (with a warning message in the status bar),
and put up a "press reload" page for dynamic content (as NN4 does with POST). 
(I'm not sure offhand if any other metadata would be worth saving.)

The independent cache could also potentially store the associated DOM tree for
some of the saved pages -- this would have to be expired quickest from the cache
(without expiring the source), but for machines with sufficient memory, this
could be a BIG performance win for very recently-accessed pages.  Just think how
impressive it would be if the first few times you press "Back" could be drawn
nearly as fast as incremental reflows happen when you resize a window now...
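
One hypothetical way to express that ranking as a single eviction score; the
weights are arbitrary and purely illustrative, not a proposed Mozilla policy.

#include <cstddef>
#include <ctime>

// Hypothetical candidate record for a history-cache entry.
struct EvictionCandidate {
    bool cacheable;          // static content: cheaper and safer to refetch
    std::time_t lastAccess;
    std::size_t sizeBytes;
};

// Higher score = evict sooner. Cacheable content is preferred for eviction,
// then older and larger entries; the constants are arbitrary.
double EvictionScore(const EvictionCandidate &c, std::time_t now) {
    double score = c.cacheable ? 1000.0 : 0.0;           // class priority
    score += static_cast<double>(now - c.lastAccess);    // seconds since last use
    score += static_cast<double>(c.sizeBytes) / 4096.0;  // mild size bias
    return score;
}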
(Assignee)

Comment 53

17 years ago
the fix for the original bug report has been checked in along with the patch
for bug 75679.  this does not include the RFE to make history keep around "old"
content... a separate bug, filed against the history component, should be opened
to track that RFE.
Status: ASSIGNED → RESOLVED
Last Resolved: 17 years ago
Resolution: --- → FIXED

Comment 54

17 years ago
The quote from RFC 2616 in the original description of this bug merely codifies
the actual expectations of the user -- to the extent that these guidelines have
been violated, the user can also expect to view the behavior as incorrect (i.e.
a bug).  I can't speak for the original reporter, but the quote suggests that
this bug was intended to cover ALL of the deviations from the quoted RFC
guidelines, not merely the most egregious ones.

If the patch hasn't addressed the problem of returning old versions of content
or retaining content for history when the user flushes the cache, then it would
appear that this bug is NOT fixed.  If this bug is under the wrong component,
then change it, but don't represent a partial fix as a complete one!  Finishing
the job is hardly an "enhancement"...
(Assignee)

Comment 55

17 years ago
deven: my concern is first and foremost that of providing parity with existing
browser behavior.  i feel that this has been done, and with it the critical'ness
of this issue has been addressed.  you may argue that more work remains to be
done, but in my mind you're talking about an enhancement to the current browser
requirements/feature-set.  so, please feel free to file a new bug to track the
new feature you describe.  i suggest filing it against the history component, as
the new cache already provides support for holding hard references to stored
content.

Comment 56

17 years ago
Why not leave this bug open and reduce the severity to normal?  After all, the
description still encompasses the remaining problems, but (as you point out)
it's no longer as critical as before.  If the history component makes more
sense, change that too.  All I'm saying is that we shouldn't mark the entire bug
as fixed just because a partial fix made the remainder less critical.

The problem with closing this bug and making a new one is that it isn't really
equivalent -- this bug has history (in the comments) and people who have marked
themselves as interested (via CC's and/or votes) that would not be associated
with a newly-filed bug.  If you leave the bug open but reduce the severity and
change the component, all of that history will remain intact.  And people who
are no longer interested in the remaining parts of the bug can always remove
themselves from the CC list or cancel their vote.

In general, this seems like a cleaner way to handle partial fixes that address
the critical aspects of a bug -- there seems to be a preference for closing bugs
entirely and generating new reports for old problems, which seems like a poor
solution to tracking an old problem...

Comment 57

17 years ago
Oh, and I'm still not talking about an enhancement -- I'm talking about a bug
that happens to also exist in previous browsers like NN4.  Bug parity does not
an enhancement make.

Comment 58

17 years ago
I agree...file a new bug, against Session History (not Global History)
Deven: don't morph this bug.  Do cite it via "bug 56346" or "bug #56346" or
equivalent (bugzilla will linkify).  You're describing a different bug, in a
different component.

/be

Comment 60

17 years ago
With recent builds, if I type some comments in a bugzilla page and an error in
the cc line, click Commit, get the error page telling me to go back, and do so,
I reload the page with none of my changes there -- whatever comments and changes
I made are lost forever.  A week ago, comments were remembered.  Is that a side
effect of this cache fix?  Is it a different bug?  Already filed?  It's a
serious usability regression for people who use bugzilla a lot.
(Assignee)

Comment 61

17 years ago
akkana: i'm pretty certain that that bug was around before i landed.

Comment 62

17 years ago
This sounds more like frameset restoration (ie. layoutHistoryState) stuff than 
the cache to me :-)

-- rick
I'm looking into the problem with form value restoration in bugzilla. Bug 74639 
already covers this issue.
*** Bug 76150 has been marked as a duplicate of this bug. ***

Comment 65

17 years ago
It seems that bug 55583 (view-source should show original source) may be the
best place to discuss the remaining issues at this point...

Updated

17 years ago
Depends on: 90722