Two separate and distinct objects are important for Numeric and will be discussed separately. These should be kept as separate as possible so that they can be accepted separately into the Python Core.
I think the arrayobject can be finished within a month (end of February). Then it should be pushed for inclusion in the Python core. The case for this will be made stronger if numarray Record Arrays can inherit from it and be made to work. My initial review shows that they can, so I am hopeful.
The ufuncobject will probably take a couple more weeks beyond the arrayobject to finish.
Multidimensional Array Object
The attitude of this document is that Numeric does a good job filling this role. It needs only some minor modifications to make it fulfill the needs that have been identified.
The modifications to Numeric are each described in the next sections:
- Separate the Ufunc object completely from the arrayobject.
This probably needs to be done so that the array object can be placed in the Python core without enshrining a particular implementation of the more controversial ufuncs. It should be easy to change what the array object does for number methods. Whether this is done through an API call (as is currently done, only expanded so that individual calls can be replaced as desired) or through subclassing can be discussed.
Right now, I personally favor an API call because it discourages the use of multiple array number behaviors. But maybe it is a good idea to encourage subclassing instead. For now, the API method of altering the ufuncs will be implemented, as that is basically what is done now.
Also, the _numpy module should be eliminated and the appropriate segments moved to multiarraymodule.c and umathmodule.c.
- Make the Array Object a new base type
This is straightforward and will help answer questions about the speed of new-style C-types. Which PyArrayObject structure members should be available for alteration (the getsets table) is a valid question. This needs to be done totally in C with a good API.
- Add all basic C-types
This includes long double, complex long double, unsigned long, long long, unsigned long long, bool, and unicode character arrays. I favor naming these as is currently done, but defining a set of enums and typedefs that make the Int8, Int16 names available to the C programmer. This means a lot of #define statements at compile time to figure out what the precision of the platform is. I would also like to see a Pint defined as an integer that can hold a pointer for the platform. This should be mapped to one of the underlying types as well.
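As a rough illustration of the Pint idea, the sketch below uses Python's standard ctypes module to pick the first C integer type whose size matches the platform's pointer size. The function name pointer_sized_int is hypothetical; the real definition would presumably be a C typedef selected by #define tests at compile time.

```python
import ctypes

def pointer_sized_int():
    """Return the first signed ctypes integer type as wide as a pointer."""
    pointer_size = ctypes.sizeof(ctypes.c_void_p)
    for ctype in (ctypes.c_int, ctypes.c_long, ctypes.c_longlong):
        if ctypes.sizeof(ctype) == pointer_size:
            return ctype
    raise RuntimeError("no integer type matches the pointer size")

# Hypothetical stand-in for the C-level Pint typedef.
Pint = pointer_sized_int()
print(Pint.__name__, ctypes.sizeof(Pint))
```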
- bool-based masking when object in brackets is a bool array
Like numarray does, a[B] should return a copy of the elements of the array at the positions where the mask is true. a[B] = x should assign x to those locations of a where B is 1. If x is an array, it will be treated as a 1-d array for the purpose of getting values to assign.
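The intended masking semantics can be sketched in plain Python on flat lists. The helper names mask_get and mask_set are hypothetical, and raising an error when the number of values does not match the mask is an assumption, since the text does not say what should happen in that case.

```python
def mask_get(a, mask):
    """Return a new list holding the elements of a where mask is true."""
    return [value for value, keep in zip(a, mask) if keep]

def mask_set(a, mask, x):
    """Assign into a, in place, at the positions where mask is true.

    A scalar x is repeated; a sequence x is consumed in order as a 1-d array.
    """
    count = sum(1 for keep in mask if keep)
    values = list(x) if isinstance(x, (list, tuple)) else [x] * count
    if len(values) != count:
        # Assumed behavior: the text does not define this case.
        raise ValueError("number of values does not match the mask")
    source = iter(values)
    for i, keep in enumerate(mask):
        if keep:
            a[i] = next(source)

a = [10, 20, 30, 40]
B = [True, False, True, False]
picked = mask_get(a, B)   # a copy: [10, 30]
mask_set(a, B, [1, 2])    # a is now [1, 20, 2, 40]
```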
- index-based slicing when object in brackets is an integer array (or a list?)
When the object inside brackets is any kind of integer array it should be treated as 1-D indexes into a. Multidimensional slicing should also be available and be done as numarray does it.
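For the 1-D case, a minimal sketch on flat Python lists might look as follows. The names take_flat and put_flat are hypothetical, and letting negative indexes follow ordinary Python conventions is an assumption.

```python
def take_flat(a, indexes):
    """Return a new list with the elements of a at the given integer indexes."""
    return [a[i] for i in indexes]

def put_flat(a, indexes, values):
    """Assign values to a, in place, at the given integer indexes."""
    for i, value in zip(indexes, values):
        a[i] = value

a = [10, 20, 30, 40]
selected = take_flat(a, [3, 0, 0])   # indexes may repeat and come in any order
put_flat(a, [1, 2], [-1, -2])
```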
- Make Character Arrays
The PyArray_CHAR and PyArray_UCHAR types need to be character (string) arrays which can have arbitrary itemsizes and are no longer assumed to be single characters. As is, they are never used in Numeric, while the Numarray-introduced idea of Character Arrays looks useful. This will only involve a few changes, including looking for the itemsize in the array instead of in the descriptor.
- Modifications needed for Record Arrays
The PyArrayObject structure needs to hold the itemsize variable and a new PyArray_VOID type needs to be defined. This type will not do much but will be able to be sliced and manipulated as needed. Code that uses itemsize needs to be fixed to use the array's value.
- New flags
In order to facilitate de-referencing records stored in the new VOID type, the array flags will need to be extended with the Numarray ALIGN and BYTESWAPPED flags. Code that assumes data is aligned or not swapped will need adjusting.
- Iter Object
An iterator object needs to be defined to walk through the array. This should work for an arbitrary array (misaligned, byteswapped, etc.).
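To make the requirement concrete, here is a small Python sketch of such an iterator over a raw byte buffer, using the standard struct module. The function iter_elements and its parameters are hypothetical; strides are in bytes, and the struct format string stands in for the byteswapped flag.

```python
import struct

def iter_elements(buffer, shape, strides, offset=0, fmt="<i"):
    """Yield every element of an n-d array stored in a bytes-like buffer.

    strides are in bytes and fmt is a struct format that names the byte
    order. Elements are visited in C (row-major) order.
    """
    if not shape:
        yield struct.unpack_from(fmt, buffer, offset)[0]
        return
    for i in range(shape[0]):
        yield from iter_elements(buffer, shape[1:], strides[1:],
                                 offset + i * strides[0], fmt)

# Six big-endian 32-bit integers that start at byte 1, so they are misaligned.
data = b"\x00" + struct.pack(">6i", 0, 1, 2, 3, 4, 5)
rows = list(iter_elements(data, (2, 3), (12, 4), offset=1, fmt=">i"))
transposed = list(iter_elements(data, (3, 2), (4, 12), offset=1, fmt=">i"))
```

The same buffer is walked twice: once as a 2x3 array and once as its 3x2 transpose, which is discontiguous. Only the shape and strides change.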
- Move type-specific functions to PyArray_Descr structure
All of the type-specific functions in multiarraymodule (compare, argfunc, dot) will be moved into the PyArray_Descr structure as members.
- Enhanced delegation
Many of the Array functions defined in multiarraymodule will be enhanced to delegate to an appropriately named method (this will also be available through the C-API). All internal functions for VOID arrays (getitem, setitem, compare, argfunc, dot) will look to an object passed in for a specific method to implement.
- Open Ideas
- Should several types be defined (i.e 0d, 1d, 2d, 3d, nd)?
I don't think this needs to be done and may be too big of a risk given the early state of subtyping in C. Optimizations for contiguous arrays can be placed throughout the code instead.
- Should the dimensions and strides memory locations be fixed to MAX_DIMS or possibly allocated using a pre-allocated chunk of space?
I favor the current scheme, possibly with a pre-allocated chunk of space being used for fast malloc and free. The downside is that we would have to write an equivalent malloc and free, and I'm not sure how to do that well. I'm opposed to static allocation of dimensions and strides, due largely to the thought of "wasted" space in the many-small-arrays case. An application with 12000 1-d arrays, for example, would waste (12000 * MAX_DIMS * sizeof(pint)), which is about 2MB or 4MB.
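The estimate can be checked with a few lines of Python. The value MAX_DIMS = 40 is an assumption that is not stated in the text; it is chosen here only because it gives totals close to the quoted 2MB and 4MB figures for 4-byte and 8-byte pint sizes.

```python
MAX_DIMS = 40       # assumed value, not stated in the text
N_ARRAYS = 12000    # number of small arrays in the example

def static_bytes(pint_size, n_arrays=N_ARRAYS, max_dims=MAX_DIMS):
    """Bytes reserved by one statically sized dimensions (or strides) block per array."""
    return n_arrays * max_dims * pint_size

for pint_size in (4, 8):
    total = static_bytes(pint_size)
    print(f"sizeof(pint) = {pint_size}: {total} bytes ({total / 1e6:.2f} MB)")
```

Under that assumption the totals are 1,920,000 bytes and 3,840,000 bytes. A 1-d array needs only one entry, so nearly all of that space would sit unused.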
- Should the PyArray_Descr structure be made into a Python type?
I've gone back and forth on this. Right now, I think it would add too much complexity to the code base and would move us too far away from the current code base for what is needed. We don't need to do this to make RecordArrays work, so I favor postponing doing this until later.
Ufunc Object
This object implements the number protocol methods. It should be completely separate from the array object. In Numeric 3.0 I would change the following:
- Handle contiguous arrays specially
- Allow for misaligned and byteswapped arrays using temporaries --- perhaps handle discontiguous arrays using temporaries as well.
- Use small temporary arrays for typecasting like numarray -- don't typecast the entire array.
- Keep the ufuncobject essentially as is.
- Change the coercion model to numarray-like.
Scalars should not cause coercion except from int to float or float to complex.
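One reading of this rule can be sketched in Python at the level of type kinds only. The names KIND_ORDER, scalar_kind, and result_kind are hypothetical, and the text does not say which precision is chosen when the kind does change, so the sketch deliberately ignores bit widths. Extending the rule from int directly to complex is also an assumption.

```python
KIND_ORDER = {"int": 0, "float": 1, "complex": 2}

def scalar_kind(value):
    """Classify a Python scalar as 'int', 'float', or 'complex'."""
    if isinstance(value, complex):
        return "complex"
    if isinstance(value, float):
        return "float"
    if isinstance(value, int):   # bool is a subclass of int
        return "int"
    raise TypeError(f"unsupported scalar: {value!r}")

def result_kind(array_kind, scalar):
    """Kind of the result when an array of array_kind meets a Python scalar.

    The array keeps its kind unless the scalar's kind ranks higher.
    """
    kind = scalar_kind(scalar)
    return kind if KIND_ORDER[kind] > KIND_ORDER[array_kind] else array_kind

print(result_kind("float", 3))    # int scalar leaves a float array alone
print(result_kind("int", 2.5))    # float scalar moves an int array to float
```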
- Make a Python call to add a ufunc object
Perhaps one that takes a CPointer (distutils could be used to validate a code snippet with the CPointer to make sure it will work).
- Support subclasses of arrays.
If an arbitrary object is found (not an array or subclass) call its method by the same name. Think about using the Priority scheme that numarray uses.
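A toy Python sketch of both ideas follows. Everything here is hypothetical: the Array stand-in, the priority attribute (modeled only loosely on the numarray idea), and the helper functions apply_unary and apply_binary are illustrations, not an existing API.

```python
import math

class Array:
    """Stand-in for the array type: a thin wrapper around a list."""
    priority = 0.0   # hypothetical attribute used to pick the result class

    def __init__(self, data):
        self.data = list(data)

class Tagged(Array):
    """A subclass that should win when mixed with plain arrays."""
    priority = 10.0

class Angle:
    """An arbitrary object (not an array) that supplies its own sqrt method."""
    def __init__(self, value):
        self.value = value

    def sqrt(self):
        return Angle(math.sqrt(self.value))

def apply_unary(name, func, arg):
    """Apply func element-wise to arrays; otherwise call arg's method called name."""
    if isinstance(arg, Array):
        return type(arg)(func(x) for x in arg.data)
    method = getattr(arg, name, None)
    if method is None:
        raise TypeError(f"object has no method {name!r}")
    return method()

def apply_binary(func, left, right):
    """Apply func element-wise; the operand with higher priority picks the result class."""
    winner = left if left.priority >= right.priority else right
    return type(winner)(func(x, y) for x, y in zip(left.data, right.data))

root = apply_unary("sqrt", math.sqrt, Angle(9.0))
mixed = apply_binary(lambda x, y: x + y, Array([1, 2]), Tagged([3, 4]))
```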
- Support IEEE exceptions / eliminate current CHECK. Use methods of numarray.
- Other Open Ideas.
- adapting software ufuncs at compile time. I think we could easily do a little bit here (like checking for which of 3-4 unrolled loops works best, or what the best buffer size for temporary arrays is), but I'm not sure how to do this really well (Eric?)
- supporting the idea of chained ufuncs (perhaps using + or %) on ufuncs. This would be a method for "chaining" ufuncs together to eliminate the need for full temporary arrays when parsing complicated math expressions. Basically, by "adding" ufuncs together, a super ufunc is created that performs the entire expression by creating intermediate buffers. In other words, instead of a*b*c + d you would enter ((multiply + multiply) + add)(a,b,c,d) (or enter ufunc("a*b*c+d"), which would parse the expression for you and build up the necessary chained ufunc). Only intermediate temporary buffers would ever be created.
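The chaining idea can be tried out with a toy Python class operating on lists. The class ChainableUfunc, its from_scalar constructor, and the bufsize parameter are hypothetical. The sketch assumes that "adding" two ufuncs feeds the output of the first into the first input of the second, which matches the ((multiply + multiply) + add)(a,b,c,d) example.

```python
import operator

class ChainableUfunc:
    """Toy element-wise function over lists that supports chaining with +."""

    def __init__(self, block_func, nin):
        self.block_func = block_func   # maps nin equal-length lists to one list
        self.nin = nin

    @classmethod
    def from_scalar(cls, func, nin):
        return cls(lambda *blocks: [func(*items) for items in zip(*blocks)], nin)

    def __add__(self, other):
        first, second = self, other

        def chained(*blocks):
            # The intermediate result is only as long as one buffer.
            temp = first.block_func(*blocks[:first.nin])
            return second.block_func(temp, *blocks[first.nin:])

        return ChainableUfunc(chained, first.nin + second.nin - 1)

    def __call__(self, *arrays, bufsize=4):
        if len(arrays) != self.nin:
            raise TypeError(f"expected {self.nin} inputs, got {len(arrays)}")
        out = []
        for start in range(0, len(arrays[0]), bufsize):
            blocks = [arr[start:start + bufsize] for arr in arrays]
            out.extend(self.block_func(*blocks))
        return out

multiply = ChainableUfunc.from_scalar(operator.mul, 2)
add = ChainableUfunc.from_scalar(operator.add, 2)
fused = (multiply + multiply) + add

a, b, c, d = [1, 2, 3, 4, 5], [2] * 5, [3] * 5, [1] * 5
result = fused(a, b, c, d, bufsize=2)   # a*b*c + d, two elements at a time
```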
This area is for general issues to be raised (and addressed).
- Will use of byteswapped or misaligned arrays force a whole array copy or will the buffer mechanism you mention for typecasting be used for these as well?
Byteswapped or mis-aligned arrays will use a buffer mechanism (like numarray does), so that copies are only done as needed or requested.
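A rough Python sketch of such a buffer mechanism, using the standard array module, is shown below. The function add_scalar_swapped and its bufsize parameter are hypothetical. Each chunk is copied into a small native-order temporary, operated on, swapped back, and written out, so the whole array is never copied at once.

```python
from array import array

def add_scalar_swapped(raw, value, bufsize=4, code="i"):
    """Add value to every element of a byteswapped buffer, chunk by chunk.

    raw is a bytearray of non-native-order items of type code. It is
    updated in place, using temporaries of at most bufsize elements.
    """
    step = bufsize * array(code).itemsize
    for start in range(0, len(raw), step):
        chunk = array(code)
        chunk.frombytes(bytes(raw[start:start + step]))
        chunk.byteswap()                  # to native order
        for i in range(len(chunk)):
            chunk[i] += value
        chunk.byteswap()                  # back to the stored order
        raw[start:start + step] = chunk.tobytes()

# Build a byteswapped buffer holding 0..9, then add 100 in chunks of 3.
swapped = array("i", range(10))
swapped.byteswap()
raw = bytearray(swapped.tobytes())
add_scalar_swapped(raw, 100, bufsize=3)

check = array("i")
check.frombytes(bytes(raw))
check.byteswap()
```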
- Will this support memory mapped arrays?
Yes. The data buffer can point to a memory-mapped segment (and the array can be set as READONLY if needed).
- Will this support writing to memory mapped arrays (particularly if they are byteswapped) and without whole array copies?
Yes. No write to the array will assume contiguous, aligned, or non-swapped memory.
- Will there be support for 64-bit indices and array dimensions?
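The general idea of writing a single byteswapped element straight into a memory-mapped file can be illustrated with Python's standard mmap and struct modules. This is only an analogy for what the array object would do internally; the temporary file and the big-endian layout are assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile

# Create a small file holding four big-endian 32-bit integers.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, struct.pack(">4i", 1, 2, 3, 4))
    with mmap.mmap(fd, 0) as mapped:
        # Write element 2 in the file's byte order, directly into the mapping.
        struct.pack_into(">i", mapped, 2 * 4, 30)
        mapped.flush()
    os.lseek(fd, 0, os.SEEK_SET)
    contents = struct.unpack(">4i", os.read(fd, 16))
finally:
    os.close(fd)
    os.remove(path)
```

Only the four bytes of the one element are touched; the rest of the file is never read into a separate array.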
Yes. I've already defined a pint type in C which is the integer type that is large enough to hold a pointer on the platform. Currently this is the default size of indices and array dimensions and strides.
- Is having only an array object sans ufuncs important in the core?
(Perry) This addresses the point I made that it will encourage its use as a data interchange format. Nevertheless, most of the scientific community will find it nearly useless without installing ufuncs, so the goal of having it in the core to avoid extra installs hasn't been satisfied. So this point should be made more clearly: having this part in the core will encourage others to use arrays as a data format, but it doesn't remove the need to install more software to make it usable scientifically.
I'm not sure. I agree that for scientific use, an array object without the number methods is useless. I just recall Guido mentioning that he may not accept "complicated ufunc stuff" and so I thought by separating these into separate objects as much as possible, we could tackle getting them into the core separately. The separation will not be noticeable to the average user (unless they want to replace certain math operations, which becomes easier to do).
(Perry) Other issues, not critical technically, or even for our uses (though I have my opinions!), but which would be good to define clearly (there may be others in the existing Numeric or numarray communities who care more than us; if you aren't clear about these, I suspect it will cause grief later):
- What does single element indexing return? Scalars or rank-0 arrays?
Right now, a scalar is returned if there is a direct map to a Python type, otherwise a rank-0 array (Numeric scalar) is returned. But, in problems which reduce to an array of arbitrary size, this can lead to a lot of code that basically just checks to see if the object is a scalar. There are two ways I can see to solve this: 1) always return rank-0 arrays (never convert to Python scalars) and 2) always use special functions (like alen) that handle Python scalars correctly. I'm open to both ideas, but probably prefer #1 (never convert to Python scalars) unless requested.
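As an illustration of option 2, a helper in the spirit of alen could be sketched as below. This is a guess at the intended behavior (scalars count as length 1), not a description of any existing function.

```python
def alen(obj):
    """Length of the first dimension, treating scalars as length 1."""
    try:
        return len(obj)
    except TypeError:
        # Python scalars (int, float, complex) have no len().
        return 1

def total(values):
    """Sum that accepts a scalar or a sequence without an explicit scalar check."""
    if alen(values) == 1 and not hasattr(values, "__len__"):
        return values
    return sum(values)
```

Code written this way still needs some special handling, as total shows, which is part of why always returning rank-0 arrays (option 1) is attractive.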
- Can arrays be used as truth values directly?
(Perry) I think allowing it is error prone.
(Travis) Probably not, except for arrays that have only one element. I agree that otherwise it is not well defined and should therefore raise an error, forcing the user to be specific about what they mean.
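The proposed rule is easy to sketch in Python. The class FlatArray is a hypothetical stand-in for the array type, and the choice of ValueError and of the any/all method names are assumptions.

```python
class FlatArray:
    """Hypothetical 1-d array stand-in used to show the truth-value rule."""

    def __init__(self, data):
        self.data = list(data)

    def __bool__(self):
        if len(self.data) == 1:
            return bool(self.data[0])
        # Assumed exception type: the text only says "raise an error".
        raise ValueError("truth value is only defined for one-element arrays")

    def any(self):
        return any(self.data)

    def all(self):
        return all(self.data)

single = FlatArray([5])
several = FlatArray([0, 1])
if single:                   # allowed: exactly one element
    print("single is true")
print(several.any(), several.all())   # the user states what they mean
```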
- The behavior of rank-0 arrays regarding indexing and len()
(Travis) I'm open to ideas. I currently favor len() returning 1 and indexing raising an IndexError, but I could be persuaded otherwise.
- Default axis
(Travis) I'm not sure that anything should be changed here at this point.
(Travis) This should definitely be fixed. It's a bug that these return different types.
- How array types are handled at the Python level (type objects vs character codes) and the typecode vs type keyword
(Perry) I've never liked the character codes at the python level; what C uses is a different story
(Travis) As far as keyword goes, I suppose we should be consistent. I would favor using "type". Character codes can be confusing. I agree that the Python interface should continue to support the Int16, etc. objects. But, the use of character codes for describing different data types has a long history in Python and shouldn't disappear (it's done that way in several other modules like struct and array).
(Travis) I support changing the character codes to be consistent with the struct and array modules where possible. I also support the idea of using names like Short, Int, Long, Llong, and so forth in Python instead of only Int16, and so forth, because IntNN does not translate to the same C type on every platform, and often people want a certain C type (they don't know its bit width). Borrowing from numarray ideas, the new arrayobject.h header file defines IntNN types in C by determining the smallest C type equal to that bit-width.
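The platform dependence of IntNN can be seen from Python with the standard struct module, whose native-size codes correspond to C types. The table CODES and the function intnn_code are hypothetical; the sketch mimics the described approach by taking the first C integer type whose width equals the requested number of bits.

```python
import struct

# struct-module character codes paired with the C type each one names.
CODES = [
    ("b", "signed char"),
    ("h", "short"),
    ("i", "int"),
    ("l", "long"),
    ("q", "long long"),
]

def intnn_code(bits):
    """Return the first struct code whose native C type is exactly bits wide."""
    for code, _c_name in CODES:
        if struct.calcsize("@" + code) * 8 == bits:
            return code
    return None

for bits in (8, 16, 32, 64):
    print(f"Int{bits} -> {intnn_code(bits)!r}")
```

The printed mapping differs between platforms (for example, whether Int64 lands on "l" or "q"), which is exactly why both the IntNN names and the C-type names are useful.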