Researcher access is currently not available pending redesign. This material has been retained for reference and was current information as of late 2002.

Tool Documentation
The following is a list of tools created for the archive to enhance and simplify data mining tasks.

av_ah
Shortcut for av_append HostName.

av_alpha_section
Return section of data according to alphabetic criteria. Assumes input is sorted.

av_ap
Shortcut for av_append $P.

av_append
Append some fixed data as the last item in the output.

av_arcfilter
Move pages to stdout only if they match certain criteria.

av_bench
Print a single line giving the host and the Mb/sec achieved when “cp”-ing to each directory.

av_break
Transforms all spaces into newlines. Reads from stdin, writes to stdout.

av_cat
Copy a bunch of files to stdout, asynchronously.

av_chunk
Copy stdin to a series of files with the given size and prefix. If postPrefix is specified each file is renamed after being closed.
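The chunking behaviour described above can be sketched in Python. This is an illustration only, not the av_chunk source: the function name chunk_stream, the zero-padded numeric suffix, and the parameter names are all assumptions made for the example.

```python
import io
import os
from typing import BinaryIO, List, Optional


def chunk_stream(src: BinaryIO, size: int, prefix: str,
                 post_prefix: Optional[str] = None) -> List[str]:
    """Copy src into files of at most `size` bytes and return their final names.

    Hypothetical naming scheme: prefix (or post_prefix) plus a 4-digit counter.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    names = []
    index = 0
    while True:
        data = src.read(size)  # a buffered stream returns short data only at EOF
        if not data:
            break
        name = f"{prefix}{index:04d}"
        with open(name, "wb") as out:
            out.write(data)
        if post_prefix is not None:
            # Rename only after the file is closed, so a consumer watching for
            # the post-prefix name never sees a partially written chunk.
            final = f"{post_prefix}{index:04d}"
            os.replace(name, final)
            name = final
        names.append(name)
        index += 1
    return names
```

To chunk standard input, one would pass sys.stdin.buffer as src. Renaming after close is one plausible reason for the postPrefix option: a downstream process can poll for renamed files and know each one is complete.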

av_countchar
Count occurrences of a single byte value in a file.

av_cw
Compress whitespace.
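The description does not say exactly what "compress" means. A minimal Python sketch, assuming it means collapsing each run of spaces and tabs into a single space while leaving newlines alone (the function name compress_whitespace is hypothetical):

```python
import re

# Assumption: only spaces and tabs count as whitespace to compress;
# newlines are preserved so the line structure is unchanged.
_RUN = re.compile(r"[ \t]+")


def compress_whitespace(line: str) -> str:
    """Collapse every run of spaces/tabs in `line` into one space."""
    return _RUN.sub(" ", line)
```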

av_dedupcdx

av_diff
Just like diff.

av_divvy
Distribute the lines of the input files over the output files.
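One way to distribute lines over several outputs is round-robin assignment. The sketch below assumes that policy, which the description does not confirm, and uses in-memory lists in place of output files; the name divvy is borrowed from the tool only for readability.

```python
from typing import Iterable, List


def divvy(lines: Iterable[str], n_outputs: int) -> List[List[str]]:
    """Deal lines out to n_outputs buckets in round-robin order."""
    if n_outputs <= 0:
        raise ValueError("n_outputs must be positive")
    buckets: List[List[str]] = [[] for _ in range(n_outputs)]
    for i, line in enumerate(lines):
        # Line i goes to bucket i mod n, so bucket sizes differ by at most one.
        buckets[i % n_outputs].append(line)
    return buckets
```

For example, divvy(["a", "b", "c", "d", "e"], 2) yields [["a", "c", "e"], ["b", "d"]].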

av_dup
Duplicate stdin to a number of files.

av_explore
Interactive program to explore an index. Resolves queries and a whole lot more. See the AV_Explore FAQ.

av_getlastline
Prints the last line of the file to stdout, then truncates the file to remove that line.
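Taking the last line and then truncating the file effectively treats the file as a stack of lines. A rough Python sketch of that idea follows; pop_last_line is a hypothetical name, and the handling of a trailing newline is an assumption rather than documented behaviour.

```python
import os


def pop_last_line(path: str) -> bytes:
    """Return the file's last line (without its newline) and truncate it away."""
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        if end == 0:
            return b""  # empty file: nothing to pop
        # Ignore a single trailing newline when locating the last line.
        f.seek(end - 1)
        last_end = end - 1 if f.read(1) == b"\n" else end
        # Scan backwards byte by byte to the newline that precedes the line.
        pos = last_end
        while pos > 0:
            f.seek(pos - 1)
            if f.read(1) == b"\n":
                break
            pos -= 1
        f.seek(pos)
        line = f.read(last_end - pos)
        f.truncate(pos)  # drop the line and its trailing newline
        return line
```

A caller would write the returned bytes to stdout to mirror the tool's described output. The byte-at-a-time backward scan keeps the sketch short; reading backwards in blocks would be faster on long lines.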

av_getpage
Extract individual web documents from a compressed .arc or .dat file.
This tool expects a filename and an offset. The filename should include the full path.

av_grep
Exactly like grep, but returns a status of zero when no matches are found.

av_isazip
Returns zero for an Alexa zip file, non-zero otherwise.

av_join
Perform an equality join on two files.
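An equality join pairs up lines from two inputs whose key fields are equal. The sketch below assumes the key is the first whitespace-separated field of each line and uses an in-memory hash index on the second input; neither detail comes from the tool itself, and the name equality_join is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def equality_join(left: Iterable[str], right: Iterable[str]) -> Iterator[str]:
    """Yield 'key left-fields right-fields' for every pair of lines sharing a key."""
    # Index the right-hand input by key (assumed: the first field).
    index: Dict[str, List[str]] = defaultdict(list)
    for line in right:
        fields = line.split()
        if fields:
            index[fields[0]].append(" ".join(fields[1:]))
    # Stream the left-hand input and emit one output line per matching pair.
    for line in left:
        fields = line.split()
        if not fields:
            continue
        for rest in index.get(fields[0], []):
            yield " ".join(fields + ([rest] if rest else []))
```

Joining ["a 1", "b 2", "c 3"] with ["a x", "c y", "c z"] produces "a 1 x", "c 3 y" and "c 3 z"; the unmatched key "b" is dropped. If both inputs are already sorted on the key, a merge join would avoid holding one side in memory.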

av_kill
Kill all processes that match a pattern.

av_makelibpthread

av_overwrite

av_ph
Prepends hostname as first item in output.

av_pp
Shortcut for av_prepend.

av_prepend
Add some fixed data to the beginning of each line.

av_prepend_random
Prepend a random number to the start of each line.

av_procarc

av_putlastline
Append the specified line to the file. Opens the file in exclusive mode to prevent collisions.

av_randomize
Randomize the order of lines in a file.

av_rearc
Re-write arc files.

av_rearcfull
Re-write all old arc files on a machine.

av_remdt
Make a fresh complete.cdx.gz file.

av_reverse
Reverse the order of characters in each line.

av_sample
Return every N'th line of the input file.
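Systematic sampling of this kind is a one-liner with itertools.islice. The sketch assumes the sample starts at the N'th line (lines N, 2N, 3N, ...), which the description leaves open; sample_every is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, Iterator


def sample_every(lines: Iterable[str], n: int) -> Iterator[str]:
    """Yield lines n, 2n, 3n, ... (1-based) from the input."""
    if n <= 0:
        raise ValueError("n must be positive")
    # islice skips the first n-1 items, then takes every n'th item after that.
    return islice(lines, n - 1, None, n)
```

Because islice is lazy, this works on an open file object without reading the whole file into memory.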

av_savearc

av_search
Undo a line break.

av_section
Read a section of a text file.

av_sort
Same semantics as sort, but much faster for large files.

av_spew
Executes “rsh host cat file” if it can deduce the host, and “cat file” otherwise.

av_split
Split the contents of one or more files into some other files.

av_strip_space
Remove leading and trailing blanks and, optionally, blank lines.
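A small Python sketch of the stripping behaviour, assuming "blanks" means spaces and tabs and that the option to drop blank lines is a simple on/off switch. The name strip_space and the drop_blank parameter are hypothetical, not the tool's actual interface.

```python
from typing import Iterable, Iterator


def strip_space(lines: Iterable[str], drop_blank: bool = False) -> Iterator[str]:
    """Yield each line without leading/trailing blanks or line terminators."""
    for line in lines:
        stripped = line.strip(" \t\r\n")
        if drop_blank and not stripped:
            continue  # optionally skip lines that are empty after stripping
        yield stripped
```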

av_strip_url
Strip urls of extraneous content.

av_unbreak
Unbreak the line.

av_uniq
Remove duplicate lines from input.

av_ziparc
Zip an entire arc file. Expects .arc file, returns .arc.gz file.

av_ziplines
Zip individual lines of an arc file.

bin_search
Do a binary search on a sorted text file. Return one or all matching lines.
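Binary search over a sorted text file can be done on byte offsets, without loading the file. The following is a sketch under stated assumptions, not the bin_search implementation: the file is sorted bytewise, a line "matches" when it starts with the key, and the names bin_search_file and all_matches are hypothetical.

```python
import os
from typing import List


def bin_search_file(path: str, key: bytes, all_matches: bool = True) -> List[bytes]:
    """Return the line(s) of a bytewise-sorted file that start with `key`."""
    matches: List[bytes] = []
    with open(path, "rb") as f:

        def seek_line_start(offset: int) -> None:
            """Position f at the first line starting at or after `offset`."""
            if offset == 0:
                f.seek(0)
            else:
                # Step back one byte and consume through the next newline; if
                # offset is already a line start, this lands exactly on it.
                f.seek(offset - 1)
                f.readline()

        # Find the smallest offset whose following line is >= key (or EOF).
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            seek_line_start(mid)
            line = f.readline()
            if not line or line.rstrip(b"\n") >= key:
                hi = mid
            else:
                lo = mid + 1

        # Matching lines are contiguous in a sorted file, so read until one fails.
        seek_line_start(lo)
        while True:
            line = f.readline()
            if not line or not line.startswith(key):
                break
            matches.append(line.rstrip(b"\n"))
            if not all_matches:
                break
    return matches
```

Each probe reads at most two lines, so the search touches on the order of log2(file size) positions regardless of how many lines the file holds.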

Terms of Use (10 Mar 2001)