Tool Documentation
The following is a list of tools created for the archive to enhance
and simplify data mining tasks.
av_ah
Shortcut for av_append HostName.
av_alpha_section
Return section of data according to alphabetic criteria. Assumes
input is sorted.
av_ap
Shortcut for av_append $P.
av_append
Append some fixed data as last item in output.
av_arcfilter
Move pages to stdout only if they match certain criteria.
av_bench
Print a single line giving the host and the Mb/sec achieved when
cp-ing to each directory.
av_break
Transforms all spaces into newlines. Reads from stdin, writes
to stdout.
av_cat
Copy a bunch of files to stdout, asynchronously.
av_chunk
Copy stdin to a series of files with the given size and prefix.
If postPrefix is specified each file is renamed after being
closed.
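The chunking behaviour described for av_chunk can be sketched in Python as follows. This is only an illustration, not the tool's implementation: the function name chunk_stream, its parameters, and the zero-padded numeric suffix used to name the files are all assumptions, since the actual naming scheme and command-line arguments are not documented here.

```python
import os


def chunk_stream(stream, size, prefix, post_prefix=None):
    """Copy a binary stream into files of at most `size` bytes each.

    Each file is first written as prefix + NNNN (hypothetical naming).
    If post_prefix is given, the file is renamed to post_prefix + NNNN
    only after it has been closed, so readers never see a partial file
    under its final name.
    """
    paths = []
    index = 0
    while True:
        data = stream.read(size)
        if not data:
            break
        suffix = f"{index:04d}"
        path = prefix + suffix
        with open(path, "wb") as out:
            out.write(data)
        # The file is closed here; rename it if a post-prefix was requested.
        if post_prefix is not None:
            final = post_prefix + suffix
            os.replace(path, final)
            path = final
        paths.append(path)
        index += 1
    return paths
```

Renaming after close is a common way to signal that a chunk is complete; a consumer watching for the post-prefix name can safely pick the file up.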
av_countchar
Count occurrences of a single byte value in a file.
av_cw
Compress whitespace.
av_dedupcdx
av_diff
Just like diff.
av_divvy
Distribute the lines of the input files over the output
files.
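One plausible reading of "distribute" is a round-robin deal of lines across the outputs. The sketch below assumes that policy; the description above does not say how av_divvy actually assigns lines, and the function name divvy and its parameters are hypothetical.

```python
def divvy(lines, n_outputs):
    """Deal lines round-robin into n_outputs buckets (assumed policy)."""
    buckets = [[] for _ in range(n_outputs)]
    for i, line in enumerate(lines):
        buckets[i % n_outputs].append(line)
    return buckets
```

Each bucket would then be written to its own output file.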
av_dup
Duplicate stdin to a number of files.
av_explore
Interactive program to explore an index. Resolves queries
and a whole lot more. See the AV_Explore FAQ.
av_getlastline
Print the last line of the file to stdout, then truncate
the file to remove that line.
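Used repeatedly, this operation treats a file as a simple stack of lines. A minimal Python sketch of the idea follows; the function name pop_last_line is hypothetical, and for brevity the sketch reads the whole file, whereas a tool meant for large files would more likely seek backwards from the end.

```python
def pop_last_line(path):
    """Return the last line of the file and truncate the file to drop it.

    Returns None when the file is empty. The returned line has no
    trailing newline.
    """
    with open(path, "rb+") as f:
        data = f.read()
        if not data:
            return None
        # Ignore one trailing newline when locating the start of the line.
        end = len(data) - 1 if data.endswith(b"\n") else len(data)
        start = data.rfind(b"\n", 0, end) + 1
        f.truncate(start)
        return data[start:end].decode()
```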
av_getpage
Extract individual web documents from a compressed .arc or .dat
file.
This tool expects a filename and an offset. The filename should
include the full path.
av_grep
Exactly like grep, but returns a status of zero when no
matches are found.
av_isazip
Returns zero for an Alexa zip file, non-zero otherwise.
av_join
Perform an equality join on two files.
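An equality join pairs up lines from two inputs that share the same key. The sketch below assumes the key is the first space-separated field of each line and uses an in-memory hash join; neither the key convention nor the join strategy of av_join is stated here, and the name equality_join is hypothetical.

```python
def equality_join(left_lines, right_lines):
    """Join lines whose first space-separated field is equal.

    Output lines have the form "key left_rest right_rest".
    """
    # Index the right-hand input by key.
    index = {}
    for line in right_lines:
        key, _, rest = line.rstrip("\n").partition(" ")
        index.setdefault(key, []).append(rest)

    joined = []
    for line in left_lines:
        key, _, rest = line.rstrip("\n").partition(" ")
        for other in index.get(key, []):
            joined.append(f"{key} {rest} {other}")
    return joined
```

If both inputs are already sorted by key, a merge join that streams through the two files would avoid holding one side in memory.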
av_kill
Kill all processes that match a pattern.
av_makelibpthread
av_overwrite
av_ph
Prepends hostname as first item in output.
av_pp
Shortcut for av_prepend.
av_prepend
Add some fixed data to the beginning of each line.
av_prepend_random
Prepend a random number to the start of each line.
av_procarc
av_putlastline
Append the specified line to the file. Opens the file in exclusive
mode to prevent collisions.
av_randomize
Randomize the order of lines in a file.
av_rearc
Re-write arc files.
av_rearcfull
Re-write all old arc files on a machine.
av_remdt
Make a fresh complete.cdx.gz file.
av_reverse
Reverse the order of characters in each line.
av_sample
Return every N'th line of the input file.
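Systematic sampling of this kind is a one-liner in Python. The sketch assumes the selected lines are numbers N, 2N, 3N and so on (counting from 1); whether av_sample starts at the first or the N'th line is not stated. The name sample_every is hypothetical.

```python
def sample_every(lines, n):
    """Return lines N, 2N, 3N, ... of the input (1-based counting)."""
    return [line for i, line in enumerate(lines, start=1) if i % n == 0]
```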
av_savearc
av_search
Undo a line break.
av_section
Read a section of a text file.
av_sort
Same semantics as sort, but much faster for large files.
av_spew
Executes rsh host cat file if it can deduce the host and
cat file otherwise.
av_split
Split the contents of one or more files into some other
files.
av_strip_space
Remove leading and trailing blanks and, optionally, blank
lines.
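The behaviour described for av_strip_space can be approximated in a few lines of Python. Here "blanks" is taken to mean spaces, tabs and line terminators, and the optional removal of blank lines is exposed as a hypothetical drop_blank parameter; the real tool's options are not documented here.

```python
def strip_space(lines, drop_blank=False):
    """Strip leading and trailing blanks; optionally drop empty lines."""
    out = []
    for line in lines:
        stripped = line.strip(" \t\r\n")
        if drop_blank and not stripped:
            continue
        out.append(stripped)
    return out
```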
av_strip_url
Strip URLs of extraneous content.
av_unbreak
Unbreak the line.
av_uniq
Remove duplicate lines from input.
av_ziparc
Zip an entire arc file. Expects .arc file, returns .arc.gz file.
av_ziplines
Zip individual lines of an arc file.
bin_search
Do a binary search on a sorted text file. Return one or all
matching lines.
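The lookup can be sketched in Python over a list of sorted lines. The sketch assumes a line matches when its first space-separated field equals the search key, and it works on lines already held in memory; bin_search itself presumably seeks within the file, and its matching rule is not documented here. The names search_sorted and lower_bound are hypothetical.

```python
def _key(line):
    # Assumed key: the first space-separated field of the line.
    return line.split(" ", 1)[0]


def lower_bound(lines, key):
    """Index of the first line whose key is >= the search key."""
    lo, hi = 0, len(lines)
    while lo < hi:
        mid = (lo + hi) // 2
        if _key(lines[mid]) < key:
            lo = mid + 1
        else:
            hi = mid
    return lo


def search_sorted(lines, key, all_matches=False):
    """Return the first matching line, or all of them, as a list."""
    i = lower_bound(lines, key)
    matches = []
    while i < len(lines) and _key(lines[i]) == key:
        matches.append(lines[i])
        if not all_matches:
            break
        i += 1
    return matches
```

Because the input is sorted, all lines with the same key are adjacent, so finding the first match and scanning forward is enough to return every match.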