[R] conditionally merging adjacent rows in a data frame
Titus von der Malsburg
malsburg at gmail.com
Wed Dec 9 13:59:50 CET 2009
On Wed, Dec 9, 2009 at 12:11 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> Here are a couple of solutions. The first uses by and the second sqldf:
Brilliant! Now I have a whole collection of solutions. I did a simple
performance comparison on a data frame with 7929 rows.
The results were as follows, in seconds (loading the required packages is not
included in the measurements):
times <- c(0.248, 0.551, 41.080, 0.16, 0.190)
names(times) <- c("aggregate","summaryBy","by+transform","sqldf","tapply")
barplot(times, log="y", ylab="log(s)")
So sqldf clearly wins, followed by tapply and aggregate. summaryBy is slower
than necessary because it computes both mean /and/ sum for both x and dur.
by+transform presumably suffers from the construction of many intermediate data
frames.
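For anyone who wants to repeat this kind of comparison, here is a minimal
sketch of how a single timing can be taken with system.time. The data frame is
synthetic: the column names roi, dur and x match the functions in this post,
but the values are invented, and merge_runs is only a hypothetical stand-in for
whichever candidate is being timed. The numbers it produces will therefore not
match the ones above.

```r
# Synthetic stand-in for the fixation data; all values are invented.
set.seed(1)
n <- 7929
d <- data.frame(roi = sample(1:5, n, replace = TRUE),
                dur = runif(n, 100, 400),
                x   = runif(n, 0, 1024))

# Hypothetical candidate: merge runs of equal roi,
# summing dur and averaging x within each run.
merge_runs <- function(d) {
  first <- c(TRUE, diff(d$roi) != 0)   # TRUE where a new run starts
  run   <- cumsum(first)               # run number for every row
  d2 <- d[first, ]
  d2$dur <- as.vector(tapply(d$dur, run, sum))
  d2$x   <- as.vector(tapply(d$x, run, mean))
  d2
}

# Elapsed wall-clock time in seconds for one call.
elapsed <- system.time(res <- merge_runs(d))[["elapsed"]]
elapsed
```

A single call is noisy for the faster candidates, so repeating the call a few
times and averaging would give steadier numbers.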
Are there any canonical places where R recipes are collected? If so, I would
write up a summary.
These were the competitors:
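# Gary's and Nikhil's aggregate solution:
aggregate.fixations1 <- function(d) {
  idx <- c(TRUE,diff(d$roi)!=0)   # TRUE where roi differs from the previous row
  d2 <- d[idx,]                   # keep the first row of each run
  idx <- cumsum(idx)              # run number for every row
  # [[2]] extracts the aggregated values as a plain vector
  d2$dur <- aggregate(d$dur, list(idx), sum)[[2]]
  d2$x <- aggregate(d$x, list(idx), mean)[[2]]
  d2
}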
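The run index that this and the other candidates build from diff and cumsum
may be easier to follow on a toy vector. The roi values here are made up, and
the names starts and run_id are used only for this illustration.

```r
roi <- c(1, 1, 2, 2, 2, 1, 3, 3)    # made-up region-of-interest codes

starts <- c(TRUE, diff(roi) != 0)   # TRUE on the first row of each run
run_id <- cumsum(starts)            # consecutive run number for every row

starts
# TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
run_id
# 1 1 2 2 2 3 4 4
```

Note that the second run of 1s gets its own number (3), so only adjacent rows
with the same roi end up in the same group.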
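# Marek's summaryBy solution:
library(doBy)
aggregate.fixations2 <- function(d) {
  idx <- c(TRUE,diff(d$roi)!=0)
  d2 <- d[idx,]
  d$idx <- cumsum(idx)
  # one row per run, with sum and mean computed for both dur and x
  s <- summaryBy(dur+x~idx, data=d, FUN=c(sum, mean))
  d2$dur <- s$dur.sum   # overwrite with the per-run sum
  d2$x <- s$x.mean      # overwrite with the per-run mean
  d2
}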
# Gabor's by+transform solution:
aggregate.fixations3 <- function(d) {
  idx <- cumsum(c(TRUE,diff(d$roi)!=0))
  d2 <- do.call(rbind, by(d, idx, function(x)
    transform(x, dur = sum(dur), x = mean(x))[1,,drop = FALSE ]))
  d2
}
# Gabor's sqldf solution:
library(sqldf)
aggregate.fixations4 <- function(d) {
  idx <- c(TRUE,diff(d$roi)!=0)
  d2 <- d[idx,]
  d$idx <- cumsum(idx)
  # one row per run; "order by" keeps the rows aligned with d2
  r <- sqldf("select sum(dur) dur, avg(x) x from d group by idx order by idx")
  d2$dur <- r$dur
  d2$x <- r$x
  d2
}
# Titus' solution using plain old tapply:
aggregate.fixations5 <- function(d) {
  idx <- c(TRUE,diff(d$roi)!=0)
  d2 <- d[idx,]
  idx <- cumsum(idx)
  d2$dur <- tapply(d$dur, idx, sum)
  d2$x <- tapply(d$x, idx, mean)
  d2
}
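To check what any of these functions should return, a small hand-checkable
input helps. The following is only a sketch on invented data, with the tapply
approach written out inline rather than calling one of the functions above;
the name merged is used just for this example.

```r
# Invented toy data: three runs of roi (1,1), (2,2,2) and (1).
d <- data.frame(roi = c(1, 1, 2, 2, 2, 1),
                dur = c(100, 150, 200, 50, 50, 300),
                x   = c(10, 20, 30, 40, 50, 60))

first <- c(TRUE, diff(d$roi) != 0)   # first row of each run
run   <- cumsum(first)               # run number for every row

merged <- d[first, ]
merged$dur <- as.vector(tapply(d$dur, run, sum))   # total duration per run
merged$x   <- as.vector(tapply(d$x, run, mean))    # mean position per run
merged
#   roi dur  x
# 1   1 250 15
# 3   2 300 40
# 6   1 300 60
```

The row names 1, 3 and 6 are those of the first row of each run in the input.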