[R] Fwd: conditionally merging adjacent rows in a data frame
Marek Janad
marek.janad at gmail.com
Thu Dec 10 00:41:22 CET 2009
I've also made some comparisons and, taking execution time into
account, sqldf wins. summaryBy is better than aggregate in some specific
situations I have met in practice. I present such a situation below. It
assumes that there are at least two grouping variables, each with a
large number of levels.
n <- 100000
grp1 <- sample(1:750, n, replace=TRUE)
grp2 <- sample(1:750, n, replace=TRUE)
d <- data.frame(x=rnorm(n), y=rnorm(n), grp1=grp1, grp2=grp2)
# sqldf
library(sqldf)
Rprof('prof');
sqldf("select grp1, grp2, avg(x), avg(y) from d group by grp1, grp2")
Rprof(NULL);
summaryRprof('prof')
#by
#do.call(rbind, by(d, list(d$grp1, d$grp2), function(x)
#  transform(x, x = mean(x), y = mean(y))[1,,drop = FALSE ]))
#doBy
library(doBy)
Rprof('prof');
summaryBy(x+y~grp1+grp2, data=d, FUN=c(mean))
Rprof(NULL);
summaryRprof('prof')
#aggregate
Rprof('prof');
aggregate(d, list(d$grp1, d$grp2), function(x)mean(x))
Rprof(NULL);
summaryRprof('prof')
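For a quick check without the profiler, the same kind of comparison can be sketched with system.time(). This is only a minimal illustration under assumptions of my own: the data set is much smaller than the one above, the group counts (50 levels each) are hypothetical, and only base R is timed so that it runs without sqldf or doBy installed.

```r
# Hypothetical smaller benchmark: base R only, so no extra packages are needed.
set.seed(1)
n <- 10000
d <- data.frame(x = rnorm(n), y = rnorm(n),
                grp1 = sample(1:50, n, replace = TRUE),
                grp2 = sample(1:50, n, replace = TRUE))

# Group means of x and y for every observed (grp1, grp2) combination.
t_agg <- system.time(
  res <- aggregate(cbind(x, y) ~ grp1 + grp2, data = d, FUN = mean)
)

print(t_agg["elapsed"])  # wall-clock seconds; varies by machine
head(res)
```

The other candidates can be wrapped in system.time() the same way; timings from a single run are noisy, so repeating each call a few times gives a fairer picture.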
---------- Forwarded message ----------
From: Nikhil Kaza <nikhil.list at gmail.com>
Date: 2009/12/9
Subject: Re: [R] conditionally merging adjacent rows in a data frame
To: Titus von der Malsburg <malsburg at gmail.com>
Cc: r-help at r-project.org
This is great!! Sqldf is exactly the kind of thing I was looking for,
for other stuff.
I suppose you can speed up both functions 1 and 5 using aggregate and
tapply only once, as was suggested earlier. But it comes at the
expense of readability.
Nikhil
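One possible reading of "aggregate only once" is sketched below. It is an assumption about what was meant, not code from the thread: the sample data are made up, and the column names roi, dur and x simply follow the functions quoted further down. A single aggregate() call computes both sum and mean for both columns, and the wanted statistics are then picked out of the resulting matrix columns.

```r
# Hypothetical sample data; roi/dur/x mirror the columns used in the thread.
d <- data.frame(roi = c(1, 1, 2, 2, 2, 1),
                dur = c(100, 150, 200, 50, 50, 300),
                x   = c(10, 12, 30, 32, 34, 11))

# Identifier for each run of adjacent rows with the same roi.
d$idx <- cumsum(c(TRUE, diff(d$roi) != 0))

# One aggregate() call; each summarised column becomes a two-column matrix
# with columns "sum" and "mean".
agg <- aggregate(cbind(dur, x) ~ idx, data = d,
                 FUN = function(v) c(sum = sum(v), mean = mean(v)))

# Keep the first row of every run, then fill in the per-run statistics.
d2 <- d[!duplicated(d$idx), c("roi", "idx")]
d2$dur <- agg$dur[, "sum"]
d2$x   <- agg$x[, "mean"]
d2
```

This does compute two statistics that are thrown away (the mean of dur and the sum of x), and the matrix-column indexing is less obvious than two separate calls, which is the readability cost mentioned above.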
On 9 Dec 2009, at 7:59AM, Titus von der Malsburg wrote:
> On Wed, Dec 9, 2009 at 12:11 AM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>>
>> Here are a couple of solutions. The first uses by and the second sqldf:
>
> Brilliant! Now I have a whole collection of solutions. I did a simple
> performance comparison with a data frame that has 7929 lines.
>
> The results were as follows (loading the appropriate packages is not included
> in the measurements):
>
> times <- c(0.248, 0.551, 41.080, 0.16, 0.190)
> names(times) <- c("aggregate","summaryBy","by+transform","sqldf","tapply")
> barplot(times, log="y", ylab="log(s)")
>
> So sqldf clearly wins, followed by tapply and aggregate. summaryBy is slower
> than necessary because it computes both mean /and/ sum for both x and dur.
> by+transform presumably suffers from the construction of many intermediate
> data frames.
>
> Are there any canonical places where R recipes are collected? If so, I would
> write up a summary.
>
> These were the competitors:
>
> # Gary's and Nikhil's aggregate solution:
>
> aggregate.fixations1 <- function(d) {
>
> idx <- c(TRUE,diff(d$roi)!=0)
> d2 <- d[idx,]
>
> idx <- cumsum(idx)
> d2$dur <- aggregate(d$dur, list(idx), sum)[2]
> d2$x <- aggregate(d$x, list(idx), mean)[2]
>
> d2
> }
>
> # Marek's summaryBy:
>
> library(doBy)
>
> aggregate.fixations2 <- function(d) {
>
> idx <- c(TRUE,diff(d$roi)!=0)
> d2 <- d[idx,]
>
> d$idx <- cumsum(idx)
> d2$r <- summaryBy(dur+x~idx, data=d, FUN=c(sum,
> mean))[c("dur.sum", "x.mean")]
> d2
> }
>
> # Gabor's by+transform solution:
>
> aggregate.fixations3 <- function(d) {
>
> idx <- cumsum(c(TRUE,diff(d$roi)!=0))
>
> d2 <- do.call(rbind, by(d, idx, function(x)
> transform(x, dur = sum(dur), x = mean(x))[1,,drop = FALSE ]))
>
> d2
> }
>
> # Gabor's sqldf solution:
>
> library(sqldf)
>
> aggregate.fixations4 <- function(d) {
>
> idx <- c(TRUE,diff(d$roi)!=0)
> d2 <- d[idx,]
>
> d$idx <- cumsum(idx)
> d2$r <- sqldf("select sum(dur), avg(x) x from d group by idx")
>
> d2
> }
>
> # Titus' solution using plain old tapply:
>
> aggregate.fixations5 <- function(d) {
>
> idx <- c(TRUE,diff(d$roi)!=0)
> d2 <- d[idx,]
>
> idx <- cumsum(idx)
> d2$dur <- tapply(d$dur, idx, sum)
> d2$x <- tapply(d$x, idx, mean)
>
> d2
> }
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
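The functions quoted above were posted without data, so here is a small made-up data frame for trying them out, together with the tapply approach written out step by step. The values are hypothetical and chosen only so that the merged result is easy to verify by hand; as.vector() is added here to drop the names that tapply attaches.

```r
# Hypothetical fixation data: roi is the region of interest, dur the fixation
# duration, x the horizontal position. Values are made up for illustration.
d <- data.frame(roi = c(1, 1, 2, 2, 2, 1, 3),
                dur = c(100, 150, 200, 50, 50, 300, 120),
                x   = c(10, 12, 30, 32, 34, 11, 50))

# TRUE at the first row of every run of identical adjacent roi values.
starts <- c(TRUE, diff(d$roi) != 0)

# Run identifier per row: 1 1 2 2 2 3 4
run_id <- cumsum(starts)

# Keep one row per run, then replace dur and x with per-run summaries.
merged <- d[starts, ]
merged$dur <- as.vector(tapply(d$dur, run_id, sum))   # total duration per run
merged$x   <- as.vector(tapply(d$x, run_id, mean))    # mean position per run
merged
```

Note that the second run with roi 1 stays separate from the first one: only adjacent rows are merged, which is why the grouping uses run_id rather than roi itself.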
--
Marek