- Thread starter trinker
- Start date
- Tags analysis data fundamental graphics

I think you'll find this kind of info in these places which has a rich history of talkstats users:

http://www.talkstats.com/showthread.php/18603-Share-your-functions-amp-code

http://www.talkstats.com/showthread.php/5246-Info-for-R-users-(Links-Manuals-Books-etc.)

http://www.talkstats.com/showthread.php/22597-Today-I-Learned-____

http://www.talkstats.com/showthread.php/29320-R-Graphics-Beautiful-graphics-thread

http://www.talkstats.com/showthread.php/18603-Share-your-functions-amp-code

http://www.talkstats.com/showthread.php/5246-Info-for-R-users-(Links-Manuals-Books-etc.)

http://www.talkstats.com/showthread.php/22597-Today-I-Learned-____

http://www.talkstats.com/showthread.php/29320-R-Graphics-Beautiful-graphics-thread

I didn't mean to be "parading". Just sharing tips and tricks which helped me as a learner. How do I merge this thread into the "Share your functions and code"? Please advise.

But it's limited for sure: http://stackoverflow.com/q/25355310/1000343

I used the

Code:

```
hijack <- function (FUN, ...) {
.FUN <- FUN
args <- list(...)
invisible(lapply(seq_along(args), function(i) {
formals(.FUN)[[names(args)[i]]] <<- args[[i]]
}))
.FUN
}
.data.frame <- hijack(data.frame, check.names = FALSE, stringsAsFactors = FALSE)
(dat <- data.frame(`bad name` = 1:3, x2 = c("a", "b", "c"))); str(dat)
[COLOR="silver"]## bad.name x2
## 1 1 a
## 2 2 b
## 3 3 c
## 'data.frame': 3 obs. of 2 variables:
## $ bad.name: int 1 2 3
## $ x2 : Factor w/ 3 levels "a","b","c": 1 2 3[/COLOR]
(dat2 <- .data.frame(`bad name` = 1:3, x2 = c("a", "b", "c"))); str(dat2)
[COLOR="silver"]## bad name x2
## 1 1 a
## 2 2 b
## 3 3 c
## 'data.frame': 3 obs. of 2 variables:
## $ bad name: int 1 2 3
## $ x2 : chr "a" "b" "c"[/COLOR]
```

https://github.com/trinker/qdapRegex/tree/master/R

I just hijack one function and it makes things pretty easy. It makes me feel like a ninja

Code:

```
ransom <- function(text, open=TRUE) {
root <- "http://contactsheet.org/cgi-bin/ransom.pl?thedata="
text <- gsub("\\s+", "+", text)
theurl <- paste0(root, text)
require(XML)
out <- capture.output(htmlTreeParse(theurl, useInternalNodes = TRUE) )
temp <- tempdir()
outfile <- file.path(temp, "out.html")
cat(out, file=outfile)
if (open) browseURL(outfile)
return(invisible(outfile))
}
x <- "Give us your hard drives if you ever want to see your children again. Signed, The Robots."
ransom(x)
```

It is mostly Windows friendly but other people could use and/or modify to their own purposes:

Code:

```
path_ <-
function(x=NULL, copy2clip = TRUE){
if (is.null(x)) x <- reports::WP()
m <- strsplit(x, "(?<=[/|\\\\])", perl=TRUE)[[1]]
o <- paste0(lapply(2*0:c(length(m)-1), function(i) paste(rep(" ", i), collapse="")), "-> ", m)
if (copy2clip) try(cat(paste(o, collapse="\n"), file="clipboard"))
invisible(paste(o, collapse="\n"))
}
```

So if the following was on your clipboard:

Code:

`Copy\CLaRI_Engineering_Literacy_Project_2014-15\META_PROJECT\IRB\Documents`

Code:

```
-> Copy/
-> CLaRI_Engineering_Literacy_Project_2014-15/
-> META_PROJECT/
-> IRB/
-> Documents
```

Here is code that will allow you to produce random data from a sample of data. A little recognition would be great if you use the code.

Thanks.

http://fullforceanalytics.us/Blog.html

I wanted to learn how kmeans works (and believe I have). I created a crude 2 k model with visual plotter to understand the process. Basically, you start with 2 random spots and calculate the distance of each point to the center spots. Which ever spot is closer the point get's classified into. You then recalculate 2 new centers based on the last assignment and repeat until there is no change in assignment. Critiques, rebukes, and improvements welcomed. It is designed more for me to learn so each iteration requires a key press.

I kept everything in base install R. Also...Not optimized or generalizable to k clusters (these improvements are welcomed). Also relies on global variables.

Code:

```
#Fake data
x1_p <-rnorm(100)
x2_p <-rnorm(100)
x1 <- c(x1_p, x1_p + rnorm(100, 3, .2))
x2 <- c(x2_p, x2_p + rnorm(100, 3, .1))
# plot data
plot(x1, x2)
k <- 2
# random center finder
random_center <- function(){
start_x1 <- sample(seq(min(x1), max(x1), by=.02), k)
start_x2 <- sample(seq(min(x2), max(x2), by=.02), k)
split(data.frame(cbind(start_x1, start_x2)), 1:2)
}
# k-colorizing plotting function
replotter <- function(){
plot(x1, x2)
points(centers[[1]][1], centers[[1]][2], pch=19, cex=2, col="red")
points(centers[[2]][1], centers[[2]][2], pch=19, cex=2, col="blue")
cents <- do.call(rbind, centers)
text(cents[, 1], cents[, 2]+.4, labels=c("1", "2"))
}
converged <- FALSE
centers <- random_center()
assignment2 <- NULL
replotter()
# kmeans loop
while(!isTRUE(converged)){
replotter()
assignment <- unlist(Map(function(a, b) {
p<-which.min(c(
dist(cbind(c(centers[[1]][1], a), c(centers[[1]][2], b))),
dist(cbind(c(centers[[2]][1], a), c(centers[[2]][2], b)))
))
points(a, b, col = ifelse(p==1, "red", "blue"), pch=19)
p
}, x1, x2))
invisible(readline(prompt="Press [enter] to continue"))
if (length(rle(assignment)$lengths) == 1){
centers <- random_center()
} else {
centers <- lapply(split(data.frame(cbind(x1, x2)), assignment), colMeans)
if (!is.null(assignment2)) converged <- all.equal(assignment2, assignment)
assignment2 <- assignment
}
}
assignment
```

One approach I would recommend is to think of this like Expectation-Maximization. Thus, break it into 2 functions E and M. The E function will take the random centers (initially) and do the assignments based on your distance metric (who is closest to each center). The M function will then recalculate the centers based on the new assignments. Thus, the body of k-means becomes a simple repeat and E and M until you get convergence.

Code:

```
centers <- random_centers(k) # initialize or select random points or other start methods
while (test_convergence(...))
{
assignments <- E(centers, d) # d is the distance function to use
visualize(assignments, centers)
centers <- M(assignments, d) # optimize the centers again
visualize(assignments, centers)
}
```

Anyway, I don't know if there's a function that will create binary dummy variables for you from a vector--and some QA would need to handle NA values--but here is my dummy function. The only things I found online were to use model.matrix, so this basically wraps that technique.

Code:

```
dummy <- function(x)
{
nm <- sort(unique(x))
x <- data.frame(f = factor(x, nm, nm))
x <- model.matrix(~ 0 + f, x)
colnames(x) <- nm
x
}
```

Code:

`diag(length(unique(x)))[rep.int(unique(x),table(x)),]`

Code:

`diag(length(unique(x)))[match(x,unique(x)),]`