R Programming

Does subsetting (matrices or arrays) always perform a partial copy? (lemmy.ca)

submitted 2 years ago* (last edited 2 years ago) by [email protected] to c/[email protected]


Some large datasets are pushing memory and some functions I'm writing to the limit. I wanted to ask some questions about subsetting, of matrices and arrays in particular:

Does defining a variable as a subset of another lead to a copy? For instance:

x <- matrix(rnorm(20*30), nrow=20, ncol=30)
y <- x[, 1:10]

Some exploration with object_size from pryr seems to indicate that a copy is made when y is created, but I'd like to be sure.
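One way to check this in base R, as a rough sketch rather than a definitive test: compare the subset with a plain assignment. The name z is introduced here only for comparison, and the address check relies on tracemem(), which is only available when R was built with memory profiling, so that part is guarded.

```r
x <- matrix(rnorm(20 * 30), nrow = 20, ncol = 30)
y <- x[, 1:10]  # subset of x
z <- x          # plain assignment, for comparison (hypothetical extra variable)

# Footprint of each object on its own; y carries storage for 20 * 10 doubles
print(object.size(x))
print(object.size(y))

# Address comparison; requires an R build with memory profiling enabled
if (capabilities("profmem")) {
  addr_x <- tracemem(x)
  addr_y <- tracemem(y)
  addr_z <- tracemem(z)
  print(identical(addr_x, addr_z))  # is z the same underlying object as x?
  print(identical(addr_x, addr_y))  # is y the same underlying object as x?
  untracemem(x)
  untracemem(y)
}
```

If the two addresses differ for y but match for z, that is consistent with the subset being a newly allocated matrix, while plain assignment only shares the existing one until either name is modified.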

If I pass a subset of a matrix/array as an argument to a function, does it get copied before the function runs? For instance, in

x <- matrix(rnorm(20*30), nrow=20, ncol=30)
y <- dnorm(0, mean=x[,1:10], sd=1)

I wonder if the data in x[,1:10] are copied and then given as input to dnorm.
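On the timing part of this question, R evaluates function arguments lazily, so the subset expression is only evaluated when the function first uses the argument. The sketch below illustrates that ordering; probe, note and events are hypothetical names made up for the illustration, and it says nothing about what dnorm does internally beyond the fact that it must evaluate mean at some point.

```r
x <- matrix(rnorm(20 * 30), nrow = 20, ncol = 30)

events <- character()
note <- function(msg) events <<- c(events, msg)  # append to the event log

# Hypothetical helper: records when its body starts and when the argument is used
probe <- function(m) {
  note("function body started")
  d <- dim(m)  # first use of m: the subset expression is evaluated here
  note("argument used")
  d
}

res <- probe({
  note("subset evaluated")
  x[, 1:10]
})

print(events)
print(res)
```

So the temporary subset is created when the argument is first needed, not before the call, and it becomes eligible for garbage collection once nothing refers to it any more.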

I've heard that data.table allows one to work with subsets without copies being made (unless necessary), but it seems that one is constrained to two dimensions only – no arrays – that way.

Cheers!


[–] morcution 2 points 2 years ago* (last edited 2 years ago)

+1 for parquet and arrow. If you're pushing memory limits, it's better to just treat it as a completely out-of-memory problem. If you can split the data into multiple Parquet files with Hive-style or directory partitioning, it will be more efficient. You don't want the Parquet files to be too small, though (I've heard people say 1 GB per file is ideal; colleagues at work like 512 MB per file, but that's on an AWS setup).

A bonus is that once you've learned the packages, the workflow will be the same for any dataset that doesn't fit in memory.
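A minimal sketch of what that workflow might look like, assuming the arrow and dplyr packages are installed. The data frame, the chunk column and the output directory are all made up for illustration; real data would be far larger and partitioned on a column that matches how it is queried.

```r
library(arrow)
library(dplyr)

# Made-up example data; 'chunk' is a hypothetical partitioning column
df <- data.frame(chunk = rep(1:3, each = 100), value = rnorm(300))

# Write one directory per chunk value (Hive-style names such as chunk=1)
out_dir <- file.path(tempdir(), "partitioned_demo")
write_dataset(df, out_dir, format = "parquet", partitioning = "chunk")

# Opening the dataset does not load the data into memory
ds <- open_dataset(out_dir)

# The query is evaluated by arrow; collect() brings only the result into R
res <- ds %>%
  filter(chunk == 2) %>%
  summarise(mean_value = mean(value), n = n()) %>%
  collect()

print(res)
```

Because the filter is on the partitioning column, arrow can skip the files for the other chunks, which is the main reason partitioning helps with data that doesn't fit in memory.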