As we practice and teach Data Science, we continuously learn, unlearn and revise old and new concepts.
What are some freely available reading lists that help with this or give a great intro to Data Science?
Another great one, which covers key topics like clustering and dimensionality reduction, is this book/course from the University of Utah: cs.utah.edu/~jeffp/teachin…
(3/n)
In addition to big data, this one also goes into data visualization, which is a big part of data understanding and communication: dan.bjorkegren.com/bigdata/
And finally, the blog Towards Data Science on Medium is probably the best resource when in need of a brief recap on almost any topic!
(towardsdatascience.com)
Tell me what resources you find helpful! :)
(8/n=8)
For some #MondayMotivation, let's create a great resource of fellowships, workshops and communities in Data Science.
I'll start with some!
(1/n)
The Women in Data Science Conference (widsconference.org) is a great place to learn, network and grow.
2/n
The ACM SIGHPC Computational & Data Science Fellowships (sighpc.org/fellowships), which have an upcoming deadline, foster diversity in Data Science and allied fields.
3/n
Happy Friday!! Today I'd like to describe two important approaches to data privacy research and applications: synthetic data and differential privacy. I hope to generate more interest in this area among researchers and practitioners!
1/n Data privacy and data confidentiality are important topics for statisticians, computer scientists, and really, anyone who offers their own data and consumes data!
2/n Statistical agencies, in particular, are under legal obligations to protect the privacy and confidentiality of survey and census respondents, e.g. U.S. Title 26.
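Since the thread itself doesn't include code, here is a minimal Python sketch of one standard differential privacy tool, the Laplace mechanism, applied to a counting query. The function name, synthetic data, and epsilon value are all illustrative assumptions, not from the thread:

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Release a differentially private count.

    A counting query has sensitivity 1 (adding or removing one
    respondent changes the count by at most 1), so we add Laplace
    noise with scale sensitivity/epsilon = 1/epsilon.
    """
    true_count = sum(predicate(x) for x in data)
    rng = np.random.default_rng()
    noise = rng.laplace(0.0, 1.0 / epsilon)
    return true_count + noise

# Illustrative example: how many respondents report income over 50k?
rng = np.random.default_rng(0)
incomes = rng.normal(55_000, 15_000, size=1_000)
print(laplace_count(incomes, lambda x: x > 50_000, epsilon=0.5))
```

Smaller epsilon means more noise and stronger privacy; the released count stays useful in aggregate while protecting any individual respondent.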
Happy Thursday! Today, I'd like to introduce and discuss various approaches, innovations, and resources for introducing Bayesian statistics to undergraduates! I am sure I will miss something good, so feel free to add yours or the ones you know.
First, a little bit of history. Bayesian methods became widely used thanks to computational advances in the early 1990s, including the Gibbs sampler and Metropolis-Hastings algorithms (e.g. Gelfand and Smith (1990)).
However, even before that revolutionary advance, innovative educators had designed ways to introduce Bayes to students: e.g. emphasizing the intuition of specifying a prior for a data analysis problem while relying on numerical integration (Franck et al. (1988)).
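For readers who haven't seen Metropolis-Hastings in action, here's a bare-bones random-walk sketch (my illustration, not from the thread) targeting the Beta-Binomial posterior that often appears in intro Bayes courses; the data and tuning values are made up:

```python
import numpy as np

# Observed data: 7 successes out of 10 trials
successes, trials = 7, 10

def log_posterior(theta):
    # Flat Beta(1,1) prior + binomial log-likelihood
    if not 0 < theta < 1:
        return -np.inf
    return successes * np.log(theta) + (trials - successes) * np.log(1 - theta)

rng = np.random.default_rng(42)
theta = 0.5  # starting value
samples = []
for _ in range(10_000):
    proposal = theta + rng.normal(0, 0.1)  # symmetric random-walk proposal
    # Accept with probability min(1, posterior ratio)
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples.append(theta)

# Analytic posterior is Beta(8, 4), mean 8/12 ≈ 0.667
print(np.mean(samples[1000:]))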
Let’s talk vectorization! You may have heard about or experienced how simple NumPy array ops (such as dot product) run significantly faster than for loops or list comprehension in Python. How? Why? Thread incoming.
Suppose we are doing a dot product on two n-dim vectors. In a Python for loop, scalars are individually loaded into registers, and operations are performed on the scalar level. Ignoring the sum, this gives us n multiplication operations.
NumPy makes this faster by employing vectorization: multiple scalars are loaded into registers at once, and you get many products for the price of one instruction. This is SIMD (single instruction, multiple data), the backbone of NumPy vectorization.
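A quick way to see the difference yourself. Exact timings depend on your hardware and NumPy build, but the vectorized dot product is typically orders of magnitude faster:

```python
import time
import numpy as np

n = 1_000_000
rng = np.random.default_rng(0)
a, b = rng.random(n), rng.random(n)

# Pure-Python loop: one scalar multiply per iteration
t0 = time.perf_counter()
total = 0.0
for x, y in zip(a, b):
    total += x * y
loop_time = time.perf_counter() - t0

# NumPy dot: tight C loop, using SIMD where available
t0 = time.perf_counter()
total_np = np.dot(a, b)
np_time = time.perf_counter() - t0

print(f"loop: {loop_time:.3f}s  numpy: {np_time:.4f}s")
```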
Today I will be talking about some of the data structures we use regularly when doing data science work. I will start with NumPy's ndarray.
What is an ndarray? It's NumPy's abstraction for describing an array, or a group of numbers. In math terms, "array" is a catch-all term for matrices and vectors. Behind the scenes, it essentially describes memory using several key attributes (see the snippet after this list):
* pointer: the memory address of the first byte in the array
* type: the kind of elements in the array, such as floats or ints
* shape: the size of each dimension of the array (ex: 5 x 5 x 5)
* strides: number of bytes to skip in memory to move to the next element along each dimension
* flags: metadata about memory layout, e.g. whether the array is C-contiguous or writeable
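Here's a small sketch for inspecting these attributes on a real array; the 5 x 5 x 5 float64 example is mine, chosen to match the shape above:

```python
import numpy as np

arr = np.zeros((5, 5, 5), dtype=np.float64)

print(arr.ctypes.data)  # pointer: address of the first byte
print(arr.dtype)        # type: float64
print(arr.shape)        # shape: (5, 5, 5)
print(arr.strides)      # strides: (200, 40, 8) bytes per step in each dim
print(arr.flags)        # flags: C_CONTIGUOUS, WRITEABLE, etc.
```

The strides follow from the shape and dtype: each float64 is 8 bytes, a row of 5 is 40 bytes, and a 5 x 5 plane is 200 bytes, which is why reshapes and transposes can often be done without copying any data.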