My statistician/data-science friends all tell me that if you're using a spreadsheet, you're not doing science, you're courting disaster. Real analysis requires Python, or, possibly, #julialang.
Despite these warnings, plenty of mission-critical work gets done in spreadsheets, and (in support of these warnings), it can go horribly, horribly wrong.
Think years of destructive, crushing austerity - costing real human lives; trashing cities, regions, whole countries - due to spreadsheet formula errors.
But still, we keep using spreadsheets to do real work. I did it YESTERDAY. And made a stupid mistake.
The abstinence-only approach to spreadsheets has been a failure. Clearly, we need harm-reduction.
4/
That's where "Data Organization in Spreadsheets" - @kara_woo and @kwbroman's 2017 paper in @AmstatNews - comes in. It lays out a crisp set of best practices for avoiding common errors, upping your CSV catastrophe game to really powerful mistakes!
* Be consistent: Don't use "Male," "male" and "m" as labels. Pick one
* Don't let trailing spaces creep in
* Use a consistent code for missing values (not a blank space and ESPECIALLY not a number like -99999)
6/
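Here's a minimal sketch of those three cleanup rules in plain Python. The label map, the list of missing-value markers, and the helper names `clean_cell` and `clean_sex` are all made up for illustration; none of them come from the paper, so adjust them to your own data.

```python
# Hypothetical cleanup helpers; every mapping below is an illustrative assumption.
SEX_LABELS = {"male": "male", "m": "male", "female": "female", "f": "female"}
MISSING_MARKERS = {"", "-", "n/a", "na", "-99999"}
MISSING = "NA"  # the single missing-value code we standardize on


def clean_cell(value: str) -> str:
    """Strip stray whitespace and map any missing-value marker to one code."""
    stripped = value.strip()
    if stripped.lower() in MISSING_MARKERS:
        return MISSING
    return stripped


def clean_sex(value: str) -> str:
    """Collapse 'Male', 'male ', 'm' and friends into one label."""
    cleaned = clean_cell(value)
    if cleaned == MISSING:
        return MISSING
    # Unknown labels raise KeyError so a human looks at them
    # instead of them being silently kept.
    return SEX_LABELS[cleaned.lower()]


print([clean_sex(v) for v in ["Male", "male ", "m", " ", "-99999"]])
```

Failing loudly on an unrecognized label is a design choice: a crash at cleanup time is cheaper than a quietly wrong category later.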
* Have a column for explanations about missing data (don't fill empty cells with explanations for their emptiness)
* Use consistent variable names and subject identifiers; treat as case-sensitive. No spaces!
* Lay out all your data consistently, in every file
7/
* Have a consistent (case-sensitive, no-spaces) filename convention. Do not tempt fate by calling a file "final" lest you have to pay penance with files named "final_ver2"
* Use YYYY-MM-DD for dates. No exceptions!
* Guard against trailing spaces in data!
8/
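The filename and date rules are easy to check mechanically. This is a sketch under stated assumptions: the accepted input formats (US-style month/day/year and the integer form), the lowercase-only filename pattern, and the function names are my own picks, not anything the paper prescribes.

```python
import re
from datetime import datetime

# Input formats we are willing to accept; this list is an assumption.
# "%m/%d/%Y" reads 10/14/2020 as 14 October, US-style.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")

# Hypothetical convention: lowercase letters, digits, _ and -, ending in .csv
FILENAME_RE = re.compile(r"[a-z0-9_-]+\.csv")


def to_iso_date(text: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if unrecognized."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def filename_ok(name: str) -> bool:
    """True if the name follows the convention and doesn't say 'final'."""
    return bool(FILENAME_RE.fullmatch(name)) and "final" not in name


print(to_iso_date("10/14/2020"))
print(filename_ok("mouse_weights_2020-10-14.csv"))
print(filename_ok("Final Data.csv"))
```

Note that a string like 03/04/2020 is ambiguous between conventions, which is exactly why the thread says YYYY-MM-DD with no exceptions.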
* Don't use special characters apart from _ and - in variables (avoid $, @, %, #, &, *, (, ), !, /, and other chars that have special meanings in some programming languages)
* Format cells as "Text" to keep Excel from turning things like gene-names ("Oct-4") into dates
9/
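Both of these can be illustrated with Python's standard `csv` module. The sample data and the name pattern below are invented for the example; the one thing worth noticing is that `csv.reader` hands every cell back as a string, so nothing gets reinterpreted as a date unless you convert it yourself.

```python
import csv
import io
import re

# Letters, digits, _ and - only; this pattern is one reading of the rule.
NAME_RE = re.compile(r"[A-Za-z0-9_-]+")


def bad_column_names(header):
    """Return the column names containing spaces or special characters."""
    return [name for name in header if not NAME_RE.fullmatch(name)]


# Made-up sample data with two badly named columns.
sample = "gene_name,expr level,cost$\nOct-4,3.2,10\n"
rows = list(csv.reader(io.StringIO(sample)))

print(bad_column_names(rows[0]))  # ['expr level', 'cost$']
print(rows[1][0])                 # Oct-4
```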
* Consider giving year, month and day their own columns to prevent Excel from munging them (or write as an integer: 20201014)
* Only put one piece of data in each cell; use column labels to indicate units (eg "45" not "45g")
* Only one row of variable names per sheet
10/
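If you've inherited a sheet that already has "45g"-style cells or single-column dates, splitting them is mostly a regex job. A rough sketch, assuming simple decimal numbers followed by an optional letters-only unit and dates that are already in YYYY-MM-DD form; the function names are hypothetical.

```python
import re

# A number, then an optional unit made of letters or %. This is an assumption
# about what the cells look like; scientific notation is not handled.
VALUE_UNIT_RE = re.compile(r"\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z%]*)\s*")


def split_value_unit(cell: str):
    """Split '45g' into ('45', 'g') so the unit can move to the column label."""
    match = VALUE_UNIT_RE.fullmatch(cell)
    if match is None:
        raise ValueError(f"not a value with optional unit: {cell!r}")
    return match.group(1), match.group(2)


def split_date(iso_date: str):
    """Split 'YYYY-MM-DD' into integer year, month and day columns."""
    year, month, day = iso_date.split("-")
    return int(year), int(month), int(day)


print(split_value_unit("45g"))
print(split_date("2020-10-14"))
```

Once every cell in a column yields the same unit, that unit belongs in the column name (say, `weight_g`) and only the number stays in the cell.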
* Maintain a separate "Data dictionary" file that defines every variable
* Datasets should not contain calculations; minimize how much typing you do in your dataset files lest you contaminate them inadvertently (calculations go in separate files)
11/
* Font colors and highlights are not data - put data in cells, not formatting (this gets lost in transitions)
* Back up multiple versions of your files, onsite and offsite
* Develop data validation tactics and regularly validate your data
12/
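One possible validation tactic is to make the data dictionary do double duty: it documents each variable and carries a check for it. The sketch below is an assumption-laden illustration, not the paper's method; the column names, rules, and the `validate` helper are all invented.

```python
import csv
import io


def is_positive_number(value: str) -> bool:
    """True if the text parses as a number greater than zero."""
    try:
        return float(value) > 0
    except ValueError:
        return False


# Hypothetical data dictionary: column name -> (description, check function).
DATA_DICTIONARY = {
    "subject_id": ("unique subject identifier", lambda v: v.isdigit()),
    "sex": ("male, female or NA", lambda v: v in {"male", "female", "NA"}),
    "weight_g": ("body weight in grams, or NA",
                 lambda v: v == "NA" or is_positive_number(v)),
}


def validate(csv_text: str):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    for name in header:
        if name not in DATA_DICTIONARY:
            problems.append(f"undocumented column: {name!r}")
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(reader, start=2):
        for name in header:
            if name not in DATA_DICTIONARY:
                continue
            _description, check = DATA_DICTIONARY[name]
            value = row.get(name)
            if value is None or not check(value):
                problems.append(f"line {line_no}: bad {name}: {value!r}")
    return problems


# Made-up sample with an inconsistent label, a unit in a cell, a trailing space.
sample = "subject_id,sex,weight_g\n1,male,45\n2,m,45g\n3,female ,NA\n"
for problem in validate(sample):
    print(problem)
```

Running something like this every time the file changes is what turns "validate regularly" from a resolution into a habit.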
* Use CSV, not xlsx, as your canonical file-format - good old hard-to-corrupt text, flensed of all the foofaraw that Microsoft likes to insert at random intervals
Lurking behind every one of these tips is a postmortem on a data-tragedy. Ignore them at your peril.
eof/