Thread: After @icsjournal's "APIcalypse" issue & #IC2S2 2019, it's clear many are asking what the future holds for social media data. After working on privacy & data access at FB for over a year, I have thoughts. Thread ends w/ a little-known source for FB page data, so read to the end.
In 2011, @seanjwestwood & I ran an (IRB-approved) study using Facebook's Graph API to analyze participants' entire ego networks on the fly, have strong vs. weak ties endorse an experimental post (the stimulus), then re-render participants' News Feeds. Those days are over.
The API was meant for developers to build on top of the social graph, but approvals were friendly to researchers. The scope of the data was startling, & thankfully Sean had the foresight to delete everything beyond what was necessary for analysis & publication.
Times have changed: OPM, MyLife, Equifax, Cambridge Analytica. Privacy risks are difficult to anticipate & secure against, especially for complex, relational data. & CA showed the world that APIs are open to nefarious schemes & basic research alike.
Data privacy is in the air in newsrooms & capitals across the world. As privacy advocates devise strategies to protect people from identity theft, scams, information abuse & more, social scientists advocating for research often aren't at the table.
Amid these cross-currents, privacy legislation & regulation present real challenges for research communities. No company can shrug off a $5 bn fine for being too permissive w/ data--data collected via APIs in the name of social science research.
What's more, under GDPR, meaningful, socially beneficial independent social science research often maps onto the generally prohibited processing of "3rd-party sensitive-category data collected without consent." iapp.org/news/a/how-gdp…
Now, the GDPR's research exemption carves out a protected legal space for that! BUT--only if the data were originally *collected for research purposes.* The data everyone cares about was collected for business. What's left? Case-by-case opt-in consent--with severe confounds, overhead, paperwork, etc.
Another path is if the data in question are anonymous. BUT GDPR does not define anonymity the way, say, HIPAA does (PII removed). Instead, data are anonymous if they cannot "reasonably" be re-identified. Sounds great, until you get into the details: pdpjournals.com/docs/88197.pdf
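To make those details concrete, here's a toy sketch of the classic linkage attack (all names & records are made up): strip the PII, & quasi-identifiers like ZIP + birthdate can still re-identify people via a join against a public record.

```python
# Toy linkage attack: "anonymized" records re-identified by joining on
# quasi-identifiers (ZIP + birthdate). All data below is fabricated.

deidentified = [  # PII removed, but quasi-identifiers remain
    {"zip": "02138", "birthdate": "1975-07-31", "diagnosis": "hypertension"},
    {"zip": "02139", "birthdate": "1982-01-15", "diagnosis": "diabetes"},
]

voter_roll = [  # public record that still carries names
    {"name": "Alice Smith", "zip": "02138", "birthdate": "1975-07-31"},
    {"name": "Bob Jones",   "zip": "02139", "birthdate": "1982-01-15"},
]

# Join the two tables on the quasi-identifiers to recover identities.
for rec in deidentified:
    for voter in voter_roll:
        if (voter["zip"], voter["birthdate"]) == (rec["zip"], rec["birthdate"]):
            print(voter["name"], "->", rec["diagnosis"])
```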
Differential privacy provides a theoretical framework in which firm, provable guarantees can be made. BUT those guarantees are probabilistic. What does "reasonably protected from re-identification" mean if re-identification is always possible to some degree?
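For intuition, here's a minimal sketch of the Laplace mechanism, the textbook DP building block (epsilon values illustrative, not anyone's production system). Noise calibrated to sensitivity/epsilon bounds what a released statistic can reveal about any one person--but the bound is probabilistic, never absolute.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with epsilon-differential privacy (Laplace mechanism).

    Adding or removing one person changes a count by at most 1 (the
    sensitivity), so Laplace noise with scale sensitivity/epsilon limits
    how much the output distribution depends on any individual.
    """
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Smaller epsilon = stronger privacy, noisier answer. Every output remains
# *possible* whether or not you are in the data -- hence "probabilistic."
print(laplace_count(1000, epsilon=0.1))   # very noisy
print(laplace_count(1000, epsilon=10.0))  # close to 1000
```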
GDPR was not drafted to make things difficult for social media researchers. Rather, it may not have been crafted with this kind of research in mind. We are now grappling with likely unintended consequences.
What can be done? Global companies follow applicable law & regulation. Researchers need a seat at the policy-making table to incentivize corporate sharing for basic research. It may even be as simple as removing some of the downside risk.
People making decisions need to understand the societal value of social science. So keep sending work to @monkeycageblog & @MisOfFact, weigh in on Twitter, talk to journalists, & present to policy-makers.
And learn about privacy -- as social scientists, most of us don't have the training to deeply understand the issues & participate in the debate. So read up on differential privacy.
Unfortunately, DP introduces noise, requires hard-to-come-by expertise, & is suited to only a limited set of questions. The US Census Bureau has worked on DP for nearly a decade & has concluded it is not currently feasible for the ACS census.gov/content/dam/Ce…
And if you’ve read this far to the bitter end, here’s a tip. If you've been negatively affected by the pages API restrictions (e.g.,
🧵 The study everyone here is talking about does NOT provide evidence that Twitter/X pushed a pro-Republican home-timeline ranking change in July 2024.
The cascading, multiplicative effects of ranking changes likely explain the effect--details below
1. It's very likely that *something* in Twitter/X's home-timeline ranking system (HTRS) changed in July 2024.
2. It's hard to see, but ALL accounts in the study saw increased visibility during that timeframe. There were only 5 Dem and 5 Rep accounts, which should raise eyebrows, but let's set that aside for now.
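Here's a toy simulation (purely illustrative: not the study's model & not X's actual system) of that compounding. A small, uniform ranking boost applied to ALL accounts feeds back on itself day after day, so everyone's visibility rises while absolute gaps between accounts balloon.

```python
import numpy as np

rng = np.random.default_rng(0)
# 10 hypothetical accounts with heterogeneous baseline engagement rates.
growth = 0.01 + rng.normal(0, 0.02, size=10)

def simulate(days: int, boost: float) -> np.ndarray:
    """Each day's visibility feeds into the next day's ranking, so a uniform
    `boost` given to ALL accounts compounds multiplicatively."""
    impressions = np.full(10, 1000.0)
    for _ in range(days):
        impressions *= 1 + growth + boost
    return impressions

base    = simulate(days=30, boost=0.00)
boosted = simulate(days=30, boost=0.02)
print((boosted / base).round(2))   # similar ~1.8x relative gain for everyone...
print((boosted - base).round(0))   # ...but wildly divergent absolute gains, so
# high-engagement accounts can look like they were singled out for a boost.
```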
THREAD: TikTok's "algorithm" is magic and without it contemporary internet culture would simply melt away. Right?
What really happens if Douyin & TikTok split, leaving the latter without "the algorithm"?
The media talks about TikTok's algorithm as if it's the company's secret sauce. Now, we don't technically know exactly how it works, but here's the thing: these recommender systems -- Facebook's, Instagram's, & especially Reels' -- are all quite similar under the hood.
Most of what they do is rank a set of candidate content for a user based on predicted engagement. The key difference is that TikTok (& Reels) add one more killer ingredient to the mix of engagement signals: time spent watching *this* video.
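In stripped-down Python (a generic sketch of the recipe; the weights & signal names are hypothetical, not TikTok's actual model), the core ranking step looks something like this:

```python
from dataclasses import dataclass

@dataclass
class Signals:
    p_like: float     # predicted probability the user likes the video
    p_share: float    # predicted probability the user shares it
    p_comment: float  # predicted probability the user comments
    exp_watch: float  # predicted seconds of watch time -- the killer signal

def score(s: Signals) -> float:
    # Hypothetical weights; production systems learn them. The structure --
    # a weighted sum of predicted engagement -- is the common core.
    return 1.0 * s.p_like + 2.0 * s.p_share + 1.5 * s.p_comment + 0.5 * s.exp_watch

def rank(candidates: dict[str, Signals]) -> list[str]:
    """Order candidate videos for one user by predicted engagement."""
    return sorted(candidates, key=lambda v: score(candidates[v]), reverse=True)

print(rank({
    "cat_video":   Signals(p_like=0.30, p_share=0.05, p_comment=0.02, exp_watch=12.0),
    "dance_video": Signals(p_like=0.10, p_share=0.01, p_comment=0.01, exp_watch=45.0),
}))  # watch time dominates: ['dance_video', 'cat_video']
```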
1/ Many said Science went overboard on its cover for those Meta studies.
Science published my take yesterday.
It shows Meta’s algorithms actually tended NOT to increase ideological segregation in general, at least in 2020 🧵
2/ Key point 1: González-Bailón et al. 2023 claim newsfeed ranking increases ideological segregation (Fig 2B). BUT that's based on a domain-level analysis. The URL-level analysis (Fig 2C) shows *no difference* in ideological segregation before and after feed ranking.
3/ So what? We should strongly prefer the URL-level analysis. Domain-level analysis effectively mislabels highly partisan content as "moderate/mixed," especially on sites like YouTube, Reddit, and Twitter (aggregation bias / the ecological fallacy).
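A toy example of the aggregation problem (scores & URLs made up): average ideology at the domain level, & a platform hosting both far-left & far-right content comes out "moderate," erasing exactly the segregation being measured.

```python
from statistics import mean

# Hypothetical ideology scores: -1 = very liberal, +1 = very conservative.
urls = {
    "youtube.com/left_channel":  -0.9,
    "youtube.com/right_channel": +0.9,
    "partisanblog.com/post1":    +0.8,
    "partisanblog.com/post2":    +0.9,
}

# Domain-level analysis collapses every URL on a domain to one average score.
domains: dict[str, list[float]] = {}
for url, score in urls.items():
    domains.setdefault(url.split("/")[0], []).append(score)

for domain, scores in domains.items():
    print(domain, round(mean(scores), 2))
# youtube.com -> 0.0: two highly partisan channels get labeled "moderate/
# mixed," while URL-level analysis keeps each channel's slant. Mixed-content
# platforms (YouTube, Reddit, Twitter) are where the mislabeling concentrates.
```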
I wrote about The Algorithm: using Musk's metrics in ship decisions, what the Republican/Democrat code means for democracy, how Twitter's API price increases undermine transparency efforts, & the tech bros claiming to analyze it 'so you can go viral.'
What does this mean for transparency? You need algorithmic audits to really understand what's happening on Twitter, & the recent API price increases--to $500k/yr for meaningful access--have made it incredibly difficult for researchers to audit this code.
Also, Twitter is not downranking tweets about Ukraine. The label relates only to crisis misinformation, per Twitter's Crisis Misinformation Policy, & this code specifically governs Spaces, not ordinary tweets in the home timeline.
Pundits and media commentators often assume large campaign effects, while many past studies find extremely small effects, often indistinguishable from zero. Measuring effects of the billions spent on political ads is one of the most significant challenges in the social sciences.
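A back-of-the-envelope power calculation (the textbook two-sample formula; effect sizes illustrative) shows why: detecting a 0.01-SD effect at conventional thresholds takes on the order of 150k subjects per arm.

```python
from scipy.stats import norm

def n_per_arm(effect_sd: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Two-sample size: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return round(2 * z**2 / effect_sd**2)

for d in (0.20, 0.05, 0.01):
    print(f"effect = {d:.2f} SD -> n ≈ {n_per_arm(d):,} per arm")
# effect = 0.20 SD -> n ≈ 392 per arm
# effect = 0.05 SD -> n ≈ 6,279 per arm
# effect = 0.01 SD -> n ≈ 156,978 per arm
```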
Important proposal from @persily. I was on "the other side of the table" at Facebook when Nate was working on SS1, though we almost always agreed about the right way to share data. There's a lot to like here for policymakers, researchers, *and platforms* (brief thread)
First, this straight-up exempts university-affiliated researchers from liability for scraping data for IRB-blessed projects. It's important to enshrine this in law to (1) protect researchers and (2) make crystal clear that this work is normatively *good* for the world.
This kind of research ought to be given safe harbor, and platforms ought not to discourage it.