Looking forward to this! CS people, seminar starts 14:00 in 1.4
Here @pgroth being introduced by @CaroleAnneGoble for his seminar "Flexible & Transparent Data Reuse" at @csmcr
Topics today: the problem of data reuse. Then data integration, data munging, data wrangling. Data provenance: lowering the barriers to understanding how data is made. Combining these to make data reuse transparent.
Making data is hard. Quoting @karpathy and link.medium.com/srrJhEl5bS
Even very simple data processing becomes messy
"Provenance of data - data came from this email! But where did Mike get the data from? It turns out it was from me! But where did I get it from? From the Internet!'
Data interpretation on whiteboard - pretty much made it straight to the paper. Imagine the complexity of expressing how this fairly simple data munging happened!
Here @gregory_km's recent research shows that most of the time people do not find the data they need, or can't understand it. To understand the data they need the people who made it!
Paper fresh off the press: doi.org/10.1177%2F0165…
Citing @dourish on materialities of information and Spreadsheet Events like seh.ox.ac.uk/news/the-case-… by @LeKissick
Bottlenecks of making data:
1. Manual
2. Difficulty in creating flexible/reusable workflows
3. Lack of transparency
doi.org/10.1109/MIS.20… and doi.org/10.1109/MIS.20…
Quick note from @pgroth - @INDE_LAB_AMS is growing (and hiring!)
Finding data - is it just searching?
arxiv.org/abs/1707.06937
@gregory_km
Integration of data into workflows has additional requirements, like running structured queries.
@Open_PHACTS @CHChichester
Building a knowledge graph. kgtutorial.github.io
Problem is that it is difficult to build a clean data graph
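(A minimal sketch, not from the talk: what "building a knowledge graph" and then running a structured query over it might look like with Python's rdflib. All URIs and facts are invented for illustration.)

```python
# A tiny entity graph: every mention of an entity resolves to one node,
# so the query below already "knows" which node it is asking about.
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.BarackObama, RDF.type, EX.Person))
g.add((EX.BarackObama, EX.bornIn, EX.Honolulu))
g.add((EX.Honolulu, RDF.type, EX.City))

# A structured query over the graph (the "integration" requirement above)
for person, city in g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?person ?city WHERE { ?person a ex:Person ; ex:bornIn ?city . }
"""):
    print(person, city)
```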
Or just skip all of that! Just use machine learning to auto-generate the data from a query. arxiv.org/abs/1811.06303
Only answer simple fragments. linkeddatafragments.org/concept/ #LD
Use a state-of-the-art QA architecture, add lexicalized triples (?), pose queries over the inputs, then inspect.
Built a model over every predicate, and it worked pretty well. Learning to answer triples against the data.
Now onto data provenance.
This example is from globalchange.gov reporting on climate change. Where did it get the data from? What science is behind it?
Using PROV model
PROV used by @usgcrp for globalchange.gov - you can do API queries against a figure in the report, see which agency it came from, which people were involved, which software was used - all in a standardised format.
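(A minimal sketch of what such PROV records look like, using the Python prov package. The figure, agency and activity names are invented, not the actual @usgcrp records.)

```python
# Build a small PROV document: an entity (a report figure), an agent
# (the agency behind it), and an activity (the software run that made it).
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

figure = doc.entity("ex:figure-2.1")        # hypothetical report figure
agency = doc.agent("ex:some-agency")        # hypothetical agency
plotting = doc.activity("ex:plotting-run")  # hypothetical software run

doc.wasGeneratedBy(figure, plotting)
doc.wasAssociatedWith(plotting, agency)
doc.wasAttributedTo(figure, agency)

print(doc.get_provn())  # serialise in the standard PROV-N notation
```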
But how do we make it easier to generate such provenance?
Software methodology - but developers still have to do the work of instrumenting their own software. It's quite hard, so nobody does it.
Provenance becomes interesting only when something goes wrong or when you need to delve into the details.
How to lower the barrier of capturing provenance?
One aspect is reexecution.
Don't need to record everything if you have fixed data and remember the query.
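(A sketch of that idea, under the assumption of fixed input data: store just the query plus a hash of the dataset, and re-derive full provenance later by re-running. All function names here are invented for illustration.)

```python
# Minimal provenance-by-reexecution: if the data is fixed, the query
# plus a dataset hash is enough to reproduce the run on demand.
import hashlib

def record_minimal_provenance(query: str, dataset_path: str) -> dict:
    with open(dataset_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {"query": query, "dataset_sha256": digest}

def replay(record: dict, dataset_path: str, run_query):
    # Verify the data really is unchanged before trusting the replay
    with open(dataset_path, "rb") as f:
        assert hashlib.sha256(f.read()).hexdigest() == record["dataset_sha256"], \
            "dataset changed; replay would not reproduce the original run"
    # Re-execute, optionally with extra instrumentation switched on
    return run_query(record["query"])
```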
Can you apply this to the whole operating system?
Record-and-replay with the PANDA VM: every instruction is recorded, then you re-execute, adding instrumentation. You don't need to know upfront which provenance instrumentation you need - lightweight or super-detailed, chosen post-hoc.
Where do we go next?
Can we merge the knowledge in data provenance to help us with data reuse? Particularly making data and reducing the degrees of freedom while increasing transparency.
Can we track lots of people and what they do with their data? Get tons of these provenance traces, then learn from those how to do data integration? Reducing the barrier to building in transparency.
Now @pgroth is collecting as much provenance data as possible to build machine learning models and see if this is possible.
Need transparency but also flexibility. With high-resolution provenance we can use that to learn how to help people do better data munging.
Q from @bparsia: In querying, it is trivial to make data into linked data. Generating data with certain properties is what is difficult. How does the knowledge graph add anything?
How much "better"?
Knowledge Graphs are hipster versions of linked data. You tend to resolve all your entities in an entity graph. When you write the query you already know who Obama is.
(..)
Getting clean data out at the end *is* difficult.
@bparsia: How much better is the knowledge graph?
"A lot better! "
Q from @jaspkoehorst: Automatic generation of knowledge graphs?
Easier to write a SPARQL query than transforming Excel spreadsheet to RDF.
A: Many ways to extract from semi-structured data, like Snorkel.
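(A sketch of the spreadsheet-to-RDF step the question contrasts with: a few lines of csv + rdflib. Column names, file name and URIs are invented for illustration.)

```python
# Turn a tabular file into RDF triples: one resource per row,
# one triple per column value.
import csv
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")

g = Graph()
with open("samples.csv", newline="") as f:
    for row in csv.DictReader(f):   # e.g. columns: id, organism
        sample = EX[f"sample-{row['id']}"]
        g.add((sample, RDF.type, EX.Sample))
        g.add((sample, EX.organism, Literal(row["organism"])))

print(g.serialize(format="turtle"))
```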