Health services research using United States cancer databases
Here is everything you want to know about @theNCI SEER, @AmericanCancer @AmCollSurgeons NCDB, and newer claims databases for clinical research in oncology
🧵
First, many thanks to these great people for helping me with the material
Retrospective databases are ideal for certain types of questions related to epidemiology, staging, rare diseases, quality, prognostication, prediction, and some "real world evidence / data"
However, we should be cautious in using these databases for (1) comparative effectiveness research, and (2) comparing outcomes of patients today vs a prior era
(1) These databases are not meant for comparative effectiveness research, ie evaluating tx A vs B.
If you're considering it, send your data to me and @wedney2017 and we will show you how you can get any answer you want: A>B, B>A, A=B😅
For example, there are no data on the #1 diagnosed cancer in the US, basal/squamous cell skin ca. Most of these cancers are extirpated, frozen, or desiccated by PCPs and dermatologists, so we can't get a reliable numerator/denominator on cases.
Some have questioned coding reliability, and there have been years where coding changes impacted the database, though these were corrected.
RT has been taken out of the core variable pack bc it is difficult to capture when the pt had surg at one center and then went on to get RT closer to home.
Generally, SEER has high quality data. It undergoes QA and audits by qualified professionals, adhering to 2 basic principles:
auditing high quantity data (eg, breast ca)
auditing high risk data (eg, new staging system)
SEER also excels because it provides ICD cause of death, which is not present in NCDB (or any? claims database). However, coding cause of death is difficult.
For James Bond, you only live twice.
For SEER, you only die once (i.e., there is only 1 cause). seer.cancer.gov/codrecode/1969…
Cause of death comes from death certificates, from the physician caring for the patient at time of death.
When you access SEER, there are different "sessions" you can use.
"Case listing" is the session that most people would be familiar with.
To run the session:
1, File, New
2, Data: pick the SEER registry you want (some have diff yrs, variables)
3, Selection: select specific cancer pts
4, Data: select the variables you want as columns. More is better than fewer here
5, Lightning bolt executes
Here are data.
Ctrl-C: copy cells, then paste into program, eg Excel.
Ctrl-R: copy session info (EC/IC). Paste this in another tab in Excel.
Most journals want to know EC/IC so others can replicate your work.
Save the .SL + .SLM files too, in case you want to reopen in SEER.
You can do the same with an SIR (standardized incidence ratio) session to get observed/expected events, 95% CIs, person-years at risk, and mean age at event.
SIR is similar to relative risk. The denominator (expected) comes from the general US population (cancer + non-cancer pts).
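To make the observed/expected idea concrete, here is a minimal sketch of an SIR with a 95% CI. It uses Byar's approximation to the exact Poisson interval, which is an assumption on my part; the interval SEER*Stat reports may be computed differently. The counts are made up for illustration.

```python
import math

def sir_with_ci(observed, expected, z=1.96):
    """Return (SIR, lower, upper) for observed vs expected event counts.

    The interval uses Byar's approximation to the exact Poisson limits.
    This is an illustrative choice, not necessarily what SEER*Stat does.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    sir = observed / expected
    if observed == 0:
        lower = 0.0
    else:
        lower = observed * (1 - 1 / (9 * observed) - z / (3 * math.sqrt(observed))) ** 3 / expected
    o1 = observed + 1
    upper = o1 * (1 - 1 / (9 * o1) + z / (3 * math.sqrt(o1))) ** 3 / expected
    return sir, lower, upper

# Made-up counts: 30 events observed where 20 were expected from population rates.
sir, lo, hi = sir_with_ci(observed=30, expected=20.0)
print(f"SIR = {sir:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

If the interval excludes 1, the cohort had more (or fewer) events than the general population would predict.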
Here is how to get SIR data for specific cause of death
1, New, MP-SIR
2, Database selection
3, Rates, Selection: you can probably leave as is
4, Parameters: select follow up time latencies
5, Events: what COD do you want?
6, Statistic: leave alone
7, Table: what do you want table to look like?
8, Lightning bolt
9, Getting data...
10, Completed analysis
If your worksheet comes up with all 0s, it's bc you didn't select COD in the dropdown on the last screen.
Questions you can and cannot answer with @theNCI SEER
SEER can be linked to different databases.
SEER Medicare is a popular option.
Here they are juxtaposed.
NCDB is like a collection of case listing files that you would have seen in SEER. Each file is specific to a disease site. You apply for select sites and they are sent to you. For larger questions, you can merge files PRN.
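Merging site-specific files can be sketched like this. The column names (case_id, facility_id, age) and the tiny inline files are hypothetical stand-ins, not the real NCDB layout; check the data dictionary for the actual variable names before adapting it.

```python
import csv
import io

# Two tiny stand-ins for site-specific files. Column names are hypothetical.
breast_csv = "case_id,facility_id,age\nB1,F01,61\nB2,F02,47\n"
lung_csv = "case_id,facility_id,age\nL1,F01,70\n"

def read_site_file(text, site):
    """Parse one site file and tag every row with the disease site it came from."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["site"] = site
    return rows

# Stack the files into one case listing, keeping track of the source site.
merged = read_site_file(breast_csv, "breast") + read_site_file(lung_csv, "lung")

cases_per_site = {}
for row in merged:
    cases_per_site[row["site"]] = cases_per_site.get(row["site"], 0) + 1
print(cases_per_site)
```

Tagging each row with its site before stacking keeps the merged file traceable back to the original disease-site file.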
NCDB is focused on treatment quality.
NCDB has much more information than SEER about treatment, including surgery, systemic therapy, radiation therapy.
NCDB states the data are hospital-based, not population-based. The SEER processes to ensure representation of minorities are not necessarily in place.
Data come from CoC-accredited facilities, which capture ~70% of newly diagnosed cancer cases in the US. Other caveats re data similar to SEER exist.
For reference, here is what the data in NCDB looks like.
One variable you will notice immediately is the facility ID, ie, the place where the pt was treated. It is neither possible nor allowed to decode it to a specific facility name.
Part III of this tweetorial: comparing SEER vs NCDB
SEER has greater focus on epidemiology, incidence, mortality, cause of death.
NCDB has greater focus on surveillance, treatment, quality.
The data dictionaries for SEER and NCDB are available online:
It would be great if SEER and NCDB could next integrate these variables, many of which are already commonly collected at time of consultation with an oncologist.
Part IV: Claims databases for health services research
One of the most popular new claims databases is MarketScan, which includes ICD, CPT, HCPCS codes.
MarketScan covers >80M patients, is not specific to oncology, and includes private insurance (ie, pts < 65 yo).
MarketScan + SEER allow us to estimate the cost of cancer care in the United States
Here are some other projects you can and cannot do with MarketScan.
One of my favs: classification of common human diseases derived from shared genetic and environmental determinants @NatureGenet nature.com/articles/ng.39…
Similarly, TriNetX is a claims database that can be used in oncology.
Data collection in SEER, NCDB, hospital databases has "classical" formatting. There is basically just 1 time point (at diagnosis) with covariates. There is a variable that provides time until last follow up and vital status. A ton of data are missing.
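For a feel of what this "classical" one-row-per-patient format supports, here is a minimal Kaplan-Meier sketch built only from follow-up time and vital status. The patient records are made up, and real analyses would use an established survival package rather than this hand-rolled version.

```python
def kaplan_meier(records):
    """Estimate survival from (months_followed, died) pairs.

    Returns a list of (time, survival_probability) at each observed death time.
    Patients with died=False are treated as censored at last follow-up.
    """
    event_times = sorted({t for t, died in records if died})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for time, _ in records if time >= t)
        deaths = sum(1 for time, died in records if time == t and died)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Made-up records: (months until last follow-up, vital status dead?).
patients = [(6, True), (12, False), (18, True), (24, False), (30, True)]
curve = kaplan_meier(patients)
for months, surv in curve:
    print(f"{months:>3} mo: {surv:.2f}")
```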
Claims databases may provide many more time points with data. Soon, we may also be able to integrate text, images, etc.
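By contrast, claims data are "long": many dated rows per patient. A minimal sketch of turning claim rows into a per-patient timeline follows. The tuple layout and the code values are hypothetical placeholders, not a real MarketScan or TriNetX schema.

```python
from collections import defaultdict
from datetime import date

# Hypothetical claim rows: (patient_id, service_date, code_type, code).
claims = [
    ("P1", date(2020, 3, 15), "HCPCS", "drug-code"),
    ("P1", date(2020, 1, 10), "ICD", "dx-code"),
    ("P1", date(2020, 2, 3), "CPT", "procedure-code"),
    ("P2", date(2021, 5, 1), "ICD", "dx-code"),
]

# Group claims by patient in chronological order.
timelines = defaultdict(list)
for patient_id, service_date, code_type, code in sorted(claims, key=lambda c: (c[0], c[1])):
    timelines[patient_id].append((service_date, code_type, code))

# Days elapsed from each patient's first claim to every later claim.
days_from_first = {
    pid: [(d - events[0][0]).days for d, _, _ in events]
    for pid, events in timelines.items()
}
print(days_from_first)
```

Each patient now has many time points instead of a single snapshot at diagnosis.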
These databases are ideal for analysis with AI/ML.