Nicholas Zaorsky, MD MS
Vice Chair, Ed, @RadoncUH | Tenured Associate Prof, #RadOnc @cwru | Director, GU Onc, E Cleveland | Prostate, kidney, metastasis, skin, public health |

Feb 28, 2022, 49 tweets

Health services research using United States cancer databases

Here is everything you want to know about @theNCI SEER, @AmericanCancer @AmCollSurgeons NCDB, and newer claims databases for clinical research in oncology

🧵

First, many thanks to these great people for helping me with the material

Retrospective databases are ideal for certain types of questions related to epidemiology, staging, rare diseases, quality, prognostication, prediction, and some "real world evidence / data"

However, we should be cautious in using these databases for (1) comparative effectiveness research, and (2) comparing outcomes of patients today vs a prior era

(1) These databases are not meant for comparative effectiveness research, ie evaluating tx A vs B.

If you're considering it, send your data to me and @wedney2017 and we will show you how you can get any answer you want: A>B, B>A, A=B😅

(2) These databases are not meant to compare outcomes (via KM plots) over major eras.

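The KM plots are often affected by the Will Rogers phenomenon (stage migration): when better imaging or a new staging system shifts pts into a higher stage, survival can look better in every stage even though no individual pt's outcome changed.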
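A toy illustration of the Will Rogers phenomenon (stage migration). The survival times below are made up for the example and are not drawn from SEER, NCDB, or any other database; the stage labels are simplified assumptions.

```python
from statistics import mean

# Hypothetical survival times in months (illustrative only).
early_stage = [40, 50, 60, 70]
late_stage = [10, 20, 30]


def summarize(early, late):
    """Return mean survival for early stage, late stage, and all pts combined."""
    return mean(early), mean(late), mean(early + late)


before = summarize(early_stage, late_stage)

# A newer era with better imaging "upstages" the early-stage pt with the
# shortest survival. Nobody's actual survival time changes.
migrated = min(early_stage)
early_after = [t for t in early_stage if t != migrated]
late_after = late_stage + [migrated]

after = summarize(early_after, late_after)

print("before (early, late, overall):", before)
print("after  (early, late, overall):", after)
```

Mean survival rises in both stage groups while the overall mean stays the same, which is why stage-by-stage comparisons across eras can mislead.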

Here are the trends in publications using these databases.

SEER and NCDB make up the majority of oncology health services research.

@theNCI SEER contains information on ~1/3 of cancer cases in the US, with data going back to 1973. Data come from minority-enriched geographic areas.

You can get data here:
seer.cancer.gov/data/access.ht…

Do the tutorials here:
seer.cancer.gov/seerstat/tutor…

SEER has awesome data, includes US census info (so proportions, risks can be calculated), and it continues to evolve
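Having a census denominator is what makes rates possible. A minimal sketch of a crude incidence rate per 100,000, assuming made-up counts; the function name is hypothetical, and this is not how SEER*Stat computes its rates (which are typically age-adjusted).

```python
def crude_rate_per_100k(cases: int, population: int) -> float:
    """Crude rate: cases divided by population at risk, scaled to 100,000."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 100_000 / population


# Hypothetical numbers: 250 new cases in a population of 1,000,000.
rate = crude_rate_per_100k(cases=250, population=1_000_000)
print(f"{rate:.1f} per 100,000")
```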

Two great papers about strengths and weaknesses of SEER from @HenryParkMD @jamesbyu
pubmed.ncbi.nlm.nih.gov/22481006/
pubmed.ncbi.nlm.nih.gov/22481009/

1, you could calculate incidence and mortality data on specific cancers since the 1970s

2, you can evaluate risk of death from a particular cause of death, eg stroke

@NatureComms

3, you can evaluate epidemiology of a particular disease state, eg metastasis

On the other hand, SEER has limitations.

For example, there are no data on the #1 diagnosed cancer in the US, basal/squamous cell skin ca. Most of these cancers are extirpated, frozen, or desiccated by PCPs and dermatologists. We can't get a reliable numerator/denominator on cases.

Some have questioned coding reliability, and there have been years where coding changes impacted the database, though these were corrected.

RT has been taken out of the core variable pack bc it is difficult to capture when a pt has surgery at one center and then goes on to get RT closer to home.

Generally, SEER has high quality data. It undergoes QA and audits by qualified professionals, adhering to 2 basic principles:
1. auditing high quantity data (eg, breast ca)
2. auditing high risk data (eg, new staging system)

seer.cancer.gov/qi/tools/casef…

You can put in a separate data request to access the treatment variables, eg radiation, chemo.

SEER has you sign a separate form stating you understand the limitations of these variables: >85% of cases have correct treatment info.

seer.cancer.gov/data-software/…

SEER also excels because it provides ICD cause of death, which is not present in NCDB (or any? claims database). However, coding cause of death is difficult.

For James Bond, you only live twice.
For SEER, you only die once (i.e., there is only 1 cause).
seer.cancer.gov/codrecode/1969…

Cause of death comes from death certificates, from the physician caring for the patient at time of death.

Here is a blurb from @StoltzfusKelsey paper:
ncbi.nlm.nih.gov/pmc/articles/P…
ncbi.nlm.nih.gov/pmc/articles/P…
ncbi.nlm.nih.gov/pmc/articles/P…

When you access SEER, there are different "sessions" you can use.

"Case listing" is the session that most people would be familiar with.

To run the session:
1. File, New
2. Data: SEER registry you want (some have diff yrs, variables)
3. Selection: select specific cancer pts
4. Data: select variables you want for columns. More is better than fewer here
5. Lightning bolt executes

Here are the data.

Ctrl-C: copy cells, then paste into program, eg Excel.
Ctrl-R: copy session info, ie the exclusion/inclusion criteria (EC/IC). Paste this in another tab in Excel.

Most journals want to know EC/IC so others can replicate your work.

Save the .SL + .SLM files too, in case you want to reopen in SEER.

You can do the same with an SIR (standardized incidence ratio) session to get observed/expected events, 95% CIs, person-years at risk, mean age at event

SIR is similar to relative risk. The denominator (expected) comes from the general US population (cancer + non-cancer pts).
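To make the observed/expected idea concrete, here is a minimal sketch of the ratio with an approximate 95% CI. It uses Byar's approximation to the Poisson interval, which is my choice for the illustration; I am not claiming this is the method SEER*Stat uses. The function name and the counts are hypothetical.

```python
import math


def sir_with_ci(observed: int, expected: float, z: float = 1.96):
    """Return (SIR, lower, upper) using Byar's approximation for the CI.

    observed: events counted in the cancer cohort.
    expected: events expected from general-population rates.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    sir = observed / expected
    if observed == 0:
        lower = 0.0
    else:
        lower = (
            observed
            * (1 - 1 / (9 * observed) - z / (3 * math.sqrt(observed))) ** 3
            / expected
        )
    o1 = observed + 1
    upper = o1 * (1 - 1 / (9 * o1) + z / (3 * math.sqrt(o1))) ** 3 / expected
    return sir, lower, upper


# Hypothetical: 30 deaths observed vs 20 expected in the general population.
sir, lo, hi = sir_with_ci(observed=30, expected=20.0)
print(f"SIR {sir:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

With these made-up numbers the SIR is 1.5 and the interval runs from about 1.01 to about 2.14, so it just excludes 1.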

Here is how to get SIR data for specific cause of death

1, New, MP-SIR
2, Database selection
3, Rates, Selection you can probably leave as is
4, Parameters: select follow up time latencies
5, Events: what COD do you want?

6, Statistic: leave alone
7, Table: what do you want table to look like?
8, Lightning bolt
9, Getting data...
10, Completed analysis

If your worksheet comes up with all 0s, it's bc you didn't select COD in the dropdown on the last screen.

Questions you can and cannot answer with @theNCI SEER

SEER can be linked to different databases.
SEER Medicare is a popular option.
Here they are juxtaposed.

Thank you to the @ACS_Research @AmericanCancer @AmColSurgCancer for providing this amazing resource.

NCDB is like a collection of case listing files that you would have seen in SEER. Each file is specific to a disease site. You apply for select sites and they are sent to you. For larger questions, you can merge files PRN.
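Merging site files can be as simple as stacking them and tagging each row with its source. A minimal sketch using only the standard library; the column names (`case_id`, `age`, `vital_status`), the inline CSV text, and the `stack_site_files` helper are all hypothetical stand-ins, not the actual NCDB file layout.

```python
import csv
import io

# Hypothetical stand-ins for two site-specific files. Real files have many
# more columns and their own variable names.
prostate_csv = "case_id,age,vital_status\nP1,64,1\nP2,71,0\n"
kidney_csv = "case_id,age,vital_status\nK1,58,1\n"


def stack_site_files(files: dict[str, str]) -> list[dict[str, str]]:
    """Concatenate per-site tables, adding a 'site' column to every row."""
    merged = []
    for site, text in files.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["site"] = site
            merged.append(row)
    return merged


merged = stack_site_files({"prostate": prostate_csv, "kidney": kidney_csv})
for row in merged:
    print(row)
```

Keeping the site label on each row makes it possible to split the merged table back out or adjust for site later.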

NCDB is focused on treatment quality.
NCDB has much more information than SEER about treatment, including surgery, systemic therapy, radiation therapy.

NCDB states the data are hospital-based, not population-based. The SEER processes to ensure representation of minorities are not necessarily in place.
Data come from CoC-accredited facilities, which capture ~70% of newly diagnosed cancer cases in the US. Other caveats re data similar to SEER exist.

One concern w NCDB is that many patients have missing data, and patients with missing data may have worse outcomes.
@JAMANetworkOpen
jamanetwork.com/journals/jaman…

Questions you can and cannot answer with NCDB:

One of my favorite projects:

For reference, here is what the data in NCDB looks like.

One variable you will notice immediately is the facility ID, ie, the place where the pt was treated. It is neither possible nor allowed to decode it to a specific facility name.

Part III of this tweetorial: comparing SEER vs NCDB

SEER has greater focus on epidemiology, incidence, mortality, cause of death.
NCDB has greater focus on surveillance, treatment, quality.

The data dictionaries for SEER and NCDB are available online:

seer.cancer.gov/analysis/
facs.org/-/media/files/…

SEER and NCDB have several variables in common.

These common variables inspired our STARS staging system for metastatic cancer.
@IntJCanc @uicc @AJCCancer @NCCN

We developed the system in one database and validated it in the other.

SEER and NCDB also have site specific factors, which provide more detailed information about a particular cancer.

seer.cancer.gov/seerstat/datab…
naaccr.org/SSDI/SSDI-Manu…

The availability of SSFs allows for validation of new staging systems, eg, @AJCCancer 8th vs 7th ed for oropharyngeal cancer, integrating HPV status.
Work from @TedTeknosMD
pubmed.ncbi.nlm.nih.gov/28939068/
#HNCSM

It would be great if SEER and NCDB could next integrate these variables, many of which are already commonly collected at time of consultation with an oncologist.

Part IV: Claims databases for health services research

One of the most popular new claims databases is MarketScan, which includes ICD, CPT, HCPCS codes.

MarketScan covers >80M patients, is not specific to oncology, and includes private insurance (ie, pts < 65 yo).

MarketScan + SEER allow us to estimate the cost of cancer care in the United States

ja.ma/3leArMs via @JAMANetworkOpen @JAMA_current

Here are some other projects you can and cannot do with MarketScan.

One of my favs: classification of common human diseases derived from shared genetic and environmental determinants
@NatureGenet
nature.com/articles/ng.39…

Similarly, TriNetX is a claims database that can be used in oncology.

Thanks to @AVnishKatoch @PennStateCTSI @PennStHershey for the information.

Here is a comparison of NCDB, SEER, SEER Medicare, MarketScan, and TriNetX.

Table adapted from Dan Boffa, @mafacktor work @JAMAOnc:
pubmed.ncbi.nlm.nih.gov/28241198/

Data collection in SEER, NCDB, hospital databases has "classical" formatting. There is basically just 1 time point (at diagnosis) with covariates. There is a variable that provides time until last follow up and vital status. A ton of data are missing.
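That "classical" layout (one row per pt, a follow-up time, and a vital status flag) is exactly what a Kaplan-Meier estimate needs. A minimal from-scratch sketch, assuming made-up follow-up data; the function and variable names are hypothetical, and in practice you would use an established survival analysis package.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time.

    times: follow-up time per pt.
    events: 1 = died at that time, 0 = censored (alive at last follow-up).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group everyone who shares this follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up in months and vital status for 6 pts.
months = [6, 12, 12, 18, 24, 30]
died = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(months, died)
for t, s in curve:
    print(f"{t:>3} mo: S(t) = {s:.3f}")
```

Censored pts still count in the at-risk denominator up to their last follow-up, which is why the vital status variable matters as much as the time variable.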

Claims databases may provide many more time points with data. Soon, we may also be able to integrate text, images, etc.
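By contrast, claims data look like many dated rows per pt. A minimal sketch of turning such rows into per-pt timelines; the patient IDs, dates, and codes are placeholders I made up, and the layout is an assumption, not the actual MarketScan or TriNetX schema.

```python
from collections import defaultdict
from datetime import date

# Hypothetical claim rows: (patient_id, service_date, code).
# The codes are placeholders, not real ICD/CPT/HCPCS codes.
claims = [
    ("pt1", date(2020, 3, 1), "DX_A"),
    ("pt2", date(2020, 1, 15), "DX_B"),
    ("pt1", date(2020, 1, 10), "PROC_X"),
    ("pt1", date(2020, 6, 20), "PROC_Y"),
]


def build_timelines(rows):
    """Group claims by patient and sort each patient's claims by date."""
    timelines = defaultdict(list)
    for patient_id, service_date, code in rows:
        timelines[patient_id].append((service_date, code))
    for events in timelines.values():
        events.sort()
    return dict(timelines)


timelines = build_timelines(claims)
span_days = (timelines["pt1"][-1][0] - timelines["pt1"][0][0]).days
print(timelines["pt1"])
print("pt1 observed over", span_days, "days")
```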

These databases are ideal for analysis with AI/ML.
