Antoine Eripret
Jun 28, 2021 · 15 tweets
⏰ How to reactivate an expired domain fast ⏰

You've just caught an expired domain and want to recover and republish the content stored on archive.org.

How do you do that without wasting too much time when you have a lot of pages?

In this thread, I'll explain everything.
In this thread I'll assume I want to reactivate my own blog.

A fairly unlikely scenario, but it gives us a simple example you can play with too.
Step 1: Query the archive.org API

Using the https://t.co/ORlt4dS4F8 API, we can get a deduplicated list of URLs for a domain.

In my example, that would be web.archive.org/cdx/search/cdx…
Note that by default the output includes HTML content, but also CSS, JS, etc.

Here we only want the HTML pages, so some filtering is needed.
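As a sketch of what such a filtered CDX query can look like, using only the standard library (the exact parameter set here is my assumption, not necessarily the one the author used):

```python
from urllib.parse import urlencode

def cdx_query(domain: str) -> str:
    # Build a CDX API query that returns one row per unique URL,
    # keeping only successful HTML captures (no CSS, JS, images...).
    params = [
        ("url", f"{domain}/*"),
        ("output", "json"),
        ("collapse", "urlkey"),            # deduplicate: one row per URL
        ("filter", "statuscode:200"),      # successful captures only
        ("filter", "mimetype:text/html"),  # HTML pages only
        ("fl", "original,timestamp"),      # only the fields we need
    ]
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)

print(cdx_query("example.com"))
```

Fetching that URL (with requests, curl, or even a browser) returns a JSON array whose first row is the header.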
This list contains the URLs for which archive.org has content.

Using another API, we can get the direct link to the https://t.co/ORlt4dS4F8 snapshot.

For example: archive.org/wayback/availa…
At the end of this step, you should have a table with the original URLs and the corresponding archive.org snapshot URL.

I do it with Python and requests (pypi.org/project/reques…), but do it however you like; what matters is the result.
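A minimal sketch of that table-building step. As an alternative to calling the availability API once per URL, the CDX timestamp already lets you assemble the snapshot link directly (the `id_` flag asks the Wayback Machine for the raw archived payload, without its banner):

```python
def snapshot_url(original: str, timestamp: str) -> str:
    # Direct link to a specific capture; "id_" returns the archived
    # HTML as-is, without the Wayback Machine toolbar injected on top.
    return f"https://web.archive.org/web/{timestamp}id_/{original}"

def build_table(cdx_rows: list[list[str]]) -> dict[str, str]:
    # cdx_rows: the CDX JSON output, whose first row is the header
    header, *rows = cdx_rows
    original_i = header.index("original")
    ts_i = header.index("timestamp")
    return {r[original_i]: snapshot_url(r[original_i], r[ts_i]) for r in rows}

# Shape of the CDX JSON response (illustrative data, not real captures):
rows = [["original", "timestamp"],
        ["http://example.com/post-1", "20210615093012"]]
print(build_table(rows))
```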
Step 2: Identify the content to extract

Using a few sample pages, identify the content you want to extract.

In this example, I want to extract all the HTML content inside the <article> tag whose class contains "post".
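With BeautifulSoup (the library the author mentions later in the thread), that selection is a one-line CSS selector. The sample HTML below is mine; note that keeping `str(article)` rather than `get_text()` is what preserves the full HTML for the next steps:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<html><body>
  <nav class="menu">Home | About</nav>
  <article class="post hentry">
    <h1>My title</h1>
    <p>Body text.</p>
  </article>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# [class*="post"]: any <article> whose class attribute contains "post"
article = soup.select_one('article[class*="post"]')

print(str(article))        # full HTML of the block (what we want to keep)
print(article.get_text())  # text only (not enough for republishing)
```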
Step 3: Extract the content

Using the archive.org snapshot URLs, you can extract the content you're interested in.

You can do it with Screaming Frog or whatever you prefer, but make sure it extracts the HTML content and not just the text.
Step 4: Convert the content into clean HTML

You don't want to keep the original HTML structure. You want clean HTML, without the classes etc. that the original content used.

I use pypi.org/project/markdo… and pypi.org/project/Markdo… to do it at scale.
Step 5: Download the images

The downloaded HTML will include references to images (<img> or <figure> tags).

We need to:

1. Identify them
2. Download them if we can
3. Remove them from the HTML if we can't download them
If you use Python, this logic is very easy to implement with docs.python-requests.org/en/master/ and crummy.com/software/Beaut….

The time you save by automating this is enormous.
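A sketch of that three-step image logic. The downloader is passed in as a function so the sample run below works offline; in practice it would be a `requests.get` wrapper that returns `None` on failure:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def handle_images(html: str, fetch) -> tuple[str, dict]:
    # fetch(src) -> bytes, or None when the image can't be downloaded
    soup = BeautifulSoup(html, "html.parser")
    downloaded = {}
    for img in soup.find_all("img"):          # 1. identify them
        src = img.get("src")
        data = fetch(src) if src else None
        if data is not None:                  # 2. download if we can
            downloaded[src] = data
        else:                                 # 3. otherwise drop the markup
            (img.find_parent("figure") or img).decompose()
    return str(soup), downloaded

html = ('<p>Intro</p>'
        '<figure><img src="/img/ok.png"></figure>'
        '<figure><img src="/img/gone.png"></figure>')
# Fake downloader: only "ok.png" succeeds
out, assets = handle_images(html, lambda u: b"bytes" if "ok" in u else None)
print(out)
```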
Step 6: Upload your images

You'll still have to upload these images to your server and update the HTML with the correct URLs. If you don't, you'll keep making requests to archive.org.

It may work, but it will hurt your page performance (WPO).
Step 7: Update the internal links

By default, all the internal links point to archive.org snapshots.

You have to manipulate your HTML to:
1. Use the correct URLs
2. Remove internal links to content you couldn't extract
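Both fixes can be scripted. Here the domain `example.com` and the `recovered` set are illustrative: the regex strips the Wayback prefix so hrefs point back at the original URLs, and links to pages we couldn't recover are unwrapped (anchor text kept, link removed):

```python
import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

WAYBACK = re.compile(r"https?://web\.archive\.org/web/\d+(?:id_)?/")

def fix_links(html: str, recovered: set[str]) -> str:
    html = WAYBACK.sub("", html)  # 1. back to the original URLs
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a"):
        href = a.get("href", "")
        # 2. internal link to a page we could not extract -> unwrap it
        if href.startswith("http://example.com") and href not in recovered:
            a.unwrap()
    return str(soup)

html = ('<a href="https://web.archive.org/web/20210615093012/'
        'http://example.com/post-1">kept</a> '
        '<a href="http://example.com/lost-page">unwrapped</a>')
print(fix_links(html, {"http://example.com/post-1"}))
```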
Step 8: Publish the content

Once you've done all this work, you can follow what I explained in another thread ().
The process may seem long, but think of all the hours of manual (and boring) work you'll save.

And you'll be able to spend that time on something more interesting than copy-pasting text.

