Sitemaps and Meta Tags May Substitute for RSS
Digital publishers today provide machine-readable data to make sure their sites work well with large platforms like Google, Facebook, Slack, and Twitter. Machine-readable data intended for independent platforms, such as RSS, is often omitted, but the information needed to populate a feed is still there in other formats.
Step by step, let’s investigate how we could build an RSS feed for a site that doesn’t provide one, using only the machine-readable data it already publishes.
Starting from a robots.txt at a well-known URI:
User-agent: *
Disallow: /*/r/
Sitemap: https://www3.nhk.or.jp/news/sitemap-news-index.xml
Sitemap: https://www3.nhk.or.jp/nhkworld/sitemap.xml
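As a minimal sketch of this first step, the `Sitemap:` lines can be pulled out of a robots.txt body with a few lines of Python. The function name `sitemap_urls` is made up for illustration, and the text is embedded as a string rather than fetched over the network.

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    """Return the URLs named in `Sitemap:` lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value" on the first
        # colon only, because the URL itself contains colons.
        field, sep, value = line.split("#", 1)[0].partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


ROBOTS_TXT = """\
User-agent: *
Disallow: /*/r/
Sitemap: https://www3.nhk.or.jp/news/sitemap-news-index.xml
Sitemap: https://www3.nhk.or.jp/nhkworld/sitemap.xml
"""

for url in sitemap_urls(ROBOTS_TXT):
    print(url)
```

Python 3.8 and later also ship `urllib.robotparser.RobotFileParser.site_maps()`, which returns the same list after the file has been read or parsed.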
From /nhkworld/sitemap.xml, sitemaps for sub-sections of the site can be found:
<sitemapindex>
<sitemap>
<loc>
https://www3.nhk.or.jp/nhkworld/sitemap_alternate.xml
</loc>
</sitemap>
<sitemap>
<loc>https://www3.nhk.or.jp/nhkworld/sitemap_app.xml</loc>
</sitemap>
<sitemap>
<loc>
https://www3.nhk.or.jp/nhkworld/data/en/news/sitemap.xml
</loc>
</sitemap>
<sitemap>
<loc>
https://www3.nhk.or.jp/nhkworld/data/en/news/videos/sitemap.xml
</loc>
</sitemap>
<sitemap>
<loc>
https://www3.nhk.or.jp/nhkworld/en/news/sitemap_reports.xml
</loc>
</sitemap>
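Reading a sitemap index like this one could look roughly like the sketch below. It assumes the standard library's ElementTree, uses a shortened copy of the excerpt with a closing `</sitemapindex>` added so that it parses, and the helper names `local_name` and `child_sitemaps` are invented here.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix so the code works with or without xmlns."""
    return tag.rsplit("}", 1)[-1]


def child_sitemaps(index_xml: str) -> list[str]:
    """Return the <loc> URL of every <sitemap> entry in a sitemap index."""
    root = ET.fromstring(index_xml)
    urls = []
    for entry in root:                      # each <sitemap> element
        for child in entry:
            if local_name(child.tag) == "loc" and child.text:
                # The URL may be surrounded by newlines, as in the excerpt.
                urls.append(child.text.strip())
    return urls


INDEX_XML = """
<sitemapindex>
  <sitemap>
    <loc>
      https://www3.nhk.or.jp/nhkworld/sitemap_alternate.xml
    </loc>
  </sitemap>
  <sitemap>
    <loc>https://www3.nhk.or.jp/nhkworld/sitemap_app.xml</loc>
  </sitemap>
</sitemapindex>
"""

for url in child_sitemaps(INDEX_XML):
    print(url)
```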
From /nhkworld/data/en/news/sitemap.xml, links to articles:
<urlset>
<url>
<loc>
https://www3.nhk.or.jp/nhkworld/en/news/20220520_37/
</loc>
<priority>1.0</priority>
</url>
<url>
<loc>
https://www3.nhk.or.jp/nhkworld/en/news/20220520_35/
</loc>
<priority>1.0</priority>
</url>
<url>
<loc>
https://www3.nhk.or.jp/nhkworld/en/news/20220520_39/
</loc>
<priority>1.0</priority>
</url>
<url>
<loc>
https://www3.nhk.or.jp/nhkworld/en/news/20220520_40/
</loc>
<priority>1.0</priority>
</url>
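Going from a top-level sitemap down to article links amounts to a small recursive walk: follow `<sitemapindex>` entries until a `<urlset>` turns up. The sketch below is one way to write that, not a finished crawler. The `fetch` parameter is a hypothetical callable that returns the XML text for a URL, and here it is backed by an in-memory dictionary of trimmed sample documents instead of real HTTP requests.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def article_urls(sitemap_url: str, fetch: Callable[[str], str],
                 seen: Optional[set] = None) -> Iterator[str]:
    """Yield page URLs reachable from a sitemap, following nested indexes."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:                 # guard against index loops
        return
    seen.add(sitemap_url)
    root = ET.fromstring(fetch(sitemap_url))
    locs = [child.text.strip()
            for entry in root for child in entry
            if local_name(child.tag) == "loc" and child.text]
    if local_name(root.tag) == "sitemapindex":
        for child_url in locs:              # entries point at more sitemaps
            yield from article_urls(child_url, fetch, seen)
    else:                                   # a <urlset>: entries are pages
        yield from locs


# Canned responses standing in for the network (trimmed from the excerpts).
PAGES = {
    "https://www3.nhk.or.jp/nhkworld/sitemap.xml": """
        <sitemapindex><sitemap><loc>
        https://www3.nhk.or.jp/nhkworld/data/en/news/sitemap.xml
        </loc></sitemap></sitemapindex>""",
    "https://www3.nhk.or.jp/nhkworld/data/en/news/sitemap.xml": """
        <urlset>
        <url><loc>
        https://www3.nhk.or.jp/nhkworld/en/news/20220520_37/
        </loc><priority>1.0</priority></url>
        <url><loc>
        https://www3.nhk.or.jp/nhkworld/en/news/20220520_35/
        </loc><priority>1.0</priority></url>
        </urlset>""",
}

FOUND = list(article_urls("https://www3.nhk.or.jp/nhkworld/sitemap.xml",
                          PAGES.__getitem__))
print(FOUND)
```

Passing `fetch` in as an argument keeps the walk testable offline; a real version would swap in an HTTP client and add error handling.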
In the articles, <meta> tags intended for use in social media embeddings provide useful information about the articles:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width">
<meta name="ROBOTS" content="NOODP, NOARCHIVE, INDEX, FOLLOW">
<meta name="format-detection" content="telephone=no">
<meta name="msapplication-config" content="/nhkworld/common/notification/browserconfig.xml">
<meta name="twitter:card" content="summary_large_image">
<meta property="og:image" content="https://www3.nhk.or.jp/nhkworld/upld/thumbnails/en/news/20220520_37_1116717_L.png">
<meta name="twitter:site" content="@NHKWORLD_News">
<meta name="keywords" content="News, Japan, Asia, World, Nuclear, Biz, Tech, NHK, Japan Broadcasting Corporation, Public Broadcaster, NHKWORLD, NHK WORLD, NHK WORLD PREMIUM, NHK WORLD TV, Radio Japan, Japan">
<meta name="description" content="The Japanese government has issued a fresh recommendation on the use of face masks. It says people need not wear a face mask when out of doors even if there is not a great distance between themselves and others. The advice is based on the assumption that little to no conversation is taking place.">
<meta property="fb:app_id" content="1612260969082183">
<meta property="og:type" content="article">
<meta property="og:site_name" content="NHK WORLD">
<meta property="og:title" content="Govt.: No need to wear masks outdoors if people have little conversation | NHK WORLD-JAPAN News">
<meta property="og:description" content="The Japanese government has issued a fresh recommendation on the use of face masks. It says people need not wear a face mask when out of doors even if there is not a great distance between themselves and others. The advice is based on the assumption that little to no conversation is taking place.">
<meta property="og:url" content="https://www3.nhk.or.jp/nhkworld/en/news/20220520_37/">
<title>Govt.: No need to wear masks outdoors if people have little conversation | NHK WORLD-JAPAN News</title>
<meta name="mobile-web-app-capable" content="yes">
<meta name="theme-color" content="#D8D8D8">
<meta name="application-name" content="NHK WORLD News">
<meta name="apple-mobile-web-app-title" content="News">
<meta name="apple-mobile-web-app-capable" content="no">
<meta name="apple-mobile-web-app-status-bar-style" content="default">
<meta name="msapplication-tap-highlight" content="no">
<meta name="msapplication-TileColor" content="#D8D8D8">
<meta name="msapplication-navbutton-color" content="#D8D8D8">
<meta name="msapplication-TileImage" content="/nhkworld/common/site_images/nw_webapp_news_144x144.png">
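Collecting these tags does not need a full HTML library. The following sketch uses `html.parser` from the standard library. The class name `MetaCollector` is an invention for this example, and the sample page is a shortened copy of the head shown above.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags into a dict keyed by `property` or `name`."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key and content is not None:
            # Keep the first value if a key appears more than once.
            self.meta.setdefault(key, content)


HEAD = """
<head>
<meta charset="utf-8">
<meta name="twitter:card" content="summary_large_image">
<meta property="og:image" content="https://www3.nhk.or.jp/nhkworld/upld/thumbnails/en/news/20220520_37_1116717_L.png">
<meta property="og:title" content="Govt.: No need to wear masks outdoors if people have little conversation | NHK WORLD-JAPAN News">
<meta property="og:url" content="https://www3.nhk.or.jp/nhkworld/en/news/20220520_37/">
</head>
"""

collector = MetaCollector()
collector.feed(HEAD)
print(collector.meta["og:title"])
print(collector.meta["og:url"])
```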
The same data might also be available as JSON inside a <script type="application/ld+json"> tag:
{
"@context": "http://schema.org",
"@type": "NewsArticle",
"mainEntityOfPage": "https://www3.nhk.or.jp/nhkworld/en/news/20220520_37/",
"headline": "Govt.: No need to wear masks outdoors if people have little conversation",
"articleBody": "The Japanese government has issued a fresh recommendation on the use of face masks. It says people need not wear a face mask when out of doors even if there is not a great distance between themselves and others. The advice is based on the assumption that little to no conversation is taking place.\n\nHealth minister Goto Shigeyuki said on Friday the government\u0027s stance on the need for a face mask to prevent coronavirus infection remains fundamentally unchanged.\n\nGoto said people do not need to wear a mask when speaking with others outdoors if there is a distance of at least two meters between the speakers. In the case of indoor conversation, a mask is recommended whatever the distance.\n\nHe went on to say that an indoor space with excellent ventilation might allow for maskless conversation at a distance of two meters or more.\n\nFor children between the age of two and elementary-school age, the government is reinstating its earlier policy of not having a blanket expectation that a mask should be worn. There is no such expectation for infants.\n\nThe government continues to call on people to be sure to wear a mask when visiting elderly people or anyone in hospital.",
"datePublished": "2022-05-20T22:35:00 JST+0900",
"dateModified": "2022-05-20T22:40:54 JST+0900",
"image": {
"@type": "ImageObject",
"url": "https://www3.nhk.or.jp/nhkworld/upld/thumbnails/en/news/20220520_37_1116717_L.png",
"width": 720,
"height": 405
},
"author": {
"@type": "Organization",
"name": "NHK WORLD",
"url": "https://www3.nhk.or.jp/nhkworld/"
},
"publisher": {
"@type": "Organization",
"name": "NHK WORLD",
"url": "https://www3.nhk.or.jp/nhkworld/",
"logo": {
"@type": "ImageObject",
"url": "https://www3.nhk.or.jp/nhkworld/common/site_images/nw_logo_422x60.png",
"width": 422,
"height": 60
}
}
}
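Pulling the JSON out of a page and making sense of its dates might look like the sketch below. Note that the `datePublished` value above is not plain ISO 8601 (it carries a literal "JST" before the offset), so the date format string here is an assumption that fits this one sample and may not hold for other sites. `JsonLdCollector` and `parse_nhk_date` are names made up for the example, and the page is a trimmed stand-in for a real article.

```python
import json
from datetime import datetime
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of application/ld+json script tags."""

    def __init__(self):
        super().__init__()
        self.documents = []
        self._buffer = None                 # list of text chunks while inside

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.documents.append(json.loads("".join(self._buffer)))
            self._buffer = None


def parse_nhk_date(value: str) -> datetime:
    """Parse dates shaped like '2022-05-20T22:35:00 JST+0900' (assumed format)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S JST%z")


PAGE = """
<html><head>
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "NewsArticle",
  "mainEntityOfPage": "https://www3.nhk.or.jp/nhkworld/en/news/20220520_37/",
  "headline": "Govt.: No need to wear masks outdoors if people have little conversation",
  "datePublished": "2022-05-20T22:35:00 JST+0900"
}
</script>
</head></html>
"""

collector = JsonLdCollector()
collector.feed(PAGE)
article = collector.documents[0]
published = parse_nhk_date(article["datePublished"])
print(article["headline"], published.isoformat())
```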
Taken together, the sitemaps and the meta tags provide more than enough information to fill out an RSS feed.
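To make that concrete, here is a rough sketch that turns the values collected so far into a minimal RSS 2.0 document. The function `rss_feed` and the dictionary layout are illustrative choices, and the channel description is a placeholder string since nothing in the source data supplies one.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def rss_feed(channel: dict, items: list) -> str:
    """Serialize a channel and its items as a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    node = ET.SubElement(rss, "channel")
    for tag in ("title", "link", "description"):    # required by RSS 2.0
        ET.SubElement(node, tag).text = channel[tag]
    for item in items:
        entry = ET.SubElement(node, "item")
        ET.SubElement(entry, "title").text = item["title"]
        ET.SubElement(entry, "link").text = item["link"]
        ET.SubElement(entry, "guid").text = item["link"]  # URL doubles as id
        ET.SubElement(entry, "description").text = item["description"]
        # RSS expects RFC 822 style dates, which format_datetime produces.
        ET.SubElement(entry, "pubDate").text = format_datetime(item["published"])
    return ET.tostring(rss, encoding="unicode")


JST = timezone(timedelta(hours=9))

FEED_XML = rss_feed(
    {
        "title": "NHK WORLD",                        # from og:site_name
        "link": "https://www3.nhk.or.jp/nhkworld/",  # from the JSON publisher
        "description": "Feed assembled from sitemap and meta tag data",  # placeholder
    },
    [
        {
            "title": "Govt.: No need to wear masks outdoors if people have little conversation",
            "link": "https://www3.nhk.or.jp/nhkworld/en/news/20220520_37/",
            "description": "The Japanese government has issued a fresh recommendation on the use of face masks.",
            "published": datetime(2022, 5, 20, 22, 35, tzinfo=JST),
        }
    ],
)
print(FEED_XML)
```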
The information available in sitemaps varies widely, even on the same site. The first sitemap listed in the robots.txt leads to another <sitemapindex>, which leads to a sitemap that declares the namespace xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" and uses it for news-specific tags:
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
<url>
<loc>https://www3.nhk.or.jp/news/html/20220521/k10013617081000.html</loc>
<news:news>
<news:publication>
<news:name>NHK NEWS WEB</news:name>
<news:language>ja</news:language>
</news:publication>
<news:publication_date>2022-05-21T00:23+09:00</news:publication_date>
<news:title>【随時更新】ロシア ウクライナに軍事侵攻(21日の動き)</news:title>
</news:news>
</url>
<url>
<loc>https://www3.nhk.or.jp/news/html/20220520/k10013636261000.html</loc>
<news:news>
<news:publication>
<news:name>NHK NEWS WEB</news:name>
<news:language>ja</news:language>
</news:publication>
<news:publication_date>2022-05-20T23:20+09:00</news:publication_date>
<news:title>韓国 新首相の任命同意案可決も 統一地方選へ与野党攻防激化か</news:title>
</news:news>
</url>
<url>
<loc>https://www3.nhk.or.jp/news/html/20220520/k10013636251000.html</loc>
<news:news>
<news:publication>
<news:name>NHK NEWS WEB</news:name>
<news:language>ja</news:language>
</news:publication>
<news:publication_date>2022-05-20T21:48+09:00</news:publication_date>
<news:title>バイデン大統領 韓国の半導体工場視察 “供給網強化で連携を”</news:title>
</news:news>
</url>
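Because this sitemap declares namespaces, reading it with ElementTree needs a prefix map. The sketch below shows one way to do it. The function name `news_entries` is invented, the sample is the first entry from the excerpt with a closing `</urlset>` added, and entries with missing fields would need extra handling that is left out here.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Prefixes are local to this script; only the namespace URIs must match.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}


def news_entries(sitemap_xml: str):
    """Yield link, title and publication date for each news sitemap entry."""
    root = ET.fromstring(sitemap_xml)
    for url in root.findall("sm:url", NS):
        date_text = url.findtext("news:news/news:publication_date", namespaces=NS)
        yield {
            "link": url.findtext("sm:loc", namespaces=NS).strip(),
            "title": url.findtext("news:news/news:title", namespaces=NS),
            # The sample dates are ISO 8601 with an explicit UTC offset.
            "published": datetime.fromisoformat(date_text),
        }


NEWS_XML = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
<url>
<loc>https://www3.nhk.or.jp/news/html/20220521/k10013617081000.html</loc>
<news:news>
<news:publication>
<news:name>NHK NEWS WEB</news:name>
<news:language>ja</news:language>
</news:publication>
<news:publication_date>2022-05-21T00:23+09:00</news:publication_date>
<news:title>【随時更新】ロシア ウクライナに軍事侵攻(21日の動き)</news:title>
</news:news>
</url>
</urlset>
"""

ENTRIES = list(news_entries(NEWS_XML))
for entry in ENTRIES:
    print(entry["published"].isoformat(), entry["title"], entry["link"])
```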
The title and publication date are there in the sitemap, but it otherwise isn’t nearly as complete as what you would get from the meta tags or the JSON.
Combining sitemaps and meta tags, all the information needed for an RSS feed is there, but sitemaps can be large and every site is different. A human would need some way of specifying which items are of interest.
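One possible shape for that specification, offered purely as an assumption about how it might work, is a small rule that names a sitemap and a regular expression that item URLs must match. The `FeedRule` class and its fields are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class FeedRule:
    """A hand-written rule: which sitemap to read and which URLs to keep."""
    sitemap: str    # sitemap to start from
    include: str    # regular expression an item URL must match

    def select(self, urls):
        """Return only the URLs that match the include pattern."""
        pattern = re.compile(self.include)
        return [url for url in urls if pattern.search(url)]


RULE = FeedRule(
    sitemap="https://www3.nhk.or.jp/nhkworld/data/en/news/sitemap.xml",
    include=r"/nhkworld/en/news/\d{8}_\d+/$",
)

CANDIDATES = [
    "https://www3.nhk.or.jp/nhkworld/en/news/20220520_37/",
    "https://www3.nhk.or.jp/news/html/20220521/k10013617081000.html",
]

SELECTED = RULE.select(CANDIDATES)
print(SELECTED)
```

A pattern on the URL is only the simplest filter; rules could just as well test titles, dates, or the `og:type` value once the article has been fetched.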