What does the French open data ecosystem look like? Here is a graph made by the Data Publica team that shows the links and relationships between all the French websites that talk about open data.
Our project aims to build a subgraph of the web consisting of the French websites that mention open data. This graph lets viewers see which websites are popular and how they are connected, and which kinds of entities communicate with one another (companies, non-profits/blogs, government agencies). It is a good way to discover the actors of French open data and how they relate to one another. The graph can be seen on this page. The nodes are manually grouped by category: type (Companies, Non-profits/Blogs, Government agencies) and role (Open-Data Speaker, Open-Data Dealer), which led to two different graphs, accessible via the menu. (An analysis of the results is available in French on this page.)
As its entry to the Common Crawl Code Contest, Data Publica set out to map the French open data ecosystem. This map aims to identify the open data actors on the French web and to show their size and the relationships between them.
Common Crawl (http://commoncrawl.org/) is a non-profit organization that has the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible.
Common Crawl data are hosted on Amazon S3, which makes them easy to process with Map/Reduce. The data are updated frequently.
So far, Common Crawl covers about 20% of the World Wide Web. This large coverage allows our team to work on the first, biggest layer of the web, which contains its most accessible or most used parts. Some actors may still be missing from our map (and we apologize in advance).
To process the Common Crawl archives, our team used Map/Reduce jobs built on the Hadoop framework (running on Amazon Web Services). Common Crawl did not allow us to fetch only a subset (such as the French web), so we had to browse all of its archive data to collect the data we used to create our graph.
However, we only kept in our graph the visited websites that met these criteria:
- The website mainly talks about open data
- The website is available in French
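Each website s was given two scores, calculated as follows:
- $S_{opendata} = \sum p_{opendata} / \mathrm{Card}(s)$
- $S_{fr} = \sum p_{fr} / \mathrm{Card}(s)$
$p_{opendata}$: the website's pages about open data
$p_{fr}$: the website's pages in French
$\mathrm{Card}(s)$: the number of pages of the website s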
If the results of the two formulas above were both higher than the threshold calculated from a sample of test websites, the website was kept for building the graph.
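As an illustration, the scoring and filtering step could be sketched in Python as follows. This is only a minimal sketch, not the project's actual Hadoop code. The page records, the function names and the threshold values are hypothetical, and the way a page is flagged as "about open data" or "in French" is assumed to be decided elsewhere. It also assumes one threshold per score and a strict comparison.

```python
from collections import defaultdict
from urllib.parse import urlparse

def map_page(url, about_open_data, in_french):
    """Map step: emit (domain, (open-data flag, French flag, page count))."""
    domain = urlparse(url).netloc.lower()
    return domain, (int(about_open_data), int(in_french), 1)

def reduce_by_domain(pairs):
    """Reduce step: sum the flags and the page counts for each domain."""
    totals = defaultdict(lambda: [0, 0, 0])
    for domain, (open_data, french, count) in pairs:
        totals[domain][0] += open_data
        totals[domain][1] += french
        totals[domain][2] += count
    return totals

def select_sites(totals, min_opendata, min_fr):
    """Keep the domains whose two scores both exceed their threshold."""
    kept = {}
    for domain, (open_data, french, count) in totals.items():
        s_opendata = open_data / count  # share of pages about open data
        s_fr = french / count           # share of pages in French
        if s_opendata > min_opendata and s_fr > min_fr:
            kept[domain] = (s_opendata, s_fr)
    return kept

# Hypothetical records: (url, is about open data, is in French).
pages = [
    ("http://example-opendata.fr/a", True, True),
    ("http://example-opendata.fr/b", True, True),
    ("http://example-opendata.fr/c", False, True),
    ("http://example-news.com/x", False, False),
    ("http://example-news.com/y", True, False),
]

totals = reduce_by_domain(map_page(*page) for page in pages)
# Made-up thresholds; the real ones were derived from test websites.
kept = select_sites(totals, min_opendata=0.5, min_fr=0.5)
print(sorted(kept))
```

With these made-up records, only example-opendata.fr passes both thresholds: two of its three pages are about open data and all three are in French.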
The outgoing links of each website were then checked to see where they pointed.
From these incoming and outgoing links, we were able to create two files that define the structure of our graph:
- A domains file (i.e. the graph's nodes)
- A file of links between domains (i.e. the graph's links)
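To make the two files more concrete, here is a minimal Python sketch of how page-level links could be turned into a list of domains and a list of links between domains. The function names, the column names and the sample URLs are hypothetical. Ignoring links inside a single domain and merging duplicate links are assumptions of this sketch, not confirmed choices of the project.

```python
import csv
import io
from urllib.parse import urlparse

def build_graph(kept_domains, page_links):
    """Turn page-to-page links into domain nodes and domain-to-domain links."""
    ids = {domain: i for i, domain in enumerate(sorted(kept_domains))}
    edges = set()
    for source_url, target_url in page_links:
        src = urlparse(source_url).netloc.lower()
        dst = urlparse(target_url).netloc.lower()
        # Keep only links between two different retained domains.
        if src in ids and dst in ids and src != dst:
            edges.add((ids[src], ids[dst]))
    return ids, sorted(edges)

def write_csv(rows, header):
    """Serialize rows as CSV text (a real run would write to a file)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()

kept = {"a.example.fr", "b.example.fr"}
links = [
    ("http://a.example.fr/post", "http://b.example.fr/data"),
    ("http://a.example.fr/post", "http://a.example.fr/about"),      # internal link
    ("http://b.example.fr/data", "http://elsewhere.example.com/"),  # target not kept
    ("http://a.example.fr/other", "http://b.example.fr/"),          # duplicate link
]

ids, edges = build_graph(kept, links)
nodes_csv = write_csv([(i, domain) for domain, i in ids.items()], ["id", "domain"])
edges_csv = write_csv(edges, ["source", "target"])
print(nodes_csv)
print(edges_csv)
```

Here the four page-level links collapse into a single link from a.example.fr to b.example.fr.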
We then had to categorize the nodes in order to turn the map into an analytical tool. To do that we chose two different axes, which resulted in two different graphs:
1. Actor typology
A. Companies
B. Non-profit organizations (blogs, associations...)
C. State/Public authorities
2. Actor involvement
A. Informing about open data
B. Dealing in open data
Each website was then categorized by hand before the data were loaded into Gephi (https://gephi.org/), where they were laid out using the Force Atlas algorithm.
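The hand-made categories have to be attached to the nodes before the import. A minimal Python sketch of that step is shown below. The category labels, the sample domains and the function names are hypothetical, and the sketch assumes each website gets exactly one type and one role. The "Id" and "Label" column names follow the usual convention of Gephi's spreadsheet import for node tables; the other two columns are simply extra attributes.

```python
import csv
import io

TYPES = {"Company", "Non-profit", "State/Public"}
ROLES = {"Speaker", "Dealer"}

def annotate_nodes(domains, manual_categories):
    """Attach the hand-assigned type and role to every domain."""
    rows = []
    for domain in sorted(domains):
        if domain not in manual_categories:
            raise KeyError(f"{domain} has not been categorized yet")
        kind, role = manual_categories[domain]
        if kind not in TYPES or role not in ROLES:
            raise ValueError(f"unknown category for {domain}: {kind}, {role}")
        rows.append({"Id": domain, "Label": domain, "type": kind, "role": role})
    return rows

def to_csv(rows):
    """Serialize the annotated nodes as a CSV node table."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["Id", "Label", "type", "role"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Hypothetical hand-made categorization.
categories = {
    "blog.example.fr": ("Non-profit", "Speaker"),
    "startup.example.fr": ("Company", "Dealer"),
}

rows = annotate_nodes(categories.keys(), categories)
table = to_csv(rows)
print(table)
```

Raising an error on an uncategorized or misspelled entry is a deliberate choice in this sketch: with manual labelling, a missing or mistyped category is easier to catch here than once the graph is drawn.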
A more detailed analysis is available in French here.
Representatives of civil society have built the main network promoting open data in France. The graph also highlights the leading position of several actors such as La Fing (Fondation Internet Nouvelle Génération) and the very active blog Internetactu.
We also noticed the wide variety of French companies operating in the open data field, although they are quite isolated from one another. It looks like each one has built its own network with non-profit or State/public websites.
The graph shows how young open data is in France: it was born in civil society and carried by bloggers and transparency activists, but it is also promoted by public and private initiatives.
The whole project is available as open source on Data Publica's GitHub: https://github.com/datapublica-company/opendata-graph
A webpage dedicated to the project is available here: http://french-opendata.data-publica.com.
Thanks to Pierrick Boitel, Perrine Letellier, and Amine Mouhoub for their work and contribution to the project.
This work is made available under the terms of the Creative Commons Attribution 3.0 France license.