Blog
Web blog of the DASH project group
Hello and welcome to our blog.
10.06.2016
CEWE Stiftung & Co KGaA invited us to the CEWE Analytics Forum, which took place on 7 June. We prepared a presentation tailored to the audience of board members and department heads, and the project group's results so far were well received.
On 8 and 9 June, a nationwide and interdisciplinary conference for student research took place at the University of Oldenburg. The event is called "forschen@studium". Here students can present their research results from seminar papers, theses or project work. Further information on the conference can be found at uol.de/lehre/qualitaetspakt-lehre/flif/forschen-at-studium/konferenz-fuer-studentische-forschung/.
In order to take part in the conference, all interested participants had to present their research in a one-page abstract. We also wrote an abstract, which was accepted and published together with the other accepted abstracts in the conference proceedings "forschen@studium". Our abstract can be found on page 106 (uol.de/fileadmin/user_upload/flif/Konferenz2016/Broschur_Forschen_A4_Internet.pdf). We are also aiming for publication in the student online journal "forsch!".
For the conference in Oldenburg, we presented our results in the form of a poster as part of a poster session. We were able to inform some interested parties about our project. We also received positive feedback.

Source: Photographed by Simon Becker
16.05.2016
In our ELT process, the PreX (Prepare XML) component was planned after AXT (Agent for XML Transportation) in order to merge the XML files and then transfer the merged files to the parser. However, we have now decided not to implement PreX as a separate component, but to fold its functionality into AXT.
Furthermore, there is still the problem of parsing large amounts of data using Hadoop Streaming and Spark. For this reason, some members of the analysis team will now support the parsing team and drive this work forward.
CEWE Stiftung & Co. KGaA has offered us the opportunity to present our results to date at the CEWE Analytics Forum. We want to accept this offer and are working on a presentation that focuses less on the technical details and more on the results that can be achieved with our work.
At the same time as the intensive development phase, the date for the project group boules tournament has now been announced. We will of course be taking part. To strengthen team spirit in preparation, we have also organised a football match against the DOHA project group.
02.05.2016
It is often the case in software development that once one error has been fixed, the next one occurs. We experienced something similar with our ELT process. Our AXT (Agent for XML Transportation) is now running smoothly because the faulty method has been replaced by another method. However, we now have difficulties parsing the data. We have out-of-memory problems with both the variant with Hadoop Streaming and the alternative with Spark, which means that we can only parse small amounts of data so far.
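The memory pressure comes from building a full document tree for each XML file. A streaming parser keeps memory flat regardless of file size; here is a minimal sketch using Python's `xml.etree.ElementTree.iterparse` (the `<event>` tag is invented for illustration and is not taken from the real CEWE files):

```python
import io
import xml.etree.ElementTree as ET

def count_events(xml_stream, tag="event"):
    """Count <tag> elements while streaming through the file,
    without ever building the full document tree in memory."""
    count = 0
    # iterparse yields each element as soon as its end tag is read
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == tag:
            count += 1
        elem.clear()  # release children that were already processed
    return count

sample = b"<log><event/><event/><other/><event/></log>"
print(count_events(io.BytesIO(sample)))  # 3
```

The same idea carries over to a Spark or Hadoop Streaming job: each task only ever holds a small window of the document in memory.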
To ensure that the other objectives can still be achieved, part of the project group is already familiarising itself with the XML files with regard to data understanding. Initial analyses are even being carried out on the existing data.
In addition, we have now defined test officers in our project group who are responsible for test management. Initially, test scripts for the individual components will be created for this purpose.
From an organisational point of view, we want to reposition ourselves. Until now, we have been meeting every Thursday for the whole day to work intensively on the project as a group or in subgroups. This time together is very important because a lot is discussed and debated. However, organisational matters such as the Weekly Scrum, sprint review and sprint planning often take up 1-2 hours of it. We now want to move this organisational part to another day so that we can be even more productive on Thursdays.
At our second social event, we were at the Barcelona Finca.
18.04.2016
Sometimes in everyday project life, especially in the context of a student-organised group, set goals cannot be achieved despite careful planning and coordination.
For example, our goal of a complete ELT run could not be fulfilled by the set milestone due to problems within AXT (Agent for XML Transportation). Intensive research revealed that the FTP server was sending an error code when transferring larger files. This error was not processed correctly by the method used from the Apache Commons Net library, resulting in a kind of endless loop.
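AXT itself is written against the Java Commons Net library, but the defensive pattern that prevents such an endless loop is language-independent: cap the number of attempts and fail loudly. A minimal sketch in Python (the function and exception names are our own invention, not part of any library):

```python
class TransferError(Exception):
    """Raised when a transfer still fails after all retries."""

def transfer_with_retries(transfer, max_attempts=3):
    """Run `transfer()` until it succeeds, but stop after
    `max_attempts` failures instead of retrying forever."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return transfer()
        except OSError as err:  # e.g. an FTP error reply from the server
            last_error = err
    raise TransferError(f"gave up after {max_attempts} attempts: {last_error}")

# Demo: a transfer that fails twice and then succeeds.
attempts = []
def flaky_transfer():
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError("552 storage allocation exceeded")
    return "done"

print(transfer_with_retries(flaky_transfer))  # done
```

With a bound like this, an unhandled server error code surfaces as an exception instead of silently looping.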
However, to ensure that the subsequent milestones are not affected, the project group is working in parallel on requirements and ideas for initial analyses and hypotheses.
It also became apparent that individual repositories are necessary for the various components when using GIT to manage our data. This is due to the requirements for the individual components and their growth. The result is a new structure in our GIT: AXT (Agent for XML Transportation), Parser, HIMP (Hive Importer; formerly Watchdog), Airflow, Tests, Analytics, Documentation, Miscellaneous.
Furthermore, in future we must get into the habit of first making changes to the source code on a branch. Only when the changes have been successfully tested can they be pushed to the master branch. This ensures that the master branch is always stable and contains an executable version.
04.04.2016
The "Dashboard and visualisation" subgroup has spent the last two weeks evaluating tools for software selection. The decision as to which tool should be used for visualisation and for our dashboard was made in favour of Apache Zeppelin. It fulfils the requirements and is easy to use.
The other subgroups are busy meeting the deadlines for the milestone on 7 April 2016, which includes the complete run-through of the ELT process (see graphic in four steps).

- In the first step, the XML data is transferred from the FTP server to the HDFS using our self-written application AXT (Agent for XML Transportation). There is a tree-like folder structure in the HDFS, subdivided by year, month and day.
- The XML data is then prepared. This means either that several XML files are combined into one XML file and saved in the ready2parse folder in the HDFS, or that the original XML files are copied into the ready2parse folder. This step is carried out with our own tool PreX (Prepare XML).
- The XML files from the ready2parse folder are parsed. The results can then be found for the fourth step in the parsedData folder in HDFS.
- Finally, in the last step, the data is loaded into Hive tables with the help of Watchdog. Watchdog has been customised for the application and does not correspond to a "normal watchdog", as monitoring in the actual sense is not carried out.
The entire workflow is controlled using the workflow management system Airflow.
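As a rough illustration of how the four steps hang together in Airflow, here is a sketch of a DAG definition in the style of the 2016-era Airflow API. All task IDs and commands below are hypothetical placeholders, not our actual component invocations:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# One DAG for the whole ELT run; the commands are placeholders.
dag = DAG("elt_process", start_date=datetime(2016, 4, 7),
          schedule_interval="@daily")

axt = BashOperator(task_id="axt_transfer",
                   bash_command="java -jar axt.jar", dag=dag)
prex = BashOperator(task_id="prex_prepare",
                    bash_command="python prex.py", dag=dag)
parse = BashOperator(task_id="parse_xml",
                     bash_command="python parser.py", dag=dag)
load = BashOperator(task_id="watchdog_load",
                    bash_command="python watchdog.py", dag=dag)

# The four steps run strictly in sequence.
axt.set_downstream(prex)
prex.set_downstream(parse)
parse.set_downstream(load)
```

Airflow then takes care of scheduling, retries and showing which step of a run failed.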
18.03.2016
In addition to the two existing subgroups for the ELT process, two new subgroups have now been created to complete open activities within the PG:
- Dashboard and visualisation subgroup
- Data Mining subgroup
This means that there are currently four subgroups of three people each in the PG.
No suitable application could be identified for loading the XML data into the HDFS, so the decision was made to implement this activity ourselves: "Agent for XML Transportation" (AXT). Unfortunately, it only became apparent during the evaluation of individual functionalities, rather than at an earlier stage, that no existing tool fulfilled the requirements and that we would have to develop our own.
Independently of the implementation, one aim of the PG is to acquire knowledge about the methods and tools used. This became apparent at the end of the seventh sprint, in which we discovered that Confluence offers the option of creating a retrospective; we will use it immediately to improve future sprints.
The current planning includes the milestone 07.04.2016: ELT process run on the alpha version. For this, our self-developed tools such as AXT or the parser must be ready and executable by then. In addition, all components involved must be connected to each other via a workflow management tool.
29.02.2016
The first status meeting in 2016 with CEWE Stiftung & Co KGaA takes place. The current status of the two subgroups and the component selection to date are presented. CEWE Stiftung & Co KGaA informs us of the sizing of the Hadoop landscape for the beta version. It is also planned that the project group will give a presentation at an internal CEWE event, the so-called Analytics Forum, in June.
- The Gobblin tool is still being investigated. Apache Spark is being investigated as an alternative to Hadoop Streaming. The parser can be called from Apache Spark.
- The previous XML files can be parsed with the Python parser. The results are saved in CSV files according to the designed database model. However, the implementation is still static. A dynamic (partial) solution should be available in the beta version at the latest.
- The import of the CSV files into the Hive tables works. We are now checking whether the Apache Oozie workflow tool is suitable for detecting whether new CSV files are available.
- It is planned to set up a logging system. In the event of errors, this will help to determine in which component or step the error occurred.
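The "are new CSV files available?" check boils down to comparing directory listings between runs. A minimal Python sketch of that idea, independent of whether Oozie or one of our own components ends up doing it:

```python
import os
import tempfile

def find_new_files(directory, seen):
    """Return files in `directory` that were not in `seen`,
    and remember them so the next call skips them."""
    current = set(os.listdir(directory))
    new = sorted(current - seen)
    seen |= current
    return new

# Demo: a fresh directory gains one CSV file between two checks.
with tempfile.TemporaryDirectory() as d:
    seen = set()
    print(find_new_files(d, seen))   # []
    open(os.path.join(d, "orders.csv"), "w").close()
    print(find_new_files(d, seen))   # ['orders.csv']
```

A real workflow tool adds scheduling and failure handling on top, but the core trigger condition is this simple.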
Programming guidelines will be created. Although we have already started programming, it is not yet too late for this. The source code produced is still manageable and can be easily revised.
To avoid a pure documentation phase at the end of the project, the results to date are already compiled in a document for the final documentation.
Each team member is beginning to deal with the topics of data mining and visualisation or dashboards so that the upcoming tasks can be understood and distributed accordingly. Two new subgroups will be formed soon. One will deal with the topic of data mining and the other with the topics of visualisation and dashboards.
The project group's first social event took place in the 3Raumwohnung. There we played table football, chatted and simply had fun.
12.02.2016
After about half of the project group time, the first feedback meetings take place. Each team member has an individual meeting with the supervisors to learn their current grade or grade tendency and to raise any problems or particular concerns.
As we realised last time that JIRA was not yet working perfectly for us, we are now spring-cleaning our JIRA system. Some team members are taking on this task. This includes revising the epics (subject areas in JIRA) and renaming the various statuses so that a clear distinction is made between "in test" and "completed". A task is only "completed" when it has been tested.
Hadoop Streaming is being investigated. This makes it possible to distribute Python scripts on the cluster. As an alternative, manual distribution of the scripts is being investigated. The parser itself is being further developed. Hive tables are also being created. The Gobblin tool for extracting and loading data is being analysed. As this tool is still relatively new and under development, there are some problems here.
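Hadoop Streaming imposes no API on the Python side: a mapper is simply a program that reads lines from stdin and writes tab-separated key/value pairs to stdout, and Hadoop distributes it across the cluster. A minimal, invented example mapper:

```python
import io

def map_lines(lines, out):
    """A minimal Hadoop Streaming mapper: emit the first token of
    every input line as the key, with a count of 1 as the value."""
    for line in lines:
        tokens = line.split()
        if tokens:
            out.write(f"{tokens[0]}\t1\n")

# In a real job, Hadoop Streaming feeds the input split via stdin
# and collects the pairs from stdout:
#     map_lines(sys.stdin, sys.stdout)
# Local demo with an in-memory stream:
demo_out = io.StringIO()
map_lines(io.StringIO("error42 disk full\nok started\n"), demo_out)
print(demo_out.getvalue())
```

Such a script would be submitted together with the Hadoop Streaming jar, which pipes each input split through the mapper process.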
A first social event is planned for the end of the month. We will probably do this regularly in the future.
30.01.2016
The two server officers from our group install the Hadoop ecosystem on the servers of the University of Oldenburg for our alpha version.
For extracting and loading the data, the tools Apache Kafka, Apache Flume, Logstash and Fluentd are analysed with regard to their installation and configuration effort as well as their compatibility with other tools. In the meantime, it has become clear that Fluentd and Logstash are not suitable for our project. Data can be transferred from the FTP server to HDFS using Apache Flume; however, there are some problems with the configuration of Apache Flume. Apache Kafka is still under investigation. Some team members have found a new tool called Gobblin and are investigating it. As a backup solution, one team member has also written a script that can easily transfer the XML files from the FTP server to HDFS.
It is decided to use a self-written parser to parse the XML files. The Python programming language is used for this. In addition to programming the parser, a relational data structure is already being designed, which will be used to create Hive tables. It is planned that the parser will create corresponding CSV files that will be read into the Hive tables.
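A much-simplified sketch of the parser idea: flatten XML elements into rows and write them out as CSV, one column per field of the designed table. All element and column names below are invented for illustration; the real CEWE files and our data model differ:

```python
import csv
import io
import xml.etree.ElementTree as ET

def xml_to_csv_rows(xml_text):
    """Flatten each <order> element into one relational row.
    The element names here are placeholders, not the real schema."""
    root = ET.fromstring(xml_text)
    for order in root.iter("order"):
        yield [order.get("id"),
               order.findtext("product"),
               order.findtext("quantity")]

sample = """<orders>
  <order id="1"><product>photo book</product><quantity>2</quantity></order>
  <order id="2"><product>print</product><quantity>30</quantity></order>
</orders>"""

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "product", "quantity"])  # matches the Hive column layout
writer.writerows(xml_to_csv_rows(sample))
print(buf.getvalue())
```

A Hive table with the same three columns can then read such files directly from HDFS.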
Our supervisors have pointed out that Apache Hadoop does not work as efficiently with many small files as it does with larger files. We will therefore try to combine several small files into one large file. We were also given the tip to use Hadoop Streaming and MapReduce to parse the XML files. We are investigating this as well.
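Combining the small files can be as simple as hanging each document's root under one new synthetic root, which keeps every original document intact inside the merged file. A sketch of the idea (the tag names are invented):

```python
import xml.etree.ElementTree as ET

def merge_xml_documents(documents):
    """Combine several XML documents into one by appending each
    document's root element under a new synthetic <merged> root."""
    merged = ET.Element("merged")
    for doc in documents:
        merged.append(ET.fromstring(doc))
    return ET.tostring(merged, encoding="unicode")

docs = ["<log><event/></log>", "<log><event/><event/></log>"]
print(merge_xml_documents(docs))
```

The parser can then walk the merged file and still see each original document as its own subtree.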
We have also noticed that although we want to work according to SCRUM, we have not yet carried out a proper sprint review or retrospective for the sprints to date. Our sprints have a duration of two weeks and we are already in the fourth sprint. We want to change this at the end of the current sprint.
13.01.2016
After the Christmas holidays, we will continue to investigate which of the various components of the Apache Hadoop ecosystem are best suited to our project. We are looking at the following components in detail:
Apache Kafka, Elasticsearch, Logstash, Fluentd, Apache Storm and Apache Flume are being investigated for extracting and loading the data. With Apache Storm, we encountered configuration problems in the Apache ZooKeeper area; the configuration of Apache Storm and Nimbus itself was successful. Elasticsearch is not suitable for our requirements. The tools Logstash, Apache Flume and Fluentd have relatively similar performance characteristics, which is why we will now compare their installation and configuration effort as well as their compatibility with other tools.
The tools Apache Hive, Apache Pig and Apache Spark are analysed for transforming and parsing the data. Apache Spark is analysed with the different programming languages Java, Scala and Python. So far, it has been shown that Apache Pig is unsuitable for our purpose because it does not allow individual attributes to be read from the XML file. Apache Hive and Apache Spark can be used to parse the data for our purposes. However, since Apache Spark is easier to configure and use, we opted for it.
In addition to the component selection, part of our group is currently working intensively on the XML files. The aim is to create documentation for understanding the XML files. At the moment, we do not yet know what each individual element and each attribute stands for or what information is hidden behind them. In order to be able to say this in the future, documentation will be created and discussed with the employees of the practice partner. This is a necessary basis for analysing the data in the future.
In addition to these activities, we have noticed that our current organisation has some potential for improvement. For example, JIRA is not yet being used to its full extent: tasks are not always assigned correctly, and the tasks in the system are generally not kept up to date. Often the person responsible for development also tested the results themselves, so there was no real independent testing. From now on, as soon as a task is completed, it is assigned for testing to another person who is also familiar with its subject matter. This gives the person working on the task more feedback. For the tool to have a truly supportive effect, we need to pay more attention to keeping the tasks up to date and assigning them to the right people. Only then can it have a positive impact on our work results.
Outlook for 2016
After the project group familiarised itself with the topic in 2015, collected the requirements and ended the year with the evaluation of the Apache Hadoop components, the development of an initial prototype will now continue in 2016.
Overall, 2016 is divided into two periods with regard to the project group. The first period runs until April and comprises the development of an initial prototype (alpha version). The second period is from April to September and deals with the development of the second prototype (beta version) as a result of the project group. The main difference between the two versions is that the alpha version runs on servers within the University of Oldenburg and the beta version is integrated into the system environment of CEWE Stiftung & Co. KGaA.
In the alpha version, we still have the opportunity to try out and test everything. We can install all the components and evaluate them from a wide variety of perspectives. And that's exactly what we'll be doing a lot of. In April, our system will then move from the servers of the University of Oldenburg to the servers of CEWE Stiftung & Co KGaA, where the beta version will be developed. The components that we ultimately want to use should therefore be finalised by April, as no components are to be tested in the beta version.
We have planned to start analysing the data in March. We distinguish between two forms of analysis. Firstly, analytics per se and secondly, data mining. By analytics, we mean the pure determination of key figures. By data mining, we mean all analysis topics that are more complex.
However, in order to be able to analyse the data at all, we have to extract the XML data, load it into the Hadoop Distributed File System (HDFS) and transform the data. We started working on this topic in December 2015 and will continue to do so in January and February 2016.
Review of 2015
The project group began in October 2015. The first step was to get to know the other project group participants and the supervising lecturers. We then got to know our contacts at CEWE Stiftung & Co KGaA and were introduced to the DigiFotoMaker (DFM for short). The DFMs are kiosk devices from CEWE Stiftung & Co KGaA, which are set up at its retail partners such as Müller Großhandels Ltd & Co KG. Photo products can be ordered from these devices and printed out directly on site. When a DFM is operated, information on operating behaviour is saved in an XML file in addition to technical data. Our task is to be able to analyse this XML file with the help of Apache Hadoop.
Before we could get down to the actual work on our project, we had to agree on how we wanted to organise ourselves over the next year. We very quickly agreed on SCRUM as the method for software development because we all know it, are familiar with the process and feel comfortable with it. We chose Jira with the associated documentation tool Confluence as a tool for better organisation. We also use GIT for file exchange. Some roles were also allocated, so we now have two project managers, two server officers, a document manager and a website manager.
Once the organisational aspects had been clarified, the requirements were worked out and defined in detail. We identified three core requirements, which are briefly mentioned here. A detailed description can be found under "Project".
- "Near Time" analysis of important key figures
- Alarm system that recognises failures and downtimes
- Evaluation of operating behaviour to optimise the ordering software
The next step was to evaluate which components of the Apache Hadoop ecosystem we would use. To do this, the PG split into two subgroups. The first subgroup analysed components for interface extraction and loading data into the Hadoop Distributed File System (HDFS). The second subgroup focussed on components for transforming the XML files. The classic ELT process was thus split up. On the last date before the Christmas holidays, the individual subgroups presented the interim status of their results.
10.12.2015
The first step in Hadoop is to decide on the components we need. Because Hadoop is very comprehensive and has many different but similar components, we need to test and analyse some of them in more detail. A scientific approach is particularly important here in order to be able to make a decision. To this end, the project group has split into two subgroups. The first subgroup deals with components for interface extraction and data loading. The second subgroup deals with components for transformation strategies. The ELT process was thus divided.
Once we have selected the components, we want to move on to the next phase. This means that we will then deal with the topic of analytics. This includes data mining and machine learning.
The plan is to coordinate the final selection of the components to be installed with CEWE Stiftung & Co KGaA. Although we will initially install everything internally at the university, we plan to move the architecture to CEWE Stiftung & Co. KGaA's system landscape in spring 2016.
17.11.2015
After the DASH PG had spent several weeks analysing possible use/business cases in detail, the specific goals were agreed with the CEWE Stiftung & Co. KGaA supervisors. The following objectives were defined:
- Alarm system: This term caused some confusion, which is why a precise definition was urgently needed in our glossary. It does not refer to a pure ticketing service that issues a message when a warning or alarm is present. By this we also mean the logic that is required to recognise a warning or alarm in the first place.
- Operational dashboard: This is primarily used to give the PG access to the specific XML data. The dashboard provides an overview of some key figures, such as products ordered per day.
- Clickstream analysis: By clickstream, we mean the anonymised log of clicks made in the ordering software on the Digi-Foto-Maker. By analysing this data, the user-friendliness of Digi-Foto-Maker can be improved. It is also possible to recognise whether certain paths always lead to cancellations or whether certain combinations are made frequently.
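A toy example of what such a clickstream analysis could look like: count which click paths end in a cancellation. The session data and click names below are invented for illustration:

```python
from collections import Counter

def cancellation_paths(sessions):
    """Count the click paths of sessions that end in 'cancel'.
    Each session is the ordered list of clicks a user made."""
    counter = Counter()
    for clicks in sessions:
        if clicks and clicks[-1] == "cancel":
            counter[tuple(clicks)] += 1
    return counter

sessions = [
    ["start", "select_product", "cancel"],
    ["start", "select_product", "cancel"],
    ["start", "select_product", "pay", "finish"],
]
print(cancellation_paths(sessions).most_common(1))
# [(('start', 'select_product', 'cancel'), 2)]
```

Paths that appear frequently in this counter are candidates for usability improvements in the ordering software.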
Our work will focus more on the development of an ETL process and data management in Apache Hadoop. For CEWE Stiftung & Co. KGaA, it is less important which data exactly we analyse or how we prepare the results visually.
A workshop in cooperation with CEWE Stiftung & Co. KGaA is planned for January to help us better understand the XML data. We will also be given access to a Digi-Foto-Maker, where we can create some test cases ourselves. CEWE Stiftung & Co. KGaA has also invited us to give a presentation at the so-called "CEWE Analytics Forum". This is an internal CEWE Stiftung & Co KGaA event at which our project would even be presented to the Board of Management.
25.10.2015
In addition to answering the question "Where is the journey going?", we have already started to think about how we would actually like to get there within the next year. To this end, the PG has agreed on a number of organisational points. Reaching these agreements was not always straightforward, and there was a lot of back-and-forth discussion.
We agreed very quickly on the Scrum approach, oriented towards CRISP-DM. In both cases, we will tailor the procedure to our individual needs. With regard to Scrum, we have currently agreed on two-week sprints. Our weekly meeting counts as a Weekly Scrum instead of a Daily Scrum.
As a further organisational point, we have decided to use Jira and Confluence. Jira is used for task management and Confluence for documentation. We are also setting up a GIT, although there is still some confusion as to what this is good for when we also have Confluence. After all, we don't need to document in both systems at the same time.
To break things up a little, we organised a photo shoot. You can see the photos under "The team". This will give you an idea of who PG DASH and "we" actually are. A photo shoot will follow in January, where we will also take a picture with the supervisors from both the University of Oldenburg and CEWE Stiftung & Co KGaA.
After some back and forth and many creative phases, we have now also decided on our PG name "DASH" (Data Analytics with Hadoop) and our logo.
18.10.2015
In the first few weeks, we are in the process of gathering the exact requirements. This proves to be not so easy. At the beginning, we only received a rough target. Otherwise, our client CEWE Stiftung & Co KGaA gives us a lot of freedom to work.
In order to determine where the journey should take us, we will specify concrete goals. To do this, we first considered possible use cases. We have also thought about the requirements from both a technical and a user perspective.
We are planning a meeting with CEWE Stiftung & Co KGaA next week. We want to present our results so far and discuss with our clients which use cases from our longlist should be realised if possible. In addition, some questions have arisen during our deliberations so far that we still need to have answered.
In addition to the requirements, everyone is still writing their seminar paper. We will be presenting these to each other soon. The results will also be available from December under "Seminar papers".
13.10.2015
In the second week, we visited the company CEWE Stiftung & Co KGaA. Here we were introduced to our contacts and the photo stations, also known as Digi-Foto-Makers (DFM for short). The DFMs are the kiosk devices from CEWE Stiftung & Co. KGaA, which are set up at retail partners such as Müller. These machines can be used to place photo orders or print out photo products immediately on site. When operating a DFM, the operating behaviour and technical data of the DFM, such as the temperature of the printer, are saved in an XML file.
Our task for the coming year is to be able to analyse these XML files with the help of Apache Hadoop. In order to realise this, we first need to familiarise ourselves with Apache Hadoop and set up such a system. In addition to this introduction to the topic, CEWE Stiftung & Co. KGaA has already given us initial requirements and possible analysis topics to implement in the course of the project: a "near time" analysis, an alarm system to detect failures and downtimes, and a detailed evaluation of the operating behaviour to further optimise the ordering software.
04.10.2015
The first project group meeting was held internally at the university with the PG participants and the supervising lecturers Andreas Solsbach and Viktor Dmitriyev. Everyone briefly introduced themselves and organisational matters were clarified.
For the second meeting, we then tried to find a common weekly jour fixe date and created a Doodle survey for this purpose. However, it is very difficult to agree on a time with 12 people plus 2 lecturers, and a room at the university also has to be free at that time. Unfortunately, we couldn't avoid some participants having to compromise. Because all of us students also work part-time, we had to see which times remained when nobody had to work. We then also took into account our lecturers' wish that the meeting take place sometime between 8am and 4pm. In the end, we were able to come to an agreement. Nevertheless, a few students will now be unable to attend other lectures or even tutorials.
01.10.2015
We are the PG DASH at the University of Oldenburg.
Our project group (PG) is called DASH - Data AnalyticS with Hadoop, because we will be working with Apache Hadoop for a year. We are working together with the company CEWE Stiftung & Co KGaA, which is known for its photo services, among other things. You can find out more about this in the "The project" section.