Tuesday 25 November 2014

Dealing with RDDs on Spark

I have uploaded to GitHub some sample code covering the most common Apache Spark transformations and actions on RDDs. It also includes examples of reading and writing files (text, sequence), a word count function and a simple PageRank implementation.

Download Repository

The code is written in Java and built with Maven. I have been following Holden Karau's book "Learning Spark: Lightning-Fast Big Data Analysis", which gives very useful advice about the Spark engine.


Tuesday 5 February 2013

Export Microsoft Office Graphics to EPS Format for LaTeX

Most of us use MS PowerPoint, Word or Excel for making presentations and technical reports. From time to time, we need to include some of those graphics and charts in our publications, which are generally written in LaTeX.

Over the last few months I have been trying to find an easy way to convert MS Office graphics into a LaTeX-compatible format. After experimenting with several options found on the web I had not managed to solve the issue, mainly because most of them require either licensed software such as Adobe Professional or some unreliable PowerPoint-to-LaTeX converter. The latter also does not produce real vector graphics but a vectorized version of a bitmap, which in the end does not offer optimal quality for publication.

The following method is easy, does not require licensed software and produces great vector graphics directly in EPS format in just a few steps. The procedure can be done in MS Word or PowerPoint (2010) and exploits the rarely used 'Print to File' printing option.

A printer driver that supports PostScript Options must be installed on your computer. This will be clarified below. If you do not have a printer, this is not a problem: you can, for instance, install the driver for the HP Color LaserJet CM1312 Printer (which is the one I have) by following this link. (Choose the right operating system and download either the full driver or the Windows PostScript driver.) Once it is downloaded, open the installer and follow the installation process as a Local Printer.

Converting from MS Office to EPS Format:


  • Create a new Word file. File->New->Blank Document
  • Select and copy the graphic object to be converted from its original file (Word or PowerPoint) and paste it into the empty Word file.

  • While the object is selected in the new file, click on Wrap Text in the Picture Tools Format tab and then select In Front of Text so that the object can be moved freely.

  • Resize the object to the size required in your LaTeX document. The file size depends directly on this, which matters because .eps files are generally large. I recommend placing the object at the top left corner of the Word document. (This is not required, but it is very helpful; the reason will become clear at the end of the process.)

  • Go to File->Print and click on the printer drop down menu. Notice the Print to File option and tick it.

  • Then select the desired printer and click on Printer Properties below the drop down menu.
  • Look for the Advanced options tab in the new window and click on it.
  • In the Advanced options tab, under Document Options->PostScript Options->PostScript Output Option, choose Encapsulated PostScript (EPS) and then click OK. Finally, click OK again to exit the Printer Properties.

  • Press Print.
  • Select the file location in the Print to File window and type the file name. Notice that the Save as Type is not necessarily EPS, so type the full file name including its extension within quotes (""), for example: "figure1.eps".

  • The EPS file should look like this when viewed in GhostView:

  • Go to your LaTeX editor and locate the figure float where the graphic will be inserted. It should look similar to this:
\begin{figure*}
     \centering
     \includegraphics[width=1.0\textwidth]{figure1.eps}
     \caption{My first real EPS figure.}
     \label{fig:eps-figure}
\end{figure*}
  • Add the trim and clip options to the \includegraphics command, replacing bottom and right with numeric values that depend on your object size:
\includegraphics[trim = 0cm bottomcm rightcm 0cm, clip, width=1.0\textwidth]{figure1.eps}
  • Adjust the bottom and right values of the trim option so that the graphic object fits properly in your LaTeX document once its sides are trimmed. Compile the .tex file and preview the output. (Repeat this step if necessary.)
Figure output before and after trimming.

The generated EPS file is actually formatted with the page size of the MS Word document (e.g. A4, Letter). Notice that I have not changed any of these settings in the word processor. This is because, at least on my PC, the PostScript conversion did not take the sheet size into account and always produced the same output, so I opted for a very easy solution: trimming the output directly in LaTeX to fit the EPS graphic into the figure. Additionally, I did not modify the default MS Office page orientation (Portrait, Landscape) because it made no difference either, so I suggest sticking with the defaults to start with.
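As a rough starting point for the trim values, the two non-zero amounts can be estimated from the page and object dimensions. This is only a sketch under assumptions that are not part of the procedure above: that the EPS bounding box spans the full sheet, that the sheet is A4 (21.0 cm x 29.7 cm), and that a hypothetical object of 9 cm x 6 cm sits flush against the top left corner.

```latex
\begin{align*}
  \text{right}  &= w_{\text{page}} - w_{\text{object}}
                 = 21.0\,\text{cm} - 9.0\,\text{cm} = 12.0\,\text{cm},\\
  \text{bottom} &= h_{\text{page}} - h_{\text{object}}
                 = 29.7\,\text{cm} - 6.0\,\text{cm} = 23.7\,\text{cm}.
\end{align*}
```

Any gap between the object and the page corner shifts these values, so they will most likely still need fine-tuning by compiling and previewing.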

Observe that in the trim option of the \includegraphics command the left and top values are zero, while the other two must be adjusted manually depending on the dimensions of your graphic. This is why I suggested positioning the object at the top left corner in Word: to avoid having to adjust all four parameters.
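Putting the pieces together, a minimal complete document might look like the sketch below. The trim values are purely illustrative (they assume a hypothetical 9 cm x 6 cm object at the top left corner of an A4 sheet) and figure1.eps is the file name used as an example earlier; adjust both to your own case.

```latex
% Minimal sketch: the trim values below are illustrative assumptions.
\documentclass{article}
\usepackage{graphicx} % provides \includegraphics with the trim and clip options

\begin{document}

\begin{figure*}
  \centering
  % trim takes four lengths in the order: left bottom right top.
  % Left and top stay at zero because the object sits at the top left corner.
  % 23.7cm (bottom) and 12cm (right) are hypothetical starting values.
  % clip is needed so that the trimmed area is actually hidden.
  \includegraphics[trim = 0cm 23.7cm 12cm 0cm, clip, width=1.0\textwidth]{figure1.eps}
  \caption{My first real EPS figure.}
  \label{fig:eps-figure}
\end{figure*}

\end{document}
```

EPS files are handled directly when compiling with latex followed by dvips. With pdflatex the file has to be converted to PDF first, which recent TeX distributions typically do automatically through epstopdf.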

This method has been tested only a few times and there is surely room for improvement. Please let me know about your experience using it and about any other approaches you have used to tackle the problem.