Formatting of Exported Messages using Third Party App

Started by casstk, December 15, 2015, 04:14:31 PM

casstk

I have been testing the SQL Lite Data tool and exporting the messages in various formats. Thanks to some of Wilson's tips the messages are exporting, but they always export as HTML code. Is there an easy way to convert them into a readable format? I tried exporting them to HTML and opening them in Chrome/Firefox, but the HTML markup remained. I tried various other formats (CSV, text, etc.) with the same result. Of course, since the messages are originally stored in HTML format, this makes sense.

sample: http://www.awesomescreenshot.com/image/839115/65d7cf37e6b737a1caec3e5290b033a7

I've tried "Show as Digest" and "Print" from within PGOffline, but for larger mailing lists that is proving problematic.

Wilson Logan

Dude... JFGI... :)

http://www.w3.org/Tools/html2things.html

Cheers,

Wilson.

casstk

Excellent question - I need to be able to archive the messages in a format that (a) I can still access a few years down the road and (b) is end-user readable: RTF, PDF, etc. - basically a universal format. I have used the print-to-PDF function built into PGOffline, but for large mailing lists it takes too long to print the pages. Since I am a very basic end user, my ability to work with some of the SQL tools is limited. Thanks to your tutorial I was able to get SQL Lite working, but now I am stumped by the HTML formatting of the text.

edited: I meant to say that I know how to convert the HTML portion of the messages. What I need is a way to export the full page, which includes the columns and the HTML-formatted messages. Here is an example of what I get when I run the "exported page" through an HTML ---> text converter.


id number date subject content discussion_group
157606 1 7/14/2011 7:31 PM New file uploaded to ChamberMusicandMemories  <div id="ygrps-yiv-891255341">Hello,<br/> <br/> This email message is a notification to let you know that<br/> a file has been uploaded to the Files area of the ChamberMusicandMemories <br/> group.<br/> <br/>   File        : /Registration & Hotel Information.pdf <br/>   Uploaded by : sheliak54 &lt;<a rel="nofollow" target="_blank" href="mailto:sheliak54@...">sheliak54@...</a>&gt; <br/>   Description : Registration and Hotel Information <br/> <br/> You can access this file at the URL:<br/> <a rel="nofollow" target="_blank" href="http://groups.yahoo.com/group/ChamberMusicandMemories/files/Registration%20%26%20Hotel%20Information.pdf">http://groups.yahoo.com/group/ChamberMusicandMemories/files/Registration%20%26%20Hotel%20Information.pdf</a> <br/> <br/> To learn more about file sharing for your group, please visit:<br/> <a rel="nofollow" target="_blank" href="http://help.yahoo.com/l/us/yahoo/groups/original/members/web/index.html">http://help.yahoo.com/l/us/yahoo/groups/original/members/web/index.html</a><br/> Regards,<br/> <br/>

Wilson Logan

That's why I suggested RTF. It should resolve all that.

You could always use Digest mode & then copy & paste (see attached).

You can vary the number of messages per page.  I set mine to 5000 messages per page without too much drama.

Given that it only occupied 360k and I have a 6GB machine, I reckon I could have comfortably increased that to 10000 a page.

Your mileage may vary...

Cheers,

Wilson.

Wilson Logan

OK, I tried 50000.  That was a leap too far. It got to 1.2GB memory used and failed.

So for my machine at any rate I would say the limit is about 40000 per page.

Cheers,

Wilson.

casstk

This is why I thought RTF would be the format to use when exporting from SQL Lite. I'll make a separate post with my workflow because I suspect it may clarify things.

The copy and paste method works, but on my machine I can only get up to 2000 messages per page before PGOffline crashes. I have a 16 GB machine and at 2000 messages per page it used around 400kb.


casstk

SQL Lite export

1. Open the pg4 file (I exported just one mailing list to limit the size) by selecting "Data Export"
2. Destination format - Other - RTF
3. Add table or view - Group Message
4. Fields - I only need a few: "id, number, date, subject, and content"
5. Export.

The text in the "content" field is still HTML code. If I open the RTF in Word, Firefox, Notepad, etc., the HTML code remains.
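One way around this is to read the database directly and strip the markup from the "content" field before writing anything out. The following is only a rough Python sketch using the standard library. The table name `group_message` is an assumption (the export tool shows it as "Group Message", and the real name inside the pg4 file may differ), and `html_to_text` and `export_messages` are made-up helper names, not part of PGOffline.

```python
import sqlite3
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text, turning <br> and closing </p>/</div> into newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &lt; becomes <, etc.
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("p", "div"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the readable text of an HTML fragment, one line per <br>."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Collapse runs of spaces inside each line, keep the line breaks.
    lines = [" ".join(line.split()) for line in raw.splitlines()]
    return "\n".join(lines).strip()


def export_messages(conn, table="group_message"):
    """Yield (id, number, date, subject, plain_text) for every message.

    `table` is a hypothetical name; check the actual schema first.
    """
    query = (
        "SELECT id, number, date, subject, content "
        f'FROM "{table}" ORDER BY number'
    )
    for msg_id, number, date, subject, content in conn.execute(query):
        yield msg_id, number, date, subject, html_to_text(content or "")
```

Usage would be something like `conn = sqlite3.connect("mygroup.pg4")` followed by a loop over `export_messages(conn)`, writing each row to a text file. Work on a copy of the pg4 file, not the original.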

I really am a novice when it comes to formatting and databases.  I know just enough to push a button. Especially if it is red.

Wilson Logan

That's funny... I did 20000 with no issues on an 8gb machine.

I can see I am going to have to implement a bomb proof method of doing this.

Watch this space...

Wilson Logan

Right... how about this:


What I envisage is a function called from the context menu of the group list pane.

So you right click on a group e.g. Hobbicast_coffee_lounge and there is a new option "Export messages to file". 

This starts a pop-up window with the following selections:


----------------------------------------------------------------------------------------------------------------------------------------------------


Export messages by

0       Date range                       (DD MM YYYY)  to  (DD MM YYYY)

0      Message number range     (nnnnnnn)  to  (nnnnnnn)

0      All messages



Group messages by

0      Author

0      Subject

0      Topic ID

0      No grouping



Max Number of messages per file   (nnnnnnn)


Folder to save to (____________________________)  Browse


File Prefix  (______________________________)



                       <OK>             < CANCEL>


--------------------------------------------------------------------------------------------------------------------------------------------------

   

BTW   "0"   means a radio button i.e. choose one of these options from the group of options.



So you might choose to export messages 1 to 1000 and then to group by Author. So for each author within those 1000 records, a file is created containing that author's messages.

e.g. if you choose  to save to folder  C:/Temp/

then choose a file prefix of  "HCL"   

then the files that are created are: 

   C:/Temp/HCL-<author-1>.txt


   C:/Temp/HCL-<author-2>.txt


   C:/Temp/HCL-<author-3>.txt


... etc.

where author-1 is the first author and author-2 is the second author etc.

e.g.


   C:/Temp/HCL-evildrome_boozerama.txt


   C:/Temp/HCL-Richard_Spelling.txt
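To make the grouping idea concrete, here is a small Python sketch of the "message number range, then group by Author" step. It is not how PGOffline does or will do it. The message dictionaries and their keys (`number`, `author`, `subject`, `topic_id`) are assumptions for illustration only.

```python
from collections import defaultdict


def group_messages(messages, key, first=None, last=None):
    """Group messages by `key`, keeping only numbers in [first, last].

    `messages` is an iterable of dicts with a "number" entry plus
    whatever field `key` names (e.g. "author", "subject", "topic_id").
    Returns a dict mapping each key value to its list of messages,
    in the order the messages were supplied.
    """
    groups = defaultdict(list)
    for msg in messages:
        if first is not None and msg["number"] < first:
            continue
        if last is not None and msg["number"] > last:
            continue
        groups[msg[key]].append(msg)
    return dict(groups)
```

Each entry of the returned dictionary would then become one output file, named from the prefix and the group key.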

 

If the user chooses  Topic ID instead of Author then the files that are created are: 


   C:/Temp/HCL-<Topic-ID-1>.txt


   C:/Temp/HCL-<Topic-ID-2>.txt


   C:/Temp/HCL-<Topic-ID-3>.txt


e.g.


   C:/Temp/HCL-11263.txt


   C:/Temp/HCL-11264.txt





If the user chooses  Subject instead of Author then the files that are created are: 


   C:/Temp/HCL-<Subject-1>.txt


   C:/Temp/HCL-<Subject-2>.txt


   C:/Temp/HCL-<Subject-3>.txt


I think subject should be the first 20 characters of the subject e.g.   if subject in PGO is   "Casting Tesla Turbine end plates"  then the  file would be :

   C:/Temp/HCL-Casting_Tesla_Turbin.txt


I guess there's also an issue with non-allowable characters in the Subject line like "/" and "@" and "*". I think we should just replace any non-allowed character with "_".


That is valid also for non-allowed characters in the Author name.
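The naming rule above could look roughly like this in Python. Which characters count as "allowed" is an assumption here (letters, digits, "_" and "-"), and `safe_name` and `export_path` are hypothetical helper names.

```python
import re


def safe_name(text, max_len=None):
    """Truncate to max_len characters, then replace anything that is
    not a letter, digit, underscore or hyphen with "_"."""
    if max_len is not None:
        text = text[:max_len]
    return re.sub(r"[^A-Za-z0-9_-]", "_", text)


def export_path(folder, prefix, key, max_len=None):
    """Build <folder>/<prefix>-<key>.txt with a sanitised key."""
    return f"{folder.rstrip('/')}/{prefix}-{safe_name(str(key), max_len)}.txt"


# Subject grouping uses the first 20 characters of the subject.
print(export_path("C:/Temp/", "HCL", "Casting Tesla Turbine end plates", max_len=20))
# prints C:/Temp/HCL-Casting_Tesla_Turbin.txt
```

One thing to watch: two different subjects that share the same first 20 characters end up with the same file name, so their messages would need to go into one file or get a numeric suffix.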

 
Cheers,

Wilson.

casstk

I really like the options of sorting the exported messages by author, subject etc. For most of the mailing lists, I'd want the default to be sort by date (as they show up in your Viewer), but for a few mailing lists where people posted multiple posts in sequence, sorting by subject would be great.

One more novice question - since the HTML coding is built into the "content" tables, when you export to text, will the HTML coding still be there? The goal is to export into (a) a human-readable format that is (b) universal (text, Word, PDF, RTF, etc.). Now since .txt files cannot "interpret" HTML, wouldn't I end up with something very much like what I had when using SQLite?

casstk

I passed along your thoughts to a friend who works with databases. This is what he suggested:

".... export in tab-delimited format, not comma-delimited.  He already suggests in that latest blog post, that he pre-process disallowed characters into underscores.  If that included underscore replacement of any tab characters in the data (as opposed to tab markup codes which are downstream useful), then the resulting txt files would load cleanly".  I am not certain what this all means, but it may be helpful.
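For what it's worth, the suggestion boils down to something like the Python sketch below: put a tab between the fields, and replace any tab that occurs inside a field with an underscore so it cannot be mistaken for a column break. Flattening line breaks inside a field to spaces is an extra assumption added here to keep one message per line; it was not part of the suggestion. The function names are made up.

```python
def clean_field(value):
    """Make a value safe for one tab-delimited cell."""
    text = str(value)
    text = text.replace("\t", "_")  # tabs in the data become underscores
    # Assumption: flatten line breaks so each record stays on one line.
    text = text.replace("\r\n", " ").replace("\n", " ").replace("\r", " ")
    return text


def to_tsv(rows, header):
    """Return tab-delimited text: a header line, then one line per row."""
    lines = ["\t".join(header)]
    for row in rows:
        lines.append("\t".join(clean_field(v) for v in row))
    return "\n".join(lines) + "\n"
```

A file written this way should open in a spreadsheet with each field in its own column, because the only tabs left in it are the ones between fields.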


Wilson Logan

I have suggested to Anton that we use the View as Digest engine to generate the text file.

So if you look at what comes out of "View as Digest" (minus the Digest heading), that's what you'll get in a text file.

e.g.


Group: Hobbicast_coffee_lounge Message: 11255 From: mwbeaty2000 Date: 06/01/2013
Subject: Re: Fwd: Spam
Looks like they got you Dan

Mike B.



Group: Hobbicast_coffee_lounge Message: 11256 From: Dan Brewer Date: 06/01/2013
Subject: Re: Fwd: Spam
Scanned computer. Changed passwords.

Sent from my iPhone



Group: Hobbicast_coffee_lounge Message: 11257 From: mwbeaty2000 Date: 08/01/2013
Subject: Re: Fwd: Spam
Ain't it fun. They got me a couple of months ago

Mike B.



Group: Hobbicast_coffee_lounge Message: 11258 From: mwbeaty2000 Date: 08/01/2013
Subject: 
Did you hear she was going to have neck surgery? No date set yet, the last i heard

Mike B.
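As an illustration only, the layout above can be reproduced with a few lines of Python. This is not the actual "View as Digest" engine. The field names and the three blank lines between messages are assumptions read off the sample.

```python
def format_entry(group, number, author, date, subject, body):
    """Format one message as a header line, a subject line and the body."""
    header = f"Group: {group} Message: {number} From: {author} Date: {date}"
    return f"{header}\nSubject: {subject}\n{body.strip()}"


def format_digest(group, messages):
    """Join formatted messages, separated by three blank lines.

    `messages` is a list of dicts with the hypothetical keys
    "number", "author", "date", "subject" and "body" (plain text).
    """
    entries = [
        format_entry(group, m["number"], m["author"], m["date"],
                     m["subject"], m["body"])
        for m in messages
    ]
    return "\n\n\n\n".join(entries) + "\n"
```

The body passed in has to be plain text already, so any HTML in the stored message needs to be converted before this step.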



Cheers,

Wilson.

