[CITASA] seeking suggestions about research methodology for online forums
Christine Morton
christine at christinemorton.com
Thu Feb 25 21:34:05 CST 2010
Wow. Thank you for the question, Tim, and for sharing the outline of your
research. It sounds very interesting and relevant. Would you be interested
in sharing more details with me off-list? And thanks, Tom, for your thoughtful
and thorough reply. I've learned a lot.
Regards,
Christine
On 2/25/10 6:54 PM, "Thomas M. Lento" <thomas.lento at gmail.com> wrote:
> There are all kinds of supervised and semi-supervised content analysis
> approaches that might be applicable - their effectiveness depends on the
> nature of your data, the level of detail you require, and your ability to
> design training data. For example, if you have a superset of health
> information topics and keywords associated with them you could probably do a
> simple keyword analysis to give you an idea of which posts are talking about
> which topic. The advantage is that it's pretty easy to do even if the data is
> stored as text blobs in your MySQL database. However, if you need to do
> serious contextual analysis or you don't already know the scope and range of
> possible health topics and keywords that's a more difficult problem. You'll
> probably need to try some different machine learning approaches and see what
> works best for your needs. You can do literature searches for papers published
> at ICWSM, KDD, or WWW for examples of some approaches. There are other
> conferences that will be applicable, but those are good places to start.
> Follow the citation trail from the relevant pieces and you should get some
> idea of where to look for useful references.
>
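
A minimal sketch of the simple keyword analysis described above, assuming the post bodies have already been pulled out of the database as plain strings. The topic names, the keyword sets, and the function names are hypothetical placeholders, not a validated coding scheme.

```python
import re
from collections import defaultdict

# Hypothetical topic -> keyword map; replace with your own coding scheme.
TOPIC_KEYWORDS = {
    "diabetes": {"insulin", "glucose", "diabetes"},
    "nutrition": {"diet", "calories", "vitamin"},
}


def tag_topics(text, topic_keywords=TOPIC_KEYWORDS):
    """Return the sorted list of topics whose keywords appear in the text."""
    # Lowercase and split into word tokens so matching is case-insensitive.
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(t for t, kws in topic_keywords.items() if words & kws)


def count_topics(posts, topic_keywords=TOPIC_KEYWORDS):
    """Count how many posts mention each topic; posts is an iterable of strings."""
    counts = defaultdict(int)
    for text in posts:
        for topic in tag_topics(text, topic_keywords):
            counts[topic] += 1
    return dict(counts)
```

This only matches whole words exactly, so plurals, misspellings, and multi-word phrases would need extra handling.
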
> Network analysis of these things can be pretty straightforward,
> moderately complicated, or totally impossible. It all depends on how your data
> is structured and what type of information is available. If you've got a
> standard relational database structure designed to drive the content in the
> online community then at best you're in for some work to reformat the data
> tables into something useful for actual analysis. What you hope for are tables
> with various combinations of userid, postid, time stamp, user_data, and
> post_data. You'll probably need to join several tables to get the actual data
> files you need, and although MySQL is not optimized for joins it should still
> be manageable.
>
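
One way the joined rows might be turned into network data, as a sketch only. It assumes each post row carries a thread identifier (the `threadid` column is an assumption, not something named above) and treats the earliest post in a thread as the thread starter, so every later post by someone else counts as a reply to that user.

```python
from collections import Counter


def reply_edges(posts):
    """Build weighted (replier, thread_starter) edges.

    posts: iterable of (postid, userid, threadid, timestamp) tuples.
    threadid is an assumed column; timestamps only need to be sortable.
    """
    posts = sorted(posts, key=lambda p: p[3])  # order by timestamp
    starter = {}       # threadid -> userid of the first post in the thread
    edges = Counter()  # (replier, starter) -> number of replies
    for postid, userid, threadid, ts in posts:
        if threadid not in starter:
            starter[threadid] = userid
        elif userid != starter[threadid]:
            # Skip self-replies by the thread starter.
            edges[(userid, starter[threadid])] += 1
    return edges
```

The resulting edge list can be written out as a three-column file (source, target, weight). If the forum software records which post each reply answers, that column would give a more accurate network than this thread-starter shortcut.
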
> Assuming you're new to this type of analysis, my advice is to spend a lot of
> time familiarizing yourself with the underlying data. Do some basic
> distributions and see how much noise you've got in the system, find out the
> most efficient routes to generating the output you need, and discover what
> information is available in which table and how those tables are keyed and
> indexed. Make sure you're on the lookout for garbage data - a lot of the great
> data you get from various online sources is basically bad, either because of
> system errors (rare and usually easy to find) or "bad" users (common and not
> so easy to find - there's a whole literature on spam detection algorithms out
> there). You'll need to make decisions about your error tolerance and what
> types of behaviors you wish to ignore, and you can only do that effectively if
> you understand your data.
>
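
A small sketch of the kind of basic distribution check described above: how many users wrote how many posts, and which accounts post unusually often. The input is assumed to be the `userid` column of the post table read into a list, and the threshold is an arbitrary choice to tune against your own data.

```python
from collections import Counter


def activity_distribution(userids):
    """Map 'number of posts' -> 'number of users with that many posts'."""
    posts_per_user = Counter(userids)
    return Counter(posts_per_user.values())


def heavy_posters(userids, threshold):
    """Return users with at least `threshold` posts, for manual inspection."""
    posts_per_user = Counter(userids)
    return sorted(u for u, n in posts_per_user.items() if n >= threshold)
```

A very long tail in the distribution is normal for forums; accounts far out in that tail are simply candidates to inspect by hand, not automatically spam.
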
> I don't know how much experience you have with database queries, but assuming
> you can only handle moderately complex queries my advice is to use the
> database as a source for your final dataset and then conduct your analysis in
> some other tool. My typical approach to this situation is to write database
> queries that produce flat text files with one row per observation (typically
> per user, or per user/time_period combination, but this obviously depends on
> your research question) with one column for each metric. Then I load the data
> into R or Stata or whatever else and build models. You can do a fair amount of
> work in MySQL, but this is typically slower and more difficult than exporting
> and using an actual statistical package.
>
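
The flat-file layout described above (one row per user, one column per metric) might look like the following sketch. The input tuples and the metric names (`n_posts`, `n_threads`, `mean_words`) are illustrative assumptions; in practice much of the aggregation could equally be done in the database query itself.

```python
import csv
from collections import defaultdict


def user_metrics(posts):
    """Aggregate (userid, threadid, n_words) post rows into one row per user."""
    n_posts = defaultdict(int)
    n_words = defaultdict(int)
    threads = defaultdict(set)
    for userid, threadid, words in posts:
        n_posts[userid] += 1
        n_words[userid] += words
        threads[userid].add(threadid)
    return [
        {
            "userid": u,
            "n_posts": n_posts[u],
            "n_threads": len(threads[u]),
            "mean_words": n_words[u] / n_posts[u],
        }
        for u in sorted(n_posts)
    ]


def write_flat_file(rows, handle):
    """Write one tab-separated line per observation, with a header row."""
    fields = ["userid", "n_posts", "n_threads", "mean_words"]
    writer = csv.DictWriter(
        handle, fieldnames=fields, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
```

Pass `write_flat_file` an open file handle (opened with `newline=""`); the resulting tab-separated file can be read by R, Stata, or most other statistical packages.
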
> The drawback of exporting data is that you're limited to whatever your stats
> package can hold in memory. If your data set is large (hundreds of thousands
> or millions of observations) then you need a fair amount of memory to run any
> kind of complex model. If you're dealing with 10s or 100s of millions of
> observations in your model then things get really interesting - I suggest
> sampling, but there are other more difficult options.
>
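
For the sampling suggestion, one possible approach is reservoir sampling, which draws a uniform random sample of fixed size from rows that are streamed one at a time, so the full data set never has to fit in memory. This is a sketch of a standard technique that I am substituting for illustration, not necessarily the sampling scheme intended above; the function name and the `seed` parameter are my own.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return a uniform random sample of up to k items from an iterable."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample
```

Whether to sample posts, users, or threads depends on the unit of analysis. Sampling individual posts will break up the reply structure needed for network measures, so sampling whole threads or users may be the safer choice there.
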
> If you don't know MySQL at all, you need to learn it. There are a plethora of
> books on MySQL out there - I like O'Reilly for reference and SAMS for
> instruction, so if you get something from one of those publishers you should
> be ok. You will also want to learn how to do some scripting in Python or Perl.
> For the rank beginner, I recommend going with Python and learning by working
> through the chapters and exercises in How To Think Like a Computer Scientist
> (free online at http://www.greenteapress.com/thinkpython/thinkCSpy/html/ ). If
> you know how to program in general then Dive Into Python
> (http://diveintopython.org) is your best bet. I'm not a Perl guy, but I'm
> sure someone can point you to resources.
>
> For the actual network analysis, I'd first look into NodeXL since it's easy to
> use, and if that doesn't meet your needs I'd go with something like igraph
> (see http://igraph.sourceforge.net/ ), which works with both R and Python.
>
> Best of luck.
>
> -Tom
>
> On Thu, Feb 25, 2010 at 6:14 PM, Tim Hale <timhale at uab.edu> wrote:
>> Hi everyone,
>>
>> I am working with others on a project that will examine health communication
>> among members of an online community who post to an online forum. We have
>> three primary goals: (1) to conduct a content analysis to understand the
>> types of health information that is communicated; (2) to identify the context of
>> the discussions, including understanding the characteristics of the
>> individuals who initiate and disseminate health information; and (3) to
>> conduct a social network analysis to examine the larger structures of health
>> information sharing among community members.
>>
>> Although this type of research could be conducted by manually collecting
>> posts from the online forum, coding them for content, and creating a data
>> set for social network analysis, we are interested in other approaches that
>> make better use of the forum database files. We have the cooperation of the
>> website owner and administrator to access the MySQL database.
>>
>> I am seeking advice from anyone with experience working on similar research
>> questions involving online forums and especially, making use of the original
>> forum database files. All recommendations, suggestions, and pointers to
>> articles, books, and appropriate tools are welcome and greatly appreciated.
>>
>> Thank you,
>> Tim Hale
>>
>> ------------------------------------------------------------
>> Timothy M. Hale, MA
>> University of Alabama at Birmingham
>> Department of Sociology
>> Heritage Hall 460E
>> 1401 University Boulevard
>> Birmingham, AL 35294-1152
>> 205.222.8108 (cell)
>> timhale at uab.edu
>>
>>
>> _______________________________________________
>> CITASA mailing list
>> CITASA at list.citasa.org
>> http://list.citasa.org/mailman/listinfo/citasa_list.citasa.org
CMQCC: Transforming Maternity Care
Christine H. Morton, PhD
Program Manager/Research Sociologist
California Maternal Quality Care Collaborative
Stanford University p. 650-725-6108 f. 650-721-5751
Medical School Office Building d. 650-721-2187 c. 650-995-4550
251 Campus Drive
Palo Alto, CA 94305-5415
cmorton at stanford.edu www.cmqcc.org